From: MRAB
Date: Sat, 18 Aug 2012 19:59:32 +0100
Newsgroups: comp.lang.python
Subject: Re: How do I display unicode value stored in a string variable using ord()

On 18/08/2012 19:26, Paul Rubin wrote:
> Steven D'Aprano writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
>> using two code points. This is fragile and doesn't work very well,
>> because string-handling methods can break the surrogate pairs apart,
>> leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
>
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.
> UTF-8 is more efficient in Latin-based alphabets and UTF-16 may be more
> efficient for some other languages. I think even UCS-4 doesn't completely
> fix the surrogate pair issue if it means the only thing I can think of.
>
On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (two 16-bit code units, each counted as a separate element of the
string). On a wide build, every codepoint can be represented directly,
without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of a surrogate pair, leaving
a lone surrogate that is not a valid character on its own.
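
To make the slicing problem concrete, here is a rough sketch. It assumes
Python 3.3 or later, where str no longer uses surrogate pairs, so the
narrow-build storage is only emulated by converting the string to a list of
UTF-16 code units. The helper names utf16_units and units_to_text are made
up for this illustration and are not part of any library.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text, roughly as a narrow build stored them."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def units_to_text(units):
    """Decode a list of UTF-16 code units back to str; fails on a lone surrogate."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le")

# 'a', then U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), then 'b'
s = "a\U0001D11Eb"
units = utf16_units(s)

print(len(s))                      # codepoints, as counted by Python 3.3+
print(len(units))                  # 16-bit code units, one more than the codepoints
print([hex(u) for u in units])     # the middle two units form the surrogate pair

# A slice that looks harmless by index, but cuts the pair in half.
broken = units[:2]
try:
    units_to_text(broken)
    split_ok = True
except UnicodeDecodeError:
    split_ok = False               # the high surrogate has lost its partner

print(split_ok)
```

On an actual narrow build the same thing happened silently: s[:2] simply
returned a string ending in a lone high surrogate, and the error only
showed up later, for example when encoding the result.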