Date: Fri, 17 Aug 2012 23:30:22 -0400
From: Dave Angel
To: Ian Kelly
Cc: Python
Subject: Re: How do I display unicode value stored in a string variable using ord()
References: <253ddd61-4bb5-4f46-b58c-525e55b27558@googlegroups.com> <502EAFB2.7050405@davea.name>
Newsgroups: comp.lang.python

On 08/17/2012 08:21 PM, Ian Kelly wrote:
> On Aug 17, 2012 2:58 PM, "Dave Angel" wrote:
>> The internal coding described in PEP 393 has nothing to do with latin-1
>> encoding.
> It certainly does. PEP 393 provides for Unicode strings to be represented
> internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
> sufficient to contain the data. I understand the complaint to be that while
> the change is great for strings that happen to fit in Latin-1, it is less
> efficient than previous versions for strings that do not.

That's not the way I interpreted PEP 393.
It takes a pure Unicode string, finds the largest code point in that
string, and chooses 1, 2 or 4 bytes for every character, based on how
many bits it'd take for that largest code point. Further, I read it to
mean that only 00 bytes would be dropped in the process; no other bytes
would be changed. I take it as a coincidence that it happens to match
latin-1; that's the way Unicode happened historically, and is not
Python's fault. Am I reading it wrong?

I also figure this is going to be more space efficient than Python 3.2
for any string which had a max code point of 65535 or less (in Windows),
or 4 billion or less (in real systems). So unless French has code points
over 64k, I can't figure that anything is lost.

I have no idea about the times involved, so I wanted a more specific
complaint.

> I don't know how much merit there is to this claim. It would seem to me
> that even in non-western locales, most strings are likely to be Latin-1 or
> even ASCII, e.g. class and attribute and function names.
>
>

The jmfauth rant I was responding to was saying that French isn't
efficiently encoded, and that the performance of some vague operations
was somehow reduced severalfold. I was just trying to get him to be
more specific.

--
DaveA
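
A rough way to check this reading of PEP 393 is the sketch below. It is
not CPython's actual implementation, and the helper names pep393_width
and latin1_is_low_bytes are made up for illustration. It picks a width
from the largest code point, and for 1-byte strings it checks that the
Latin-1 bytes are just the low byte of each UCS-4 unit, i.e. only the
zero bytes are dropped.

```python
def pep393_width(s):
    """Bytes per character implied by the largest code point in s.

    Hypothetical helper: it mirrors the 1/2/4 rule as read from PEP 393,
    it does not inspect the interpreter's real storage.
    """
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1   # every code point fits in 8 bits
    if top < 0x10000:
        return 2   # every code point fits in 16 bits
    return 4       # at least one code point needs more than 16 bits


def latin1_is_low_bytes(s):
    """For a 1-byte-wide string, compare Latin-1 with UCS-4 low bytes.

    utf-32-le stores each code point as 4 little-endian bytes, so taking
    every 4th byte keeps only the low byte of each code point.
    Only call this on strings where pep393_width(s) == 1.
    """
    return s.encode("utf-32-le")[::4] == s.encode("latin-1")


samples = ["hello", "fran\u00e7ais", "\u0153uvre", "\u20ac100", "\U0001F600"]
for s in samples:
    print(ascii(s), pep393_width(s))
```

Under this rule "fran\u00e7ais" stays at 1 byte per character, while
French text containing U+0153 (the oe ligature) or U+20AC (the euro
sign) would need 2 bytes per character, since both lie above U+00FF.
Nothing here measures speed, which is the part that still needs a
specific benchmark.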