From: MRAB
Date: Sun, 07 Sep 2014 15:52:28 +0100
Newsgroups: comp.lang.python
Subject: Re: How to turn a string into a list of integers?

On 2014-09-07 02:47, Steven D'Aprano wrote:
> Kurt Mueller wrote:
>
>> Processing any Unicode string will work with small and wide
>> python 2.7 builds and also with python >3.3?
>> ( parts of small build python will not work with values over 0xFFFF )
>> ( strings with surrogate pairs will not work correctly on small build
>> python )
>
>
> If you limit yourself to code points in the Basic Multilingual Plane, U+0000
> to U+FFFF, then Python's Unicode handling works fine no matter what version
> or implementation is used. Since most people use only the BMP, you may not
> notice any problems.
>
> (Of course, there are performance and memory-usage differences from one
> version to the next, but the functionality works correctly.)
>
> If you use characters from the supplementary planes ("astral characters"),
> then:
>
> * wide builds will behave correctly;
> * narrow builds will wrongly treat astral characters as two
>   independent characters, which means functions like len()
>   and string slicing will do the wrong thing;
> * Python 3.3 doesn't use narrow and wide builds any more,
>   and also behaves correctly with astral characters.
>
>
> So there are three strategies for correct Unicode support in Python:
>
> * avoid astral characters (and trust your users will also avoid them);
>
> * use a wide build;
>
> * use Python 3.3 or higher.
>
>
> In case you are wondering what Python 3.3 does differently, when it builds a
> string, it works out the largest code point in the string. If the largest
> code point is no greater than U+00FF, it stores the string in Latin 1 using
> 8 bits per character; if the largest code point is no greater than U+FFFF,
> then it uses UTF-16 (or UCS-2, since with the BMP they are functionally the
> same); if the string contains any astral characters, then it uses UTF-32.
> So regardless of the string, each character uses a single code unit. Only
> the size of the code unit varies.
>
I don't think you should be saying that it stores the string in Latin-1
or UTF-16 because that might suggest that they are encoded. They aren't.
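
To illustrate the point about fixed-width storage rather than encoding, here is a rough sketch. It assumes CPython 3.3 or later and relies on sys.getsizeof, so it is an implementation-specific measurement, not a language guarantee. storage_width is a hypothetical helper made up for this example; it is not part of any library.

```python
import sys

def storage_width(ch):
    """Rough bytes-per-character for a string made only of `ch`.

    Hypothetical helper: compares sys.getsizeof for two lengths of the
    same repeated character, so the fixed object header cancels out.
    Relies on CPython's sys.getsizeof; other implementations may differ.
    """
    n = 64
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# One sample per range: ASCII, Latin-1, rest of the BMP, astral.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print("U+%04X  len=%d  width=%d" % (ord(ch), len(ch), storage_width(ch)))

# The question in the subject line: a string as a list of integers.
code_points = [ord(c) for c in "a\u20ac\U0001F600"]
print(code_points)
```

On such an interpreter I would expect len() to be 1 for every sample and the widths to come out as 1, 1, 2 and 4 bytes. Each element of the string is a whole code point stored in a fixed-size slot, which is why indexing and ord() never see surrogates or multi-byte sequences.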