Date: Sun, 07 Sep 2014 15:52:28 +0100
From: MRAB <python@mrabarnett.plus.com>
To: python-list@python.org
Subject: Re: How to turn a string into a list of integers?
Newsgroups: comp.lang.python
Message-ID: <mailman.13849.1410101559.18130.python-list@python.org>

On 2014-09-07 02:47, Steven D'Aprano wrote:
> Kurt Mueller wrote:
>
>> Processing any Unicode string will work with small and wide
>> python 2.7 builds and also with python >3.3?
>> ( parts of small build python will not work with values over 0xFFFF )
>> ( strings with surrogate pairs will not work correctly on small build
>> python )
>
>
> If you limit yourself to code points in the Basic Multilingual Plane, U+0000
> to U+FFFF, then Python's Unicode handling works fine no matter what version
> or implementation is used. Since most people use only the BMP, you may not
> notice any problems.
>
> (Of course, there are performance and memory-usage differences from one
> version to the next, but the functionality works correctly.)
>
> If you use characters from the supplementary planes ("astral characters"),
> then:
>
> * wide builds will behave correctly;
> * narrow builds will wrongly treat astral characters as two
>    independent characters, which means functions like len()
>    and string slicing will do the wrong thing;
> * Python 3.3 doesn't use narrow and wide builds any more,
>    and also behaves correctly with astral characters.
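
A minimal sketch of the difference described in the list above, assuming it is run on Python 3.3 or later. The helper name utf16_code_units is hypothetical; it counts UTF-16 code units, which is roughly what len() reported for the same text on a narrow build.

```python
import sys

def utf16_code_units(text):
    """Count the UTF-16 code units needed to represent text.

    On a narrow build, len() effectively reported this number,
    because astral characters were stored as surrogate pairs.
    """
    return len(text.encode("utf-16-le")) // 2

bmp = "\u20ac"         # EURO SIGN, inside the Basic Multilingual Plane
astral = "\U0001F600"  # a code point outside the BMP

# 0x10FFFF on Python 3.3+ and on wide builds; 0xFFFF on narrow builds.
print(hex(sys.maxunicode))

# Code points (what len() counts on 3.3+) versus UTF-16 code units.
print(len(bmp), utf16_code_units(bmp))
print(len(astral), utf16_code_units(astral))
```

On 3.3+ the two counts agree for the BMP character and differ for the astral one, which is the case where len() and slicing went wrong on narrow builds.
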
>
>
> So there are three strategies for correct Unicode support in Python:
>
> * avoid astral characters (and trust your users will also avoid them);
>
> * use a wide build;
>
> * use Python 3.3 or higher.
>
>
> In case you are wondering what Python 3.3 does differently, when it builds a
> string, it works out the largest code point in the string. If the largest
> code point is no greater than U+00FF, it stores the string in Latin 1 using
> 8 bits per character; if the largest code point is no greater than U+FFFF,
> then it uses UTF-16 (or UCS-2, since with the BMP they are functionally the
> same); if the string contains any astral characters, then it uses UTF-32.
> So regardless of the string, each character uses a single code unit. Only
> the size of the code unit varies.
>
I don't think you should say that it stores the string in Latin-1 or
UTF-16, because that might suggest that the strings are encoded. They aren't.
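
As a rough illustration, the width used per code point can be observed from outside. This is only a sketch, assuming CPython 3.3 or later, where sys.getsizeof() reflects the string's internal storage; the helper name bytes_per_code_point is hypothetical, and other implementations may report different numbers.

```python
import sys

def bytes_per_code_point(ch):
    """Estimate how many bytes CPython uses per code point for strings
    made of ch, by comparing two strings that differ by one character
    (this cancels out the fixed object header)."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

samples = [
    ("ASCII", "a"),
    ("up to U+00FF", "\xe9"),
    ("up to U+FFFF", "\u20ac"),
    ("astral", "\U0001F600"),
]

for label, ch in samples:
    print(label, bytes_per_code_point(ch))
```

The widths come out as fixed sizes per code point, chosen by the largest code point in the string. There are no surrogate pairs, byte order marks, or variable-length sequences involved, which is why "encoded" is a misleading word for it.
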