Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!ecngs!feeder2.ecngs.de!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Frank Millman <frank@chagford.com>
Subject: Re: Flexible string representation, unicode, typography, ...
Date: Sat, 25 Aug 2012 11:46:34 +0200
References: <a81cd504-d889-4aa1-9daa-6df3448b4da8@googlegroups.com> <1874857c-68ef-4c1b-b15a-46ef47df9445@googlegroups.com> <mailman.3784.1345854291.4697.python-list@python.org> <1cb3f062-eb45-4b0c-977b-76afb099923c@googlegroups.com> <k1a40u$r47$2@ger.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:14.0) Gecko/20120713 Thunderbird/14.0
In-Reply-To: <k1a40u$r47$2@ger.gmane.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3793.1345888006.4697.python-list@python.org>
Lines: 46
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:27860

On 25/08/2012 10:58, Mark Lawrence wrote:
> On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
>>
>> Unicode design: a flat table of code points, where all code
>> points are "equals".
>> As soon as one attempts to escape from this rule, one has to
>> "pay" for it.
>> The creator of this machinery (flexible string representation)
>> can not even benefit from it in his native language (I think
>> I'm correctly informed).
>>
>> Hint: Google -> "Das grosse Eszett"
>>
>> jmf
>>
>
> It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> still baffled as to the point if any.  Could someone please enlighten me?
>

Here's what I think he is saying. I am posting this to test the water. I 
am also confused, and if I have got it wrong hopefully someone will 
correct me.

In Python 3.3, unicode strings are now stored as follows -
   if every character is in the range U+0000 to U+00FF, the entire string is
composed of 1-byte characters
   else if every character is in the range U+0000 to U+FFFF, the entire
string is composed of 2-byte characters
   else the entire string is composed of 4-byte characters
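
To check my understanding of the three widths, here is a rough sketch. It 
assumes CPython 3.3 or later, and that sys.getsizeof() reflects the 
internal storage. bytes_per_char() is just a helper name I made up; it 
compares two lengths of the same repeated character so that the fixed 
per-object header cancels out.

```python
import sys

def bytes_per_char(ch):
    """Estimate how many bytes CPython stores per character for a
    string made up only of `ch` (hypothetical helper, CPython-specific)."""
    short = sys.getsizeof(ch * 100)
    long_ = sys.getsizeof(ch * 200)
    # The object header is the same for both, so the difference is
    # the storage used by the 100 extra characters.
    return (long_ - short) // 100

for label, ch in [
    ("ASCII 'a'", "a"),
    ("Latin-1 U+00E9", "\u00e9"),
    ("BMP U+20AC", "\u20ac"),
    ("non-BMP U+1F600", "\U0001F600"),
]:
    print(label, bytes_per_char(ch))
```

If I have the scheme right, this should report 1, 1, 2 and 4 respectively.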

There is an overhead in making this choice, because Python has to find 
the widest character in a string before it can pick the lowest number 
of bytes required.

jmfauth believes that this only benefits 'English-speaking' users, as 
the rest of the world will tend to have strings where at least one 
character requires 2 or 4 bytes. So they incur the overhead, without 
getting any benefit.
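
A small sketch of the 'one wide character' case, again assuming CPython 
3.3 or later and that sys.getsizeof() is a fair measure. The variable 
names are mine. It only looks at memory, not at the time spent choosing 
the width.

```python
import sys

# Three strings of identical length; only the last character differs.
ascii_only = "a" * 1000
one_euro = "a" * 999 + "\u20ac"        # one character above U+00FF
one_emoji = "a" * 999 + "\U0001F600"   # one character above U+FFFF

sizes = {
    "ascii_only": sys.getsizeof(ascii_only),
    "one_euro": sys.getsizeof(one_euro),
    "one_emoji": sys.getsizeof(one_emoji),
}

for name, size in sizes.items():
    print(name, size)
```

As far as I can tell, a single character outside the 1-byte range does 
widen the whole string, but a string that fits in 2 bytes per character 
still takes about half the memory it would need at a fixed 4 bytes.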

Therefore, I think he is saying that he would have preferred that Python 
standardise on 4-byte characters for all strings, on the grounds that the 
saving in memory does not justify the performance overhead.

Frank Millman