Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Subject: Re: Flexible string representation, unicode, typography, ...
Date: Sat, 25 Aug 2012 12:05:08 +0100
References: <a81cd504-d889-4aa1-9daa-6df3448b4da8@googlegroups.com> <1874857c-68ef-4c1b-b15a-46ef47df9445@googlegroups.com> <mailman.3784.1345854291.4697.python-list@python.org> <1cb3f062-eb45-4b0c-977b-76afb099923c@googlegroups.com> <k1a40u$r47$2@ger.gmane.org> <k1a6to$gku$1@ger.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:14.0) Gecko/20120713 Thunderbird/14.0
In-Reply-To: <k1a6to$gku$1@ger.gmane.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3797.1345892703.4697.python-list@python.org>
Lines: 57
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:27865

On 25/08/2012 10:46, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
>> On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
>>>
>>> Unicode design: a flat table of code points, where all code
>>> points are "equal".
>>> As soon as one attempts to escape from this rule, one has to
>>> "pay" for it.
>>> The creator of this machinery (flexible string representation)
>>> can not even benefit from it in his native language (I think
>>> I'm correctly informed).
>>>
>>> Hint: Google -> "Das grosse Eszett"
>>>
>>> jmf
>>>
>>
>> It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
>> still baffled as to the point, if any.  Could someone please enlighten me?
>>
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In Python 3.3, unicode strings are now stored as follows -
>    if all characters can be represented by 1 byte, the entire string is
> composed of 1-byte characters
>    else if all characters can be represented by 1 or 2 bytes, the entire
> string is composed of 2-byte characters
>    else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'english-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that Python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman
>
>

I thought Terry Reedy had shot down any claims about performance 
overhead, and that the memory savings in many cases must be substantial 
and therefore worthwhile.  Or have I misread something?  Or what?
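
For anyone who wants to check the memory side on their own machine, here 
is a minimal sketch, assuming CPython 3.3 or later.  The helper name 
bytes_per_char is made up for this post and is not part of any library.  
It compares sys.getsizeof() for two lengths of the same repeated 
character, so the fixed object header cancels out, and then shows what 
happens when one wider character is added to an otherwise ASCII string.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of ch."""
    # Subtracting the sizes of two lengths cancels the fixed header.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII 'a'", "a"),
    ("small eszett, U+00DF (Latin-1)", "\u00df"),
    ("capital eszett, U+1E9E (BMP)", "\u1e9e"),
    ("U+1F600 (outside the BMP)", "\U0001F600"),
]
for label, ch in samples:
    print(label, "->", bytes_per_char(ch), "byte(s) per character")

# A single wider character changes the width of the whole string.
narrow = "a" * 1000
mixed = "a" * 999 + "\u20ac"   # 999 ASCII letters plus one euro sign
print(sys.getsizeof(narrow), sys.getsizeof(mixed))
```

On CPython 3.3 or later I would expect 1, 1, 2 and 4 bytes per character 
for the four samples, and the mixed string to come out at roughly twice 
the size of the pure ASCII one.  Other implementations, or older 
versions, may well report something different.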

-- 
Cheers.

Mark Lawrence.