Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!eternal-september.org!feeder.eternal-september.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Sun, 19 Aug 2012 08:35:23 -0400
From: Dave Angel <d@davea.name>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120714 Thunderbird/14.0
MIME-Version: 1.0
To: wxjmfauth@gmail.com
Subject: Re: New internal string format in 3.3
References: <f801e06f-f7b2-4aca-b352-66856a939746@googlegroups.com> <308df2af-abe7-4043-b199-0a39f440e0ab@googlegroups.com> <502f8a2a$0$29978$c3e8da3$5496439d@news.astraweb.com> <d575737d-c1e3-47db-9c7b-10fe0300cba7@googlegroups.com> <mailman.3457.1345305136.4697.python-list@python.org> <4c62a649-bc21-4e47-9c0f-acb1b1e70e36@googlegroups.com> <mailman.3462.1345310859.4697.python-list@python.org> <f9beca36-3a12-41f2-bdc2-95b159c162d1@googlegroups.com> <mailman.3468.1345314897.4697.python-list@python.org> <5030891f$0$29978$c3e8da3$5496439d@news.astraweb.com> <mailman.3485.1345362201.4697.python-list@python.org> <5030aa44$0$29978$c3e8da3$5496439d@news.astraweb.com> <mailman.3489.1345369039.4697.python-list@python.org> <11931ec9-1858-4ae8-8a61-1d154d105229@googlegroups.com> <mailman.3492.1345372006.4697.python-list@python.org> <73c85f3b-a4a9-4812-bc41-132b5126874c@googlegroups.com>
In-Reply-To: <73c85f3b-a4a9-4812-bc41-132b5126874c@googlegroups.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Cc: python-list@python.org
Precedence: list
Reply-To: d@davea.name
Newsgroups: comp.lang.python
Message-ID: <mailman.3498.1345379751.4697.python-list@python.org>
Lines: 39
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:27385

(pardon the resend, but I accidentally omitted a couple of words)
On 08/19/2012 08:14 AM, wxjmfauth@gmail.com wrote:
> On Sunday, 19 August 2012 12:26:44 UTC+2, Chris Angelico wrote:
>> <SNIP>
>>
>>
>> No, it uses Unicode, and as an optimization, attempts to store the
>> codepoints in less than four bytes for most strings. The fact that a
>> one-byte storage format happens to look like latin-1 is rather
>> coincidental.
>>
> And this is the common basic mistake. You do not push your
> argumentation far enough. A character may "fall" accidentally in a latin-1.
> The problem lies in these european characters, which can not fall in this
> coding. This *is* the cause of the negative side effects.
> If you are using a correct coding scheme, like cp1252, mac-roman or
> iso-8859-15, you will never see such a negative side effect.
> Again, the problem is not the result, the encoded character. The critical
> part is the character which may cause this side effect.
> You should think "character set" and not encoded "code point", considering
> this kind of expression has a sense in 8-bits coding scheme.
>
> jmf

But that choice was made decades ago, when Unicode picked its second 128
characters.  The internal form used in this PEP is simply the low-order
byte of the Unicode code point.  Scanning the string to decide whether
converting to cp1252 (for example) would work would be a much more
expensive operation than finding how many bytes the largest code point
needs.
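
To make the "low-order byte" point concrete, here is a small sketch.  The
sample string and the helper name fits() are mine, purely for
illustration; only ord() and str.encode() come from Python itself.  It
shows that for code points below 256 the low-order byte is exactly what
latin-1 would produce, and that the euro sign (U+20AC) has a slot in
cp1252 but not in latin-1.

```python
# Illustration only: sample data and helper names are hypothetical.
text = "caf\u00e9"  # 'café', every code point is below 256

# Take the low-order byte of each code point, as described above.
low_order_bytes = bytes(ord(ch) & 0xFF for ch in text)

# For code points below 256 this is the same byte string latin-1 gives.
latin1_matches = low_order_bytes == text.encode("latin-1")


def fits(s, encoding):
    """Return True if every character of s can be encoded with `encoding`."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The euro sign is U+20AC: cp1252 maps it to a single byte, latin-1 cannot.
euro = "\u20ac"
euro_cp1252 = euro.encode("cp1252")
euro_fits_latin1 = fits(euro, "latin-1")
```

Note that fits() has to attempt a full encode against a code page table,
which is the kind of work the internal format avoids.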

The 8 bit form is used if all the code points are less than 256.  That
is a simple description, and simple code.  As several people have said,
the fact that this byte matches one of the DECODED forms is coincidence.
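
In Python terms, the rule could be sketched roughly like this.  The
function name storage_width() is hypothetical, and this is not the
actual C code: the 256 threshold is the one stated above, the 65536
threshold for the two-byte form is taken from PEP 393, and the real
implementation also treats pure-ASCII strings as a special case of the
one-byte form.

```python
def storage_width(s):
    """Bytes per code point a PEP 393-style layout would pick (sketch).

    Hypothetical helper for illustration, not a CPython API.
    """
    # One pass to find the largest code point; 0 for the empty string.
    largest = max(map(ord, s), default=0)
    if largest < 256:
        return 1  # the 8 bit form: low-order byte of each code point
    if largest < 65536:
        return 2  # assumption from PEP 393: 16 bit form
    return 4      # anything beyond the BMP needs the full width


# 'é' (U+00E9) keeps the string in the one-byte form; a single euro sign
# (U+20AC) is enough to push the whole string to two bytes per code point.
widths = [storage_width(s) for s in ("abc", "caf\u00e9", "\u20ac", "\U0001F600")]
```

The only question asked of the string is "what is the largest code
point?", with no reference to any code page.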

-- 

DaveA