Date: Thu, 25 Jul 2013 15:58:34 +1000
Subject: Re: RE Module Performance
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.5082.1374732265.3114.python-list@python.org>

On Thu, Jul 25, 2013 at 3:49 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
> 24.07.13 21:15, Chris Angelico wrote:
>
>> To my mind, exposing UTF-16
>> surrogates to the application is a bug to be fixed, not a feature to
>> be maintained.
>
>
> Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates
> area) to represent undecodable bytes with surrogateescape error handler.

That's a deliberate and conscious use of the codepoints; that's not
what I'm talking about here. Suppose you read a UTF-8 stream of bytes
from a file, and decode them into your language's standard string
type. At this point, you should be working with a string of Unicode
codepoints:

"\22\341\210\264\360\222\215\205"

-->

"\x12\u1234\U00012345"
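
For anyone who wants to check those numbers, here is a minimal sketch,
assuming Python 3.3 or later (where len() counts code points). The
variable names are mine, purely for illustration:

```python
# The eight UTF-8 bytes from the example above, written with octal escapes.
raw = b"\22\341\210\264\360\222\215\205"

# Decode the byte stream into the language's native string type.
text = raw.decode("utf-8")

byte_count = len(raw)    # length of the incoming byte stream
char_count = len(text)   # length of the resulting string, in code points
codepoints = [hex(ord(c)) for c in text]

print(byte_count, char_count, codepoints)
```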

The incoming byte stream has a length of 8; the resulting character
stream has a length of 3. Now, if the language wants to use UTF-16
internally, it's free to do so:

0012 1234 d808 df45
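
Those four code units can be reproduced with a short sketch. The helper
name to_surrogates is hypothetical (not from any library); it just
applies the standard UTF-16 arithmetic for code points above U+FFFF:

```python
import struct

text = "\x12\u1234\U00012345"

# Encode as big-endian UTF-16 without a BOM, then split into 16-bit units.
encoded = text.encode("utf-16-be")
units = struct.unpack(">%dH" % (len(encoded) // 2), encoded)
unit_hex = " ".join("%04x" % u for u in units)


def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(unit_hex)
print(["%04x" % u for u in to_surrogates(0x12345)])
```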

When I referred to exposing surrogates to the application, this is
what I'm talking about. If decoding the above byte stream results in a
length 4 string where the last two are \ud808 and \udf45, then it's
exposing them. If it's a length 3 string where the last is \U00012345,
then it's hiding them. To be honest, I don't imagine I'll ever see a
language that stores strings in UTF-16 and then exposes them to the
application as UTF-32; there's very little point. But such *is*
possible, and if it's working closely with libraries that demand
UTF-16, it might well make sense to do things that way.
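
To make the exposing/hiding distinction concrete, here is a rough sketch
of a string type that stores UTF-16 internally but presents code points
to the application. The class Utf16BackedString and its methods are
invented for illustration; this is not how any particular language does
it, and it assumes well-formed input with no lone surrogates:

```python
import struct


class Utf16BackedString:
    """Hypothetical string type: UTF-16 storage, code point interface."""

    def __init__(self, text):
        # Internal storage is a tuple of 16-bit code units.
        # encode() rejects lone surrogates, so the units are well-formed.
        data = text.encode("utf-16-be")
        self._units = struct.unpack(">%dH" % (len(data) // 2), data)

    def code_units(self):
        """What an 'exposing' implementation would show the application."""
        return list(self._units)

    def code_points(self):
        """What a 'hiding' implementation shows: pairs are recombined."""
        result = []
        units = iter(self._units)
        for unit in units:
            if 0xD800 <= unit <= 0xDBFF:
                # High surrogate: consume the low surrogate that follows.
                low = next(units)
                result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            else:
                result.append(unit)
        return result

    def __len__(self):
        # Length is reported in code points, not code units.
        return len(self.code_points())


s = Utf16BackedString("\x12\u1234\U00012345")
print(len(s.code_units()), len(s))
```

The cost of hiding is visible in code_points(): indexing and length are
no longer constant-time lookups into the underlying storage.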

ChrisA