Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <kspja1$64u$1@ger.gmane.org>
References: <mailman.4618.1373613834.3114.python-list@python.org> <571a6dfe-fd66-42cf-92fc-8b97cbe6e9e4@googlegroups.com> <51DFDE65.5040001@Gmail.com> <CAN1F8qUFP3uX57HhiiUPaYqO3h_HiT8Q_YD=vCYky3EAWsdE7Q@mail.gmail.com> <mailman.4666.1373670835.3114.python-list@python.org> <4f1067f6-bc99-42ad-9166-37fb228b90e8@googlegroups.com> <CAPTjJmoRUUuQgpjukkctnGe4ydHGYr2CEgD9uDtmBRHSejpW0g@mail.gmail.com> <CA+vVgJVxAHEk_Z4uzZFNYtmEFXmyyoo0Jg_rBeqHSDBuF0E1Ug@mail.gmail.com> <CAPTjJmqBs0PfQ5ahfURtSyCimPnFU-1HpFndO8h6T+Vbkj10Bg@mail.gmail.com> <51EFEC17.90303@gmail.com> <ksp493$m5g$1@ger.gmane.org> <CAPTjJmoCsg0GX0ntGowbzm-gODCA8ttnf3QFirbsWa+mu0PCpw@mail.gmail.com> <kspja1$64u$1@ger.gmane.org>
Date: Thu, 25 Jul 2013 08:19:21 +1000
Subject: Re: RE Module Performance
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5068.1374704365.3114.python-list@python.org>
Lines: 32
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:51171

On Thu, Jul 25, 2013 at 8:09 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 7/24/2013 2:15 PM, Chris Angelico wrote:
>> To my mind, exposing UTF-16 surrogates to the application is a bug
>> to be fixed, not a feature to be maintained.
>
> It is definitely not a feature, but a proper UTF-16 implementation would not
> expose them except to codecs, just as with the PEP 393 implementation. (In
> both cases, I am excluding the sys size function as 'exposing to the
> application'.)
>
>> But since we can get the best of both worlds with only
>> a small amount of overhead, I really don't see why anyone should be
>> objecting.
>
> I presume you are referring to the PEP 393 1-2-4 byte implementation. Given
> how well it has been optimized, I think it was the right choice for Python.
> But a language that now uses UCS-2 or defective UTF-16 on all platforms might
> find the auxiliary array an easier fix.
>

I'm referring here to objections like jmf's, and also to threads like this:

http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html

According to the ECMAScript people, UTF-16 and exposing surrogates to
the application is a critical feature to be maintained. I disagree.
But it's not my language, so I'm stuck with it. (I ended up writing a
little wrapper function in C that detects unpaired surrogates, but
that still doesn't deal with the possibility that character indexing
can create a new character that was never there to start with.)
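
To make both problems concrete, here is a rough Python sketch. It is not
the C wrapper mentioned above, just an illustration; the helper names
(utf16_units, unpaired_surrogates, decode_units) are made up for this
example. It works on lists of 16-bit code units, which is roughly what a
UTF-16 language hands the application.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of a well-formed str as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def decode_units(units):
    """Turn a list of UTF-16 code units back into a str (must be well-formed)."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

def unpaired_surrogates(units):
    """Return indexes of surrogate code units that are not part of a valid pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows directly.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed by a preceding high surrogate.
            bad.append(i)
        i += 1
    return bad

# U+1F600 followed by U+10000: [0xD83D, 0xDE00, 0xD800, 0xDC00]
units = utf16_units("\U0001F600\U00010000")

# Slicing by code unit can leave a lone surrogate, which the check catches.
broken = units[:1]

# Splicing the first high surrogate onto the last low surrogate passes the
# check, yet decodes to U+1F400, a character that was never in the input.
spliced = [units[0], units[3]]
```

The unpaired check flags `broken`, but it has nothing to say about
`spliced`: the result is perfectly valid UTF-16, it just isn't made of
characters that were in the original string.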

ChrisA