Date: Fri, 29 Mar 2013 02:00:24 +0000
From: MRAB <python@mrabarnett.plus.com>
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130307 Thunderbird/17.0.4
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]
References: <mailman.3703.1364248275.2939.python-list@python.org> <kitdqr$4m4$2@ger.gmane.org> <nad-8CB9C0.18315026032013@news.gmane.org> <mailman.3805.1364385073.2939.python-list@python.org> <5153a12d$0$29998$c3e8da3$5496439d@news.astraweb.com> <mailman.3845.1364441182.2939.python-list@python.org> <d2cc443a-e049-42ed-abc6-66b5ea600fe7@j1g2000pbq.googlegroups.com> <mailman.3860.1364451682.2939.python-list@python.org> <987c4bd9-0e5e-4387-9c78-1075a77d3c47@c6g2000yqh.googlegroups.com> <mailman.3863.1364463394.2939.python-list@python.org> <rOednY4OeOjbqcnMnZ2dnUVZ_oWdnZ2d@westnet.com.au> <51543f45$0$29998$c3e8da3$5496439d@news.astraweb.com> <944f195c-cbfe-47e1-a963-05fe3d98238d@5g2000yqz.googlegroups.com> <CAPTjJmr-u_53Zyj-b120M-UqrBc1=_2R5W+Kou2GhKHJPkficA@mail.gmail.com> <mailman.3898.1364487167.2939.python-list@python.org> <5154e2dd$0$29974$c3e8da3$5496439d@news.astraweb.com> <CAPTjJmo2oBHs-uh186KzYNEMU89xZSAHmQmOXVj96x30jgk6tQ@mail.gmail.com>
In-Reply-To: <CAPTjJmo2oBHs-uh186KzYNEMU89xZSAHmQmOXVj96x30jgk6tQ@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Reply-To: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.3931.1364522420.2939.python-list@python.org>

On 29/03/2013 00:54, Chris Angelico wrote:
> On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
>> strings. It's only strings in the SMPs that could need surrogate pairs,
>> and they don't need them in Python's implementation since it's a full 32-
>> bit implementation. So where do the surrogate pairs come into this?
>
> PEP 393 says:
> """
> wstr_length, wstr: representation in platform's wchar_t
> (null-terminated). If wchar_t is 16-bit, this form may use surrogate
> pairs (in which cast wstr_length differs form length). wstr_length
> differs from length only if there are surrogate pairs in the
> representation.
>
> utf8_length, utf8: UTF-8 representation (null-terminated).
>
> data: shortest-form representation of the unicode string. The string
> is null-terminated (in its respective representation).
>
> All three representations are optional, although the data form is
> considered the canonical representation which can be absent only while
> the string is being created. If the representation is absent, the
> pointer is NULL, and the corresponding length field may contain
> arbitrary data.
> """
>
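
As a rough illustration of when wstr_length and length would differ,
here is a sketch that only counts code points and 16-bit code units
from Python. It doesn't touch the C struct at all, and the variable
names are mine, not anything from the PEP:

```python
# Illustrative only: compare code points with UTF-16 code units.
s = "a\U0001F600"   # one BMP character plus one non-BMP character

# Python 3.3+ counts code points, whatever the internal width is.
code_points = len(s)

# Encode without a BOM and count 16-bit units; the non-BMP character
# needs a surrogate pair, so it contributes two units.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)
```

With a 16-bit wchar_t the second figure is the one that would
correspond to wstr_length; for BMP-only strings the two counts agree.
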
> If the string was created from a wchar_t string, that string will be
> retained, and presumably can be used to re-output the original for a
> clean and fast round-trip. Same with...
>
>> I also wonder why the implementation bothers keeping a UTF-8
>> representation. That sounds like premature optimization to me. Surely you
>> only need it when writing to a file with UTF-8 encoding? For most
>> strings, that will never happen.
>
> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
> of content will go out in the same encoding it came in in, so it makes
> sense to hang onto it where possible.
>
> Though, from the same quote: The UTF-8 representation is
> null-terminated. Does this mean that it can't be used if there might
> be a \0 in the string?
>
You could ask the same question about any encoding.

It's only an issue if the buffer is passed to a C function that expects
a null-terminated string. Code that uses the stored length field is not
affected by an embedded \0.
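
A minimal sketch of the difference, using ctypes to stand in for "a C
function which expects a null-terminated string". The names here are
just for illustration; nothing below is part of the PEP 393 API:

```python
import ctypes

s = "ab\0cd"                  # a str with an embedded NUL
encoded = s.encode("utf-8")   # five bytes, the NUL included

# Copy into a C char array; ctypes appends a terminating NUL.
buf = ctypes.create_string_buffer(encoded)

# .value stops at the first NUL, as a strlen()-style reader would.
c_view = buf.value

# .raw is length-aware; slice off the extra terminator.
full = buf.raw[:len(encoded)]

print(c_view, full)
```

The bytes are all still there; only the reader that relies on the
terminator loses the tail.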

> Minor nitpick, btw:
>> (in which cast wstr_length differs form length)
> Should be "in which case" and "from". Who has the power to correct
> typos in PEPs?
>