MIME-Version: 1.0
In-Reply-To: <54fc9400$0$13009$c3e8da3$5496439d@news.astraweb.com>
References: <aae131a7-29a1-4f79-ac16-d1e223616c51@googlegroups.com> <ef520397-b1f0-47bf-8d24-585a9ba230e2@googlegroups.com> <CAPTjJmreaPu7MZQgmbFNnhhg9R6w9dHPPo=yBbMoG85HxK+H_Q@mail.gmail.com> <mailman.19274.1424970167.18130.python-list@python.org> <9169f3b1-2ac7-42a3-8033-584f84b88a1f@googlegroups.com> <7a75a23c-4678-4d7a-a2ec-9e8fff4c07f8@googlegroups.com> <132d5ce6-f672-4eec-99f9-1cc9e88b94f3@googlegroups.com> <mailman.33.1425444900.21433.python-list@python.org> <619e4cb5-1c4c-449b-a5d7-951101b32b45@googlegroups.com> <54f862ca$0$13014$c3e8da3$5496439d@news.astraweb.com> <c6caaa76-f448-4c2f-8874-c1f2716da744@googlegroups.com> <54fadc70$0$13004$c3e8da3$5496439d@news.astraweb.com> <87twxxxbvd.fsf@elektro.pacujo.net> <54fb1bf4$0$12993$c3e8da3$5496439d@news.astraweb.com> <87twxw4xlz.fsf@elektro.pacujo.net> <54fba9d4$0$12988$c3e8da3$5496439d@news.astraweb.com> <87y4n8uf9a.fsf@elektro.pacujo.net> <mailman.163.1425800257.21433.python-list@python.org> <87twxvvrjl.fsf@elektro.pacujo.net> <54fc9400$0$13009$c3e8da3$5496439d@news.astraweb.com>
Date: Mon, 9 Mar 2015 08:13:48 +1100
Subject: Re: Newbie question about text encoding
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.175.1425849237.21433.python-list@python.org>

On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.

As to the notion of rejecting the construction of strings containing
these invalid code points, I'm not sure. Are there any languages out
there that have a Unicode string type that requires all code points
to be valid (no lone surrogates, no noncharacters such as U+FFFE,
etc.)? This is the kind of thing that's usually done in an obscure
language before it hits a mainstream one.

Pike is similar to Python here. I can create a string with invalid
code points in it:

> "\uFFFE\uDD00";
(1) Result: "\ufffe\udd00"

but I can't UTF-8 encode that:

> string_to_utf8("\uFFFE\uDD00");
Character 0x0000dd00 at index 1 is in the surrogate range and therefore invalid.
Unknown program: string_to_utf8("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()

Or, using the streaming UTF-8 encoder instead of the short-hand:

> Charset.encoder("UTF-8")->feed("\uFFFE\uDD00")->drain();
Error encoding "\ufffe"[0xdd00] using utf8: Unsupported character 56576.
/usr/local/pike/8.1.0/lib/modules/_Charset.so:1:
    _Charset.UTF8enc()->feed("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()
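
For comparison, here is a rough sketch of the same experiment on the
Python 3 side. The variable names are mine and purely illustrative.
It assumes CPython 3's strict UTF-8 codec and the standard
"surrogatepass" error handler. It also shows that only the surrogate
is rejected by the encoder; U+FFFE on its own encodes fine.

```python
# Python 3 lets a str hold a noncharacter and a lone surrogate.
s = "\ufffe\udd00"
assert len(s) == 2

# Strict UTF-8 encoding refuses the surrogate at index 1.
err = None
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    err = e  # keep a reference; 'e' is unbound after the except block

# U+FFFE alone is a noncharacter but still a Unicode scalar value,
# so the strict encoder accepts it.
utf8_fffe = "\ufffe".encode("utf-8")

# The 'surrogatepass' handler deliberately emits the (non-standard)
# three-byte form for the surrogate instead of raising.
raw = s.encode("utf-8", "surrogatepass")

print(err.start, err.reason)
print(utf8_fffe, raw)
```

So in both languages the string can be built, and the failure only
shows up at encode time unless you explicitly ask for it to be
passed through.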

Does anyone know of a language where you can't even construct the string?

ChrisA