MIME-Version: 1.0
In-Reply-To: <87twxvvrjl.fsf@elektro.pacujo.net>
References: <aae131a7-29a1-4f79-ac16-d1e223616c51@googlegroups.com> <ef520397-b1f0-47bf-8d24-585a9ba230e2@googlegroups.com> <CAPTjJmreaPu7MZQgmbFNnhhg9R6w9dHPPo=yBbMoG85HxK+H_Q@mail.gmail.com> <mailman.19274.1424970167.18130.python-list@python.org> <9169f3b1-2ac7-42a3-8033-584f84b88a1f@googlegroups.com> <7a75a23c-4678-4d7a-a2ec-9e8fff4c07f8@googlegroups.com> <132d5ce6-f672-4eec-99f9-1cc9e88b94f3@googlegroups.com> <mailman.33.1425444900.21433.python-list@python.org> <619e4cb5-1c4c-449b-a5d7-951101b32b45@googlegroups.com> <54f862ca$0$13014$c3e8da3$5496439d@news.astraweb.com> <c6caaa76-f448-4c2f-8874-c1f2716da744@googlegroups.com> <54fadc70$0$13004$c3e8da3$5496439d@news.astraweb.com> <87twxxxbvd.fsf@elektro.pacujo.net> <54fb1bf4$0$12993$c3e8da3$5496439d@news.astraweb.com> <87twxw4xlz.fsf@elektro.pacujo.net> <54fba9d4$0$12988$c3e8da3$5496439d@news.astraweb.com> <87y4n8uf9a.fsf@elektro.pacujo.net> <mailman.163.1425800257.21433.python-list@python.org> <87twxvvrjl.fsf@elektro.pacujo.net>
Date: Sun, 8 Mar 2015 19:23:37 +1100
Subject: Re: Newbie question about text encoding
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.166.1425803025.21433.python-list@python.org>

On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character. It is quite
>> correct to throw this error.
>
> '\udd00' is a valid str object:
>
>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'
>
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.

But it's not a valid Unicode string: U+DD00 is a lone surrogate, not a
character, so a Unicode encoding can't be expected to cope with it.
Mathematically, 0xC0 0x80 would represent U+0000, and some UTF-8
codecs generate and accept this overlong form (in order to allow
U+0000 without ever yielding a 0x00 byte), but that doesn't mean that
UTF-8 should allow that byte sequence.
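
As a rough illustration of both points, here is a minimal sketch of
how CPython 3's strict UTF-8 codec treats the two cases. The helper
names utf8_rejects and utf8_cannot_encode are made up for this
example; only the str.encode / bytes.decode calls are real API.

```python
def utf8_rejects(data):
    """Return True if the strict UTF-8 decoder refuses these bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def utf8_cannot_encode(text):
    """Return True if the strict UTF-8 encoder refuses this string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# 0xC0 0x80 is the overlong two-byte form of U+0000.
overlong_nul = b"\xc0\x80"
# U+DD00 is a lone surrogate: a legal Python str, but not a character.
lone_surrogate = "\udd00"

print(utf8_rejects(overlong_nul))          # overlong form is refused
print(utf8_cannot_encode(lone_surrogate))  # lone surrogate is refused
print(utf8_rejects(b"\x00"))               # the one-byte NUL is fine
```

So the str object exists, but the codec treats both the overlong bytes
and the lone surrogate as invalid data rather than guessing.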

The only reason to craft some kind of Unicode string for any arbitrary
sequence of bytes is the "smuggling" effect used for file name
handling (the surrogateescape error handler). Beyond that, there is no
reason to support invalid Unicode code points.
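
For anyone following along, the smuggling looks roughly like this. It
is only a sketch: the file name bytes are an invented example, and I
am calling the codec directly instead of going through os.fsdecode so
that the result does not depend on the platform's file system
encoding.

```python
# A file name in Latin-1: the byte 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to U+DCNN.
smuggled = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns U+DCNN back into byte 0xNN.
restored = smuggled.encode("utf-8", errors="surrogateescape")

# Without the handler, the smuggled string is not encodable at all.
try:
    smuggled.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(smuggled))
print(restored == raw_name)
print(strict_ok)
```

The round trip only works because both directions opt in to the same
error handler; the strict codec still refuses the surrogate.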

ChrisA