Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <51ac3bd6$0$11118$c3e8da3@news.astraweb.com>
References: <2c425f2b-99de-4453-964e-c585f2043f71@googlegroups.com> <slrnkqlrn8.ap4.giorgos.tzampanakis@brilliance.eternal-september.org> <16b6b38b-aa9e-4148-8c97-741a8f593dac@googlegroups.com> <slrnkqmmcj.674.giorgos.tzampanakis@brilliance.eternal-september.org> <mailman.2543.1370184698.3114.python-list@python.org> <f2e4a654-b637-4d21-9a70-c5f95b53cd9c@googlegroups.com> <mailman.2545.1370186659.3114.python-list@python.org> <b05a6f99-f2bb-4065-8be5-ded497eab83a@googlegroups.com> <mailman.2546.1370188541.3114.python-list@python.org> <749e23ce-9b40-4ed4-aa6a-b06c2d7a1c24@googlegroups.com> <mailman.2547.1370190695.3114.python-list@python.org> <18755849-35bc-4925-811a-8f6f9fb5bf9c@googlegroups.com> <CAPTjJmrox7gjEpa9LLU5qH=MsNs7VcZ7ZqpEo8dLM98o-TTECg@mail.gmail.com> <CAN8CLg=OZSfZDNYPgPSo+Oqogfk8Rm-5jv2_SRJW03u600WQKA@mail.gmail.com> <mailman.2578.1370235776.3114.python-list@python.org> <8c16324f-da12-44ff-bf2f-4ae56f9127c0@googlegroups.com> <51ac3bd6$0$11118$c3e8da3@news.astraweb.com>
Date: Mon, 3 Jun 2013 17:36:39 +1000
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2585.1370245002.3114.python-list@python.org>
Lines: 34
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:46775

On Mon, Jun 3, 2013 at 4:46 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Then, when
> you try to read the file names in UTF-8, you hit an illegal byte, half of
> a surrogate pair perhaps, and everything blows up.

Minor quibble: Surrogates are an artifact of UTF-16, so they're 16-bit
values like 0xD808 or 0xDF45. Possibly what you're talking about here
is a continuation byte, which in UTF-8 is valid only after a lead
byte. For instance: 0xF0 0x92 0x8D 0x85 is valid UTF-8 (it encodes
U+12345, whose UTF-16 form is that surrogate pair), but 0x41 0x92 is
not, because 0x92 is a continuation byte with no lead byte before it.
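
A quick way to check both byte sequences, as a rough sketch (the
helper name is_valid_utf8 is made up for illustration, it's not a
standard function):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Lead byte 0xF0 followed by three continuation bytes.
valid = bytes([0xF0, 0x92, 0x8D, 0x85])
# ASCII 'A' followed by a continuation byte that has no lead byte.
stray = bytes([0x41, 0x92])

decoded = valid.decode('utf-8')
# Re-encoding as UTF-16 (big-endian, no BOM) exposes the surrogate pair.
utf16_units = decoded.encode('utf-16-be')
```

Here is_valid_utf8(valid) should come back True and
is_valid_utf8(stray) False, and utf16_units holds the two 16-bit
values from above.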

There is one other really annoying thing to deal with, and that's the
theoretical UTF-8 encoding of a UTF-16 surrogate. (I say "theoretical"
because strictly, these are invalid; surrogate code points aren't
Unicode scalar values, and UTF-8 encodes only scalar values.)
0xED 0xA0 0x88 and 0xED 0xBD 0x85 encode the two I mentioned above.
Depending on what's reading the filename, these may or may not throw
errors. Python's decoder is correctly strict:

>>> str(b'\xed\xa0\x88','utf-8')
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    str(b'\xed\xa0\x88','utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

Actually, I'm not sure here, but I think that error message may be
wrong, or at least unclear. It's perfectly possible to decode those
bytes using the UTF-8 algorithm; you end up with the value 0xD808,
which you then reject because it's a surrogate. But maybe the Python
UTF-8 decoder simplifies some of that.
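
To illustrate what I mean by "decode, then reject", here's a rough
sketch of the three-byte case done by hand. The function names
decode_three_byte and is_surrogate are made up for this example, and
this is not how CPython's decoder is actually written; it also skips
the overlong-encoding check a real decoder needs.

```python
def decode_three_byte(seq: bytes) -> int:
    """Apply the plain UTF-8 bit layout to a three-byte sequence.

    Illustrative only: no check for overlong forms or surrogates.
    """
    lead, c1, c2 = seq
    if lead & 0xF0 != 0xE0:
        raise ValueError('not a three-byte lead byte')
    if c1 & 0xC0 != 0x80 or c2 & 0xC0 != 0x80:
        raise ValueError('expected continuation bytes')
    # 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
    return ((lead & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F)

def is_surrogate(codepoint: int) -> bool:
    """True for the range reserved for UTF-16 surrogates."""
    return 0xD800 <= codepoint <= 0xDFFF

high = decode_three_byte(b'\xed\xa0\x88')
low = decode_three_byte(b'\xed\xbd\x85')

# The 'surrogatepass' error handler tells Python's own codec to let
# surrogates through instead of rejecting them.
lenient = b'\xed\xa0\x88'.decode('utf-8', 'surrogatepass')
```

Both values decode cleanly and only then get flagged by
is_surrogate, which is why "invalid continuation byte" reads oddly
to me. The continuation bytes themselves are well-formed.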

ChrisA