Newsgroups: comp.lang.python
From: Chris Angelico
To: python-list@python.org
Date: Mon, 3 Jun 2013 17:36:39 +1000
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Mon, Jun 3, 2013 at 4:46 PM, Steven D'Aprano wrote:
> Then, when
> you try to read the file names in UTF-8, you hit an illegal byte, half of
> a surrogate pair perhaps, and everything blows up.

Minor quibble: Surrogates are an artifact of UTF-16, so they're 16-bit
values like 0xD808 or 0xDF45. Possibly what you're talking about here
is a continuation byte, which in UTF-8 is used only after a lead
byte. For instance: 0xF0 0x92 0x8D 0x85 is valid UTF-8, but 0x41 0x92
is not.

There is one other really annoying thing to deal with, and that's the
theoretical UTF-8 encoding of a UTF-16 surrogate. (I say "theoretical"
because strictly, these are invalid; UTF-8 does not encode invalid
codepoints.)
0xED 0xA0 0x88 and 0xED 0xBD 0x85 encode the two I mentioned above.
Depending on what's reading the filename, you might actually have
these throw errors, or maybe not. Python's decoder is correctly
strict:

>>> str(b'\xed\xa0\x88','utf-8')
Traceback (most recent call last):
  File "", line 1, in <module>
    str(b'\xed\xa0\x88','utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

Actually, I'm not sure here, but I think that error message may be
wrong, or at least unclear. It's perfectly possible to decode those
bytes using the UTF-8 algorithm; you end up with the value 0xD808,
which you then reject because it's a surrogate. But maybe the Python
UTF-8 decoder simplifies some of that.

ChrisA
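
For anyone who wants to poke at the byte sequences above, here is a
rough sketch, assuming Python 3. The helper names (utf8_byte_kind,
naive_decode3, strict_ok) are made up for illustration and are not
part of the standard library. naive_decode3 only applies the 3-byte
bit layout by hand, and is not a claim about how CPython's decoder is
actually implemented.

```python
def utf8_byte_kind(b):
    """Classify one byte value (0-255) by its structural role in UTF-8.

    Purely structural: it ignores the fact that some lead values
    (0xC0, 0xC1, 0xF5 and up) never occur in valid UTF-8.
    """
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b < 0xC0:
        return "continuation"   # 10xxxxxx
    if b < 0xE0:
        return "lead2"          # 110xxxxx
    if b < 0xF0:
        return "lead3"          # 1110xxxx
    if b < 0xF8:
        return "lead4"          # 11110xxx
    return "invalid"


def naive_decode3(data):
    """Apply the 3-byte UTF-8 bit layout with no validity checks."""
    b0, b1, b2 = data
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(cp):
    """True for code points in the UTF-16 surrogate range."""
    return 0xD800 <= cp <= 0xDFFF


def strict_ok(data):
    """True if Python's default (strict) UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0x41 is ASCII, so the 0x92 after it is a continuation byte with no lead.
print([utf8_byte_kind(b) for b in b"\x41\x92"])

# Decoding the two "theoretical" sequences by hand gives the surrogates.
for seq in (b"\xed\xa0\x88", b"\xed\xbd\x85"):
    cp = naive_decode3(seq)
    print(hex(cp), is_surrogate(cp), strict_ok(seq))
```

If the surrogate values really are wanted, the standard
"surrogatepass" error handler lets them through, e.g.
b'\xed\xa0\x88'.decode('utf-8', 'surrogatepass').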