From: Chris Angelico
Cc: "python-list@python.org"
Newsgroups: comp.lang.python
Subject: Re: Newbie question about text encoding
Date: Sun, 8 Mar 2015 19:23:37 +1100

On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa wrote:
> Chris Angelico:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character. It is quite
>> correct to throw this error.
>
> '\udd00' is a valid str object:
>
>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'
>
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.

But it's not a valid Unicode string, so a Unicode encoding can't be
expected to cope with it. Mathematically, 0xC0 0x80 would represent
U+0000, and some UTF-8 codecs generate and accept this (in order to
allow U+0000 without ever yielding 0x00), but that doesn't mean that
UTF-8 should allow that byte sequence.

The only reason to craft some kind of Unicode string for any arbitrary
sequence of bytes is the "smuggling" effect used for file name
handling. There is no reason to support invalid Unicode codepoints.

ChrisA
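
To make the points above concrete, here is a minimal sketch, assuming
CPython 3 and its built-in codecs. The helper names (encodes_strictly,
decodes_strictly) are hypothetical and only for illustration. It checks
that strict UTF-8 refuses both the lone surrogate and the overlong
0xC0 0x80 form, and that the "surrogateescape" error handler is what
provides the smuggling round trip for arbitrary bytes.

```python
def encodes_strictly(text):
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def decodes_strictly(data):
    """Return True if data is well-formed strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lone surrogate is a valid str object but not valid Unicode text,
# so the strict UTF-8 encoder refuses it.
lone_surrogate_ok = encodes_strictly("\udd00")

# 0xC0 0x80 is the overlong encoding of U+0000; the strict decoder
# rejects it, while the one-byte form 0x00 is accepted.
overlong_nul_ok = decodes_strictly(b"\xc0\x80")
plain_nul_ok = decodes_strictly(b"\x00")

# "Smuggling": with errors="surrogateescape", each undecodable byte
# 0x80..0xFF is mapped to U+DC80..U+DCFF and mapped back on encoding.
raw = bytes(range(256))
smuggled = raw.decode("utf-8", errors="surrogateescape")
round_trip = smuggled.encode("utf-8", errors="surrogateescape")

print(lone_surrogate_ok, overlong_nul_ok, plain_nul_ok)
print(round_trip == raw)
```

The round trip only holds when the same error handler is used in both
directions; a plain .encode("utf-8") on the smuggled string fails just
like the lone surrogate does.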