From: Nobody
Subject: Re: A few questiosn about encoding
Date: Thu, 13 Jun 2013 11:02:38 +0100
Newsgroups: comp.lang.python

> On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano wrote:
>> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
>> that's not UTF-8, that's UTF-8-plus-extra-codepoints.
>
> And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80", even
> though mathematically they would translate into U+0000 and U+D800
> respectively. The UTF-16 *mechanism* is limited to no more than Unicode
> has currently used, but I'm left wondering if that's actually the other
> way around - that Unicode planes were deemed to stop at the point where
> UTF-16 can't encode any more.

Indeed. 5-byte and 6-byte sequences were originally part of the UTF-8
specification, allowing for 31 bits. Later revisions of the standard
imposed the UTF-16 limit on Unicode as a whole.
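
For anyone who wants to check the numbers, here is a minimal Python 3 sketch. The helper names (payload_bits, is_valid_utf8, UTF16_MAX) are made up for this post and are not part of any library. It assumes the decoder behaves as CPython 3's strict "utf-8" codec does; Python 2's codec was more lenient about surrogates.

```python
def payload_bits(n_bytes):
    """Payload bits carried by an n-byte sequence in the original UTF-8 scheme."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # The lead byte has n one-bits, a zero bit, then (7 - n) payload bits;
    # each continuation byte (10xxxxxx) contributes 6 more.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


def is_valid_utf8(data):
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# Highest code point reachable with a UTF-16 surrogate pair:
# 1024 high surrogates x 1024 low surrogates, starting at U+10000.
UTF16_MAX = 0x10000 + 0x400 * 0x400 - 1

if __name__ == "__main__":
    for n in range(1, 7):
        print(n, "byte(s):", payload_bits(n), "bits")
    print("UTF-16 ceiling: U+%X" % UTF16_MAX)
    samples = [
        b"\xc0\x80",                  # overlong encoding of U+0000
        b"\xed\xa0\x80",              # surrogate U+D800
        b"\xf4\x8f\xbf\xbf",          # U+10FFFF, the last valid code point
        b"\xf4\x90\x80\x80",          # U+110000, one past the limit
        b"\xfd\xbf\xbf\xbf\xbf\xbf",  # old-style 6-byte sequence
    ]
    for raw in samples:
        print(raw, "accepted" if is_valid_utf8(raw) else "rejected")
```

Only the U+10FFFF sample should be accepted; the overlong form, the surrogate, the out-of-range 4-byte sequence and the 6-byte sequence should all be rejected.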