From: Neil Cerutti
Newsgroups: comp.lang.python
Subject: Re: Encoding of surrogate code points to UTF-8
Date: 8 Oct 2013 15:14:33 GMT
Organization: Norwich University

On 2013-10-08, Steven D'Aprano wrote:
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to
> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
> point. But I can't:
>
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0: surrogates not allowed
>
> Have I misunderstood? I think that Python is being too strict
> about rejecting surrogate code points. It should only reject
> lone surrogates, or invalid pairs, not valid pairs. Have I
> misunderstood the Unicode FAQs, or is this a bug in Python's
> handling of UTF-8?

From RFC 3629:

  The definition of UTF-8 prohibits encoding character numbers
  between U+D800 and U+DFFF, which are reserved for use with the
  UTF-16 encoding form (as surrogate pairs) and do not directly
  represent characters. When encoding in UTF-8 from UTF-16 data, it
  is necessary to first decode the UTF-16 data to obtain character
  numbers, which are then encoded in UTF-8 as described above.
  This contrasts with CESU-8 [CESU-8], which is a UTF-8-like
  encoding that is not meant for use on the Internet. CESU-8
  operates similarly to UTF-8 but encodes the UTF-16 code values
  (16-bit quantities) instead of the character number (code point).
  This leads to different results for character numbers above
  0xFFFF; the CESU-8 encoding of those characters is NOT valid
  UTF-8.

The Wikipedia article points out:

  Whether an actual application should [refuse to encode these
  character numbers] is debatable, as it makes it impossible to
  store invalid UTF-16 (that is, UTF-16 with unpaired surrogate
  halves) in a UTF-8 string. This is necessary to store unchecked
  UTF-16 such as Windows filenames as UTF-8. It is also
  incompatible with CESU encoding (described below).

So Python's interpretation is conformant, though not without some
disadvantages.

In any case, '\ud800\udc01' isn't a valid Unicode string. In a
perfect world it would automatically get converted to
'\U00010001' without intervention.

-- 
Neil Cerutti
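
The procedure in the RFC quote (decode the UTF-16 data first, then
encode the resulting character number as UTF-8) can be sketched in
Python roughly as follows. This is only an illustration, not part of
the original exchange: it assumes Python 3.4 or later, where the
'surrogatepass' error handler is accepted by the UTF-16 codecs, and
the variable names are made up for the example.

```python
# Sketch only; assumes Python 3.4+ ('surrogatepass' with utf-16 codecs).

s = '\ud800\udc01'  # two surrogate code points, not one character

# Strict UTF-8 refuses surrogate code points, as RFC 3629 requires.
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Write the surrogates out as raw UTF-16 code units, then decode that
# data properly so the pair is combined into a single code point.
c = s.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')

# The real code point encodes to UTF-8 without complaint.
utf8 = c.encode('utf-8')                        # b'\xf0\x90\x80\x81'

# Forcing the surrogates through instead gives the CESU-8-style byte
# sequence, which is not valid UTF-8.
cesu_like = s.encode('utf-8', 'surrogatepass')  # b'\xed\xa0\x80\xed\xb0\x81'

print(strict_ok, ascii(c), utf8, cesu_like)
```

The round trip through UTF-16 is the manual version of the automatic
conversion wished for above; the six-byte result on the last
assignment is what a strict UTF-8 decoder would reject.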