Re: Encoding of surrogate code points to UTF-8

From	Neil Cerutti <neilc@norwich.edu>
Newsgroups	comp.lang.python
Subject	Re: Encoding of surrogate code points to UTF-8
Date	2013-10-08 15:14 +0000
Organization	Norwich University
Message-ID	<bbilqpF6ep5U1@mid.individual.net>
References	<52540e03$0$29984$c3e8da3$5496439d@news.astraweb.com>

On 2013-10-08, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to 
> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code 
> point. But I can't:
>
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in 
> position 0: surrogates not allowed
>
> Have I misunderstood? I think that Python is being too strict
> about rejecting surrogate code points. It should only reject
> lone surrogates, or invalid pairs, not valid pairs. Have I
> misunderstood the Unicode FAQs, or is this a bug in Python's
> handling of UTF-8?

From RFC 3629:

  The definition of UTF-8 prohibits encoding character numbers
  between U+D800 and U+DFFF, which are reserved for use with the
  UTF-16 encoding form (as surrogate pairs) and do not directly
  represent characters.  When encoding in UTF-8 from UTF-16 data,
  it is necessary to first decode the UTF-16 data to obtain
  character numbers, which are then encoded in UTF-8 as described
  above.  This contrasts with CESU-8 [CESU-8], which is a
  UTF-8-like encoding that is not meant for use on the Internet.
  CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code
  values (16-bit quantities) instead of the character number
  (code point).  This leads to different results for character
  numbers above 0xFFFF; the CESU-8 encoding of those characters
  is NOT valid UTF-8.
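
The two-step procedure the RFC describes (decode the UTF-16 data first, then encode the resulting character number) can be sketched in Python 3 as follows. This is only an illustration using the bytes from the quoted example; the variable names are my own.

```python
# The UTF-16BE bytes from the quoted example: one high/low surrogate pair.
utf16_bytes = b'\xd8\x00\xdc\x01'

# Step 1: decode the UTF-16 data to obtain the character number (code point).
text = utf16_bytes.decode('utf-16-be')
code_point = ord(text)  # the pair decodes to a single character

# Step 2: encode that code point in UTF-8.
utf8_bytes = text.encode('utf-8')

print(hex(code_point))  # 0x10001
print(utf8_bytes)       # b'\xf0\x90\x80\x81'
```

The surrogate values themselves never reach the UTF-8 encoder; only the decoded code point does.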

The Wikipedia article points out:

  Whether an actual application should [refuse to encode these
  character numbers] is debatable, as it makes it impossible to
  store invalid UTF-16 (that is, UTF-16 with unpaired surrogate
  halves) in a UTF-8 string. This is necessary to store unchecked
  UTF-16 such as Windows filenames as UTF-8. It is also
  incompatible with CESU encoding (described below).
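
For the case the article mentions, Python 3 does offer an opt-in escape hatch: the 'surrogatepass' error handler. A minimal sketch, assuming Python 3.1 or later (the variable names are just for illustration):

```python
lone = '\ud800'  # an unpaired high surrogate

# The strict (default) handler refuses to encode it.
try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes the surrogate as if it were an ordinary code point.
passed = lone.encode('utf-8', 'surrogatepass')
print(passed)  # b'\xed\xa0\x80'

# The same handler is needed to read the bytes back.
round_trip = passed.decode('utf-8', 'surrogatepass')

# Applied to a full pair, each surrogate is encoded separately (six bytes),
# which is the CESU-8-style result rather than valid UTF-8.
pair_bytes = '\ud800\udc01'.encode('utf-8', 'surrogatepass')
print(pair_bytes)  # b'\xed\xa0\x80\xed\xb0\x81'
```

The output of 'surrogatepass' is not valid UTF-8, so a strict decoder will reject it.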

So Python's interpretation is conformant, though not without some
disadvantages.

In any case, "\ud800\udc01" isn't a valid Unicode string. In a
perfect world it would automatically get converted to
'\U00010001' without intervention.
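
Doing that conversion by hand is not hard. Here is a rough sketch of two ways to go about it; combine_surrogates is a hypothetical helper of my own, not anything in the standard library, and the codec round trip assumes Python 3.4 or later, where 'surrogatepass' works with the UTF-16 codecs.

```python
def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into one code point (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

s = '\ud800\udc01'

# Option 1: do the UTF-16 arithmetic directly.
fixed = chr(combine_surrogates(ord(s[0]), ord(s[1])))

# Option 2: round-trip through UTF-16, letting the codec pair them up.
via_codec = s.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')

print(fixed == via_codec == '\U00010001')  # True
print(fixed.encode('utf-8'))               # b'\xf0\x90\x80\x81'
```

Once the pair has been combined into a single code point, the ordinary strict UTF-8 encoder accepts it.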

-- 
Neil Cerutti
