Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #56431

Re: Encoding of surrogate code points to UTF-8

Date 2013-10-08 18:00 +0100
From MRAB <python@mrabarnett.plus.com>
Subject Re: Encoding of surrogate code points to UTF-8
References <52540e03$0$29984$c3e8da3$5496439d@news.astraweb.com> <86mwmjlraq.fsf@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.867.1381251660.18130.python-list@python.org> (permalink)

Show all headers | View raw


On 08/10/2013 16:23, Pete Forman wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
> [snip]
>> py> s = '\ud800\udc01'
>> py> s.encode('utf-8')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
>> position 0: surrogates not allowed
>>
>>
>> Have I misunderstood? I think that Python is being too strict about
>> rejecting surrogate code points. It should only reject lone surrogates,
>> or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
>> or is this a bug in Python's handling of UTF-8?
>
> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
>
> D75 Surrogate pair: A representation for a single abstract character
>    that consists of a sequence of two 16-bit code units, where the first
>    value of the pair is a high-surrogate code unit and the second value
>    is a low-surrogate code unit.
>
> * Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
>    EncodingForms.)
>
> * Isolated surrogate code units have no interpretation on their own.
>    Certain other isolated code units in other encoding forms also have no
>    interpretation on their own. For example, the isolated byte [\x80] has
>    no interpretation in UTF-8; it can be used only as part of a multibyte
>    sequence. (See Table 3-7). It could be argued that this line by itself
>    should raise an error.
>
>
> That first bullet indicates that it is indeed illegal to use surrogate
> pairs in UTF-8 or UTF-32.
>
The only time you should get a surrogate pair in a Unicode string is in
a narrow build, which doesn't exist in Python 3.3 and later.

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 13:52 +0000
  Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-08 15:14 +0000
    Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-08 15:54 +0000
    Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 22:30 +0000
      Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 21:28 -0400
        Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve@pearwood.info> - 2013-10-09 06:20 +0000
          Re: Encoding of surrogate code points to UTF-8 wxjmfauth@gmail.com - 2013-10-09 01:22 -0700
            Re: Encoding of surrogate code points to UTF-8 Ned Batchelder <ned@nedbatchelder.com> - 2013-10-09 06:22 -0400
              Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-09 12:55 +0000
  Re: Encoding of surrogate code points to UTF-8 Pete Forman <petef4+usenet@gmail.com> - 2013-10-08 16:23 +0100
    Re: Encoding of surrogate code points to UTF-8 MRAB <python@mrabarnett.plus.com> - 2013-10-08 18:00 +0100
      Re: Encoding of surrogate code points to UTF-8 wxjmfauth@gmail.com - 2013-10-08 11:24 -0700
      Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 22:20 +0000
  Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 17:47 -0400
  Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 18:17 -0400

csiph-web