Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #56415

Re: Encoding of surrogate code points to UTF-8

From Pete Forman <petef4+usenet@gmail.com>
Newsgroups comp.lang.python
Subject Re: Encoding of surrogate code points to UTF-8
Date 2013-10-08 16:23 +0100
Organization A noiseless patient Spider
Message-ID <86mwmjlraq.fsf@gmail.com> (permalink)
References <52540e03$0$29984$c3e8da3$5496439d@news.astraweb.com>

Show all headers | View raw


Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in 
> position 0: surrogates not allowed
>
>
> Have I misunderstood? I think that Python is being too strict about 
> rejecting surrogate code points. It should only reject lone surrogates, 
> or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs, 
> or is this a bug in Python's handling of UTF-8?

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

D75 Surrogate pair: A representation for a single abstract character
  that consists of a sequence of two 16-bit code units, where the first
  value of the pair is a high-surrogate code unit and the second value
  is a low-surrogate code unit.

* Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
  EncodingForms.)

* Isolated surrogate code units have no interpretation on their own.
  Certain other isolated code units in other encoding forms also have no
  interpretation on their own. For example, the isolated byte [\x80] has
  no interpretation in UTF-8; it can be used only as part of a multibyte
  sequence. (See Table 3-7). It could be argued that this line by itself
  should raise an error.


That first bullet indicates that it is indeed illegal to use surrogate
pairs in UTF-8 or UTF-32.
-- 
Pete Forman

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 13:52 +0000
  Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-08 15:14 +0000
    Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-08 15:54 +0000
    Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 22:30 +0000
      Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 21:28 -0400
        Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve@pearwood.info> - 2013-10-09 06:20 +0000
          Re: Encoding of surrogate code points to UTF-8 wxjmfauth@gmail.com - 2013-10-09 01:22 -0700
            Re: Encoding of surrogate code points to UTF-8 Ned Batchelder <ned@nedbatchelder.com> - 2013-10-09 06:22 -0400
              Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-09 12:55 +0000
  Re: Encoding of surrogate code points to UTF-8 Pete Forman <petef4+usenet@gmail.com> - 2013-10-08 16:23 +0100
    Re: Encoding of surrogate code points to UTF-8 MRAB <python@mrabarnett.plus.com> - 2013-10-08 18:00 +0100
      Re: Encoding of surrogate code points to UTF-8 wxjmfauth@gmail.com - 2013-10-08 11:24 -0700
      Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 22:20 +0000
  Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 17:47 -0400
  Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 18:17 -0400

csiph-web