
Encoding of surrogate code points to UTF-8

Started by	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post	2013-10-08 13:52 +0000
Last post	2013-10-08 18:17 -0400
Articles	15 — 8 participants


  Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 13:52 +0000
    Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-08 15:14 +0000
      Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-08 15:54 +0000
      Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 22:30 +0000
        Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 21:28 -0400
          Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve@pearwood.info> - 2013-10-09 06:20 +0000
            Re: Encoding of surrogate code points to UTF-8 wxjmfauth@gmail.com - 2013-10-09 01:22 -0700
              Re: Encoding of surrogate code points to UTF-8 Ned Batchelder <ned@nedbatchelder.com> - 2013-10-09 06:22 -0400
                Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-09 12:55 +0000
    Re: Encoding of surrogate code points to UTF-8 Pete Forman <petef4+usenet@gmail.com> - 2013-10-08 16:23 +0100
      Re: Encoding of surrogate code points to UTF-8 MRAB <python@mrabarnett.plus.com> - 2013-10-08 18:00 +0100
        Re: Encoding of surrogate code points to UTF-8 wxjmfauth@gmail.com - 2013-10-08 11:24 -0700
        Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 22:20 +0000
    Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 17:47 -0400
    Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 18:17 -0400

#56397 — Encoding of surrogate code points to UTF-8

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-10-08 13:52 +0000
Subject	Encoding of surrogate code points to UTF-8
Message-ID	<52540e03$0$29984$c3e8da3$5496439d@news.astraweb.com>

I think this is a bug in Python's UTF-8 handling, but I'm not sure.

If I've read the Unicode FAQs correctly, you cannot encode *lone* 
surrogate code points into UTF-8:

http://www.unicode.org/faq/utf_bom.html#utf8-5

Sure enough, using Python 3.3:

py> surr = '\udc80'
py> surr.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in 
position 0: surrogates not allowed


But reading the previous entry in the FAQs:

http://www.unicode.org/faq/utf_bom.html#utf8-4

I interpret this as meaning that I should be able to encode valid pairs 
of surrogates. So if I find a code point that encodes to a surrogate pair 
in UTF-16:

py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'


and then use those same values as the code points, I ought to be able to 
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code 
point. But I can't:


py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in 
position 0: surrogates not allowed


Have I misunderstood? I think that Python is being too strict about 
rejecting surrogate code points. It should only reject lone surrogates, 
or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs, 
or is this a bug in Python's handling of UTF-8?



-- 
Steven


#56414

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-10-08 15:14 +0000
Message-ID	<bbilqpF6ep5U1@mid.individual.net>
In reply to	#56397

On 2013-10-08, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to 
> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code 
> point. But I can't:
>
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in 
> position 0: surrogates not allowed
>
> Have I misunderstood? I think that Python is being too strict
> about rejecting surrogate code points. It should only reject
> lone surrogates, or invalid pairs, not valid pairs. Have I
> misunderstood the Unicode FAQs, or is this a bug in Python's
> handling of UTF-8?

From RFC 3629:

  The definition of UTF-8 prohibits encoding character numbers
  between U+D800 and U+DFFF, which are reserved for use with the
  UTF-16 encoding form (as surrogate pairs) and do not directly
  represent characters.  When encoding in UTF-8 from UTF-16 data,
  it is necessary to first decode the UTF-16 data to obtain
  character numbers, which are then encoded in UTF-8 as described
  above.  This contrasts with CESU-8 [CESU-8], which is a
  UTF-8-like encoding that is not meant for use on the Internet.
  CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code
  values (16-bit quantities) instead of the character number
  (code point).  This leads to different results for character
  numbers above 0xFFFF; the CESU-8 encoding of those characters
  is NOT valid UTF-8.

The Wikipedia article points out:

  Whether an actual application should [refuse to encode these
  character numbers] is debatable, as it makes it impossible to
  store invalid UTF-16 (that is, UTF-16 with unpaired surrogate
  halves) in a UTF-8 string. This is necessary to store unchecked
  UTF-16 such as Windows filenames as UTF-8. It is also
  incompatible with CESU encoding (described below).

So Python's interpretation is conformant, though not without some
disadvantages.
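
As a minimal sketch of the procedure RFC 3629 describes (decode the UTF-16 data first, then encode the resulting character number), the following uses only the built-in codecs. The variable names are illustrative and not taken from the original posts:

```python
# UTF-16BE bytes for U+10001 (LINEAR B SYLLABLE B038 E), as printed earlier
# in the thread.
utf16_bytes = b'\xd8\x00\xdc\x01'

# Step 1: decode the UTF-16 data to obtain the character number (code point).
text = utf16_bytes.decode('utf-16-be')

# Step 2: encode that single code point as UTF-8.
utf8_bytes = text.encode('utf-8')

print(len(text), hex(ord(text)))  # 1 0x10001
print(utf8_bytes)                 # b'\xf0\x90\x80\x81'
```

The result is one 4-byte UTF-8 sequence for U+10001, not two 3-byte sequences for the individual surrogates.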

In any case, "\ud800\udc01" isn't a valid unicode string. In a
perfect world it would automatically get converted to
'\u00010001' without intervention.

-- 
Neil Cerutti


#56422

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-10-08 15:54 +0000
Message-ID	<bbio5mF6u6rU1@mid.individual.net>
In reply to	#56414

On 2013-10-08, Neil Cerutti <neilc@norwich.edu> wrote:
> In any case, "\ud800\udc01" isn't a valid unicode string. In a
> perfect world it would automatically get converted to
> '\u00010001' without intervention.

This last paragraph is erroneous. I must have had a typo in my
testing.

-- 
Neil Cerutti


#56450

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-10-08 22:30 +0000
Message-ID	<52548791$0$29984$c3e8da3$5496439d@news.astraweb.com>
In reply to	#56414

On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:

> In any case, "\ud800\udc01" isn't a valid unicode string. 

I don't think this is correct. Can you show me where the standard says 
that Unicode strings[1] may not contain surrogates? I think that is a 
critical point, and the FAQ conflates *encoded strings* (i.e. bytes using 
one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code 
points, the surrogates U+D800 U+DC01, which as far as I am concerned is a 
legal string (subject to somebody pointing me to a definitive source that 
proves it is not). However, it *may or may not* be encodable to bytes 
using UTF-8, -16 or -32.

Just as there are byte sequences that cannot be generated by the UTFs, 
possibly there are code point sequences that cannot be converted to bytes 
using the UTFs.

> In a perfect
> world it would automatically get converted to '\u00010001' without
> intervention.

I certainly hope not, because Unicode string != UTF-16. This is 
equivalent to saying:

When encoding the sequence of code points '\ud800\udc01' to UTF-8 bytes, 
you should get the same result as if you treated the sequence of code 
points as if it were bytes, decoded it using UTF-16, and then encoded 
using UTF-8.

That would be a horrible, horrible design, since it privileges UTF-16 in 
a completely inappropriate way. I *really* hope I am wrong, but I fear 
that is my interpretation of the FAQ.

[1] Sequences of Unicode code points.

-- 
Steven


#56461

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-10-08 21:28 -0400
Message-ID	<mailman.883.1381282120.18130.python-list@python.org>
In reply to	#56450

On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
> On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
>
>> In any case, "\ud800\udc01" isn't a valid unicode string.
>
> I don't think this is correct. Can you show me where the standard says
> that Unicode strings[1] may not contain surrogates? I think that is a

see below.

> critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
> one of the UTFs) with *Unicode strings*.
>
> The string you give above is a Unicode string containing two code
> points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
> legal string (subject to somebody pointing me to a definitive source that
> proves it is not). However, it *may or may not* be encodable to bytes
> using UTF-8, -16 or -32.

From chapter two of the standard:

"Plain text is a pure sequence of character codes; plain Unicode-encoded 
text is therefore a sequence of Unicode character codes."

http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708
"All three encoding forms can be used to represent the full range of 
encoded characters in the Unicode Standard; ... Each of the three 
Unicode encoding forms can be efficiently transformed into either of the 
other two without any loss of data."

"Surrogates Area. The Surrogates Area contains only surrogate code 
points and no encoded characters. See Section 16.6, Surrogates Area, for 
more detail."

Before utf-16, the surrogates area was, I believe, part of the Private 
Use Area (which now starts where surrogates end). I think it would have 
been better if they were no longer called code points, but simply utf-16 
code units.

> Just as there are byte sequences that cannot be generated by the UTFs,
> possibly there are code point sequences that cannot be converted to bytes
> using the UTFs.

True, but not to the point. You switched from sequences of characters 
(unicode text), which is what both I and Neil are talking about, to 
sequences of codepoints which is a larger set when you include the 
non-character surrogate 'code points' that are not allowed in unicode text.

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404

"The Unicode Standard supports three character encoding forms: UTF-32, 
UTF-16, and UTF-8. Each encoding form maps the Unicode code points 
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

> [1] Sequences of Unicode code points.

This is not the Standard's definition of 'unicode text'. It is also not 
its definition of 'unicode string'.

"D80 Unicode string: A code unit sequence containing code units of a 
particular Unicode encoding form."

In other words, a Unicode string is a utf encoding of unicode text. The 
FSR (the flexible string representation introduced in Python 3.3 by 
PEP 393) adaptively uses a subset of possible sequences from all three, 
though only one utf is used for any particular string.

--
D79 says what I claimed before: "The mapping of the set of Unicode 
scalar values to the set of code unit sequences for a Unicode encoding 
form is one-to-one."

-- 
Terry Jan Reedy


#56471

From	Steven D'Aprano <steve@pearwood.info>
Date	2013-10-09 06:20 +0000
Message-ID	<5254f594$0$29976$c3e8da3$5496439d@news.astraweb.com>
In reply to	#56461

On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote:

> On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
>> On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
>>
>>> In any case, "\ud800\udc01" isn't a valid unicode string.
>>
>> I don't think this is correct. Can you show me where the standard says
>> that Unicode strings[1] may not contain surrogates? I think that is a
> 
> see below.
> 
>> critical point, and the FAQ conflates *encoded strings* (i.e. bytes
>> using one of the UTFs) with *Unicode strings*.
>>
>> The string you give above is a Unicode string containing two code
>> points, the surrogates U+D800 U+DC01, which as far as I am concerned is
>> a legal string (subject to somebody pointing me to a definitive source
>> that proves it is not). However, it *may or may not* be encodable to
>> bytes using UTF-8, -16 or -32.
> 
> From chapter two of the standard.
> 
> "Plain text is a pure sequence of character codes; plain Unicode-encoded
> text is therefore a sequence of Unicode character codes."

Also there are many valid non-characters in Unicode, including 66 
explicitly defined non-characters, plus the many surrogates. So defining 
Unicode strings in terms of characters is less than helpful, since it 
excludes a whole bunch of strings which aren't "text" since they include 
non-characters.

Also, "character" in the context of Unicode is ambiguous, due to 
normalization and decomposition: a single character can have up to four 
distinct forms.

http://www.macchiato.com/unicode/nfc-faq

*Code points* are rigorously defined, not characters, which is why I have 
tried very hard to only refer to code points and bytes, not characters.

> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
> encoding forms can be used to represent the full range of encoded
> characters in the Unicode Standard; ... Each of the three Unicode
> encoding forms can be efficiently transformed into either of the other
> two without any loss of data."

This merely says "encodings encode characters". We know that encodings 
can also encode non-characters, at least *some* non-characters. The 
question is, can they encode surrogates?

> "Surrogates Area. The Surrogates Area contains only surrogate code
> points and no encoded characters. See Section 16.6, Surrogates Area, for
> more detail."
> 
> Before utf-16, the surrogates area was, I believe, part of the Private
> Use Area (which now starts where surrogates end). I think it would have
> been better if they were no longer called code points, but simply utf-16
> code units.

Private Use is irrelevant, since strings certainly can contain Private 
Use code-points, and UTF encodings can encode them.

>> Just as there are byte sequences that cannot be generated by the UTFs,
>> possibly there are code point sequences that cannot be converted to
>> bytes using the UTFs.
> 
> True, but not to the point. You switched from sequences of characters
> (unicode text), which is what both I and Neil are talking about, to
> sequences of codepoints which is a larger set when you include the
> non-character surrogate 'code points' that are not allowed in unicode
> text.

I never mentioned sequences of characters. I've always talked about code 
points.

> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404
> 
> "The Unicode Standard supports three character encoding forms: UTF-32,
> UTF-16, and UTF-8. Each encoding form maps the Unicode code points
> U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

Ah! Now we're getting somewhere! I think you've hit the nail on the head: 
the three UTF forms explicitly exclude the surrogates. So I think we now 
have an answer:

Surrogate code points can exist in Unicode strings, but cannot be encoded 
to bytes using the standard UTF-8, UTF-16 and UTF-32 encodings.

There may be other encodings, or error handlers, which are capable of 
handling surrogates, but they aren't UTF-8. So I think this answers my 
question. (I reserve the right to change my mind after reading more of 
the standard.)
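
One such error handler in CPython is 'surrogatepass'. The following is a short sketch, assuming Python 3.1 or later where that handler is available. The bytes it produces are not well-formed UTF-8 and are shown only to illustrate the distinction drawn above:

```python
s = '\ud800\udc01'  # two surrogate code points, not the single code point U+10001

# The strict UTF-8 codec refuses surrogate code points.
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes each surrogate as its own 3-byte sequence.
passed = s.encode('utf-8', 'surrogatepass')
print(passed)  # b'\xed\xa0\x80\xed\xb0\x81'

# Decoding those bytes needs the same handler in order to round-trip.
round_trip = passed.decode('utf-8', 'surrogatepass')
print(round_trip == s)  # True
```

The six bytes differ from the four bytes that U+10001 itself encodes to, so the two strings remain distinguishable after encoding.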

Thank you to everyone who replied.

-- 
Steven


#56478

From	wxjmfauth@gmail.com
Date	2013-10-09 01:22 -0700
Message-ID	<4b728b1a-cc37-4541-80a2-68335f1d5e5f@googlegroups.com>
In reply to	#56471

On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
> > http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
> > encoding forms can be used to represent the full range of encoded
> > characters in the Unicode Standard; ... Each of the three Unicode
> > encoding forms can be efficiently transformed into either of the other
> > two without any loss of data."

Yes, 

and what Unicode.org does not say is that these coding
schemes (like any coding scheme) should be used in an
exclusive way.

Probably, because it is too obvious to understand.

jmf


#56482

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-10-09 06:22 -0400
Message-ID	<mailman.892.1381314149.18130.python-list@python.org>
In reply to	#56478

On 10/9/13 4:22 AM, wxjmfauth@gmail.com wrote:
> On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
>>
>>> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
>>> encoding forms can be used to represent the full range of encoded
>>> characters in the Unicode Standard; ... Each of the three Unicode
>>> encoding forms can be efficiently transformed into either of the other
>>> two without any loss of data."
>>
> Yes,
>
> and what Unicode.org does not say is that these coding
> schemes (like any coding scheme) should be used in an
> exclusive way.

Can you clarify what you mean by "in an exclusive way"?

--Ned.

> Probably, because it is too obvious to understand.
>
> jmf
>
>


#56484

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-10-09 12:55 +0000
Message-ID	<bbl22jFli8pU1@mid.individual.net>
In reply to	#56482

On 2013-10-09, Ned Batchelder <ned@nedbatchelder.com> wrote:
> On 10/9/13 4:22 AM, wxjmfauth@gmail.com wrote:
>> and what Unicode.org does not say is that these coding schemes
>> (like any coding scheme) should be used in an exclusive way.
>
> Can you clarify what you mean by "in an exclusive way"?

Ned, pay no attention to the person walloping that dead horse.

-- 
Neil Cerutti


#56415

From	Pete Forman <petef4+usenet@gmail.com>
Date	2013-10-08 16:23 +0100
Message-ID	<86mwmjlraq.fsf@gmail.com>
In reply to	#56397

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in 
> position 0: surrogates not allowed
>
>
> Have I misunderstood? I think that Python is being too strict about 
> rejecting surrogate code points. It should only reject lone surrogates, 
> or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs, 
> or is this a bug in Python's handling of UTF-8?

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

D75 Surrogate pair: A representation for a single abstract character
  that consists of a sequence of two 16-bit code units, where the first
  value of the pair is a high-surrogate code unit and the second value
  is a low-surrogate code unit.

* Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
  Encoding Forms.)

* Isolated surrogate code units have no interpretation on their own.
  Certain other isolated code units in other encoding forms also have no
  interpretation on their own. For example, the isolated byte [\x80] has
  no interpretation in UTF-8; it can be used only as part of a multibyte
  sequence. (See Table 3-7). It could be argued that this line by itself
  should raise an error.

That first bullet indicates that it is indeed illegal to use surrogate
pairs in UTF-8 or UTF-32.
-- 
Pete Forman


#56431

From	MRAB <python@mrabarnett.plus.com>
Date	2013-10-08 18:00 +0100
Message-ID	<mailman.867.1381251660.18130.python-list@python.org>
In reply to	#56415

On 08/10/2013 16:23, Pete Forman wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
> [snip]
>> py> s = '\ud800\udc01'
>> py> s.encode('utf-8')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
>> position 0: surrogates not allowed
>>
>>
>> Have I misunderstood? I think that Python is being too strict about
>> rejecting surrogate code points. It should only reject lone surrogates,
>> or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
>> or is this a bug in Python's handling of UTF-8?
>
> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
>
> D75 Surrogate pair: A representation for a single abstract character
>    that consists of a sequence of two 16-bit code units, where the first
>    value of the pair is a high-surrogate code unit and the second value
>    is a low-surrogate code unit.
>
> * Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
>    Encoding Forms.)
>
> * Isolated surrogate code units have no interpretation on their own.
>    Certain other isolated code units in other encoding forms also have no
>    interpretation on their own. For example, the isolated byte [\x80] has
>    no interpretation in UTF-8; it can be used only as part of a multibyte
>    sequence. (See Table 3-7). It could be argued that this line by itself
>    should raise an error.
>
>
> That first bullet indicates that it is indeed illegal to use surrogate
> pairs in UTF-8 or UTF-32.
>
The only time you should get a surrogate pair in a Unicode string is in
a narrow build, which doesn't exist in Python 3.3 and later.


#56438

From	wxjmfauth@gmail.com
Date	2013-10-08 11:24 -0700
Message-ID	<f292b9d8-3d63-4848-ba3e-48839b0071e4@googlegroups.com>
In reply to	#56431

--------

>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: 
surrogates not allowed
>>> '\ud800'.encode('utf-32-be')
b'\x00\x00\xd8\x00'
>>> '\ud800'.encode('utf-32-le')
b'\x00\xd8\x00\x00'
>>> '\ud800'.encode('utf-32')
b'\xff\xfe\x00\x00\x00\xd8\x00\x00'
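
Note that the UTF-32 results above are specific to Python 3.3. The codecs documentation records that since Python 3.4 the utf-16* and utf-32* encoders no longer allow surrogate code points to be encoded. A small check that reports what the running interpreter does (the helper name `can_encode` is made up for illustration):

```python
def can_encode(text, encoding):
    """Return True if text encodes under the default 'strict' error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Report, per codec, whether a lone surrogate is accepted by this interpreter.
for name in ('utf-8', 'utf-16-be', 'utf-32-be'):
    print(name, can_encode('\ud800', name))
```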


jmf


#56448

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-10-08 22:20 +0000
Message-ID	<52548549$0$29984$c3e8da3$5496439d@news.astraweb.com>
In reply to	#56431

On Tue, 08 Oct 2013 18:00:58 +0100, MRAB wrote:

> The only time you should get a surrogate pair in a Unicode string is in
> a narrow build, which doesn't exist in Python 3.3 and later.

Incorrect.

py> sys.version
'3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat 
4.1.2-52)]'
py> s = '\ud800\udc01'
py> print(len(s))
2
py> import unicodedata as ud
py> for c in s:
...     print(ud.category(c))
...
Cs
Cs

s is a string containing two code points making up a surrogate pair.
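
The same check can be wrapped in a small helper. This is only a sketch, and the name `surrogate_positions` is made up for illustration:

```python
import unicodedata

def surrogate_positions(s):
    """Return the indexes of code points whose general category is Cs."""
    return [i for i, ch in enumerate(s) if unicodedata.category(ch) == 'Cs']

print(surrogate_positions('\ud800\udc01'))  # [0, 1]
print(surrogate_positions('\U00010001'))    # []
```

The second call shows that the supplementary code point U+10001 is a single code point and contains no surrogates, even though it is stored as a surrogate pair in UTF-16.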

It is very frustrating that the Unicode FAQs don't always clearly 
distinguish between when they are talking about bytes and when they are 
talking about code points. This area about surrogates is one of the 
places where they conflate the two.

-- 
Steven


#56446

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-10-08 17:47 -0400
Message-ID	<mailman.876.1381268848.18130.python-list@python.org>
In reply to	#56397

On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
>
> If I've read the Unicode FAQs correctly, you cannot encode *lone*
> surrogate code points into UTF-8:
>
> http://www.unicode.org/faq/utf_bom.html#utf8-5
>
> Sure enough, using Python 3.3:
>
> py> surr = '\udc80'

I am pretty sure that if Python were being strict, that would raise an 
error, as the result is not a valid unicode string. Allowing the above 
or not was debated and laxness was allowed for at least the following 
practical reasons.

1. Python itself uses the invalid surrogate codepoints for 
surrogateescape error-handling.
http://www.python.org/dev/peps/pep-0383/

2. Invalid strings are needed for tests ;-)
-- like the one you do next.

3. Invalid strings may be needed for interfacing with other C APIs.

> py> surr.encode('utf-8')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
> position 0: surrogates not allowed

Default strict encoding (utf-8 or otherwise) will only encode valid 
unicode strings. Encode invalid strings with surrogate codepoints with 
surrogateescape error handling.
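
A minimal sketch of the surrogateescape handler from PEP 383, assuming the undecodable input is the single byte 0x80 (chosen here so that it maps to the '\udc80' used above):

```python
raw = b'\x80'  # not valid UTF-8 on its own

# Decoding smuggles the bad byte into the string as the lone surrogate U+DC80.
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))  # '\udc80'

# Encoding with the same handler restores the original byte.
restored = text.encode('utf-8', 'surrogateescape')
print(restored == raw)  # True

# The handler only covers U+DC80..U+DCFF; other surrogates still fail.
try:
    '\ud800'.encode('utf-8', 'surrogateescape')
    high_surrogate_ok = True
except UnicodeEncodeError:
    high_surrogate_ok = False
print(high_surrogate_ok)  # False
```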

> But reading the previous entry in the FAQs:
>
> http://www.unicode.org/faq/utf_bom.html#utf8-4
>
> I interpret this as meaning that I should be able to encode valid pairs
> of surrogates.

It says you should be able to 'convert' them, and that the result for 
utf-8 encoding must be a single 4-byte code for the corresponding 
supplementary codepoint.

> So if I find a code point that encodes to a surrogate pair
> in UTF-16:
>
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to
> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
> point. But I can't:
>
> py> s = '\ud800\udc01'

This is now a string with two invalid codepoints instead of one ;-).
As above, it would be rejected if Python were being strict.

> py> s.encode('utf-8')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0: surrogates not allowed
>
>
> Have I misunderstood? I think that Python is being too strict about
> rejecting surrogate code points.

No, it is being too lax about allowing them at all.

I believe there is an issue on the tracker (maybe closed) about the doc 
for unicode escapes in string literals. Perhaps it should say more 
clearly that inserting surrogates is allowed but results in an invalid 
string that cannot be normally encoded.

-- 
Terry Jan Reedy


#56447

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-10-08 18:17 -0400
Message-ID	<mailman.877.1381270636.18130.python-list@python.org>
In reply to	#56397

On 10/8/2013 5:47 PM, Terry Reedy wrote:
> On 10/8/2013 9:52 AM, Steven D'Aprano wrote:

>> But reading the previous entry in the FAQs:
>>
>> http://www.unicode.org/faq/utf_bom.html#utf8-4
>>
>> I interpret this as meaning that I should be able to encode valid pairs
>> of surrogates.
>
> It says you should be able to 'convert' them, and that the result for
> utf-8 encoding must be a single 4-byte code for the corresponding
> supplementary codepoint.

To expand on this: The FAQ question is "How do I convert a UTF-16 
surrogate pair such as <D800 DC00> to UTF-8?" utf-16 and utf-8 are both 
byte (or double byte) encodings of codepoints. Direct conversion would 
be 'transcoding', not encoding. Python has a few bytes transcoders and 
one string transcoder (rot_13), listed at the end of
http://docs.python.org/3/library/codecs.html#python-specific-encodings
But in general, one must decode bytes to string and encode back to bytes.

>> So if I find a code point that encodes to a surrogate pair
>> in UTF-16:
>>
>> py> c = '\N{LINEAR B SYLLABLE B038 E}'
>> py> surr_pair = c.encode('utf-16be')
>> py> print(surr_pair)
>> b'\xd8\x00\xdc\x01'
>>
>> and then use those same values as the code points, I ought to be able to
>> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
>> point.

I believe the utf encodings are defined as 1 to 1. If the above worked, 
utf-8 would not be.
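
To make the one-to-one point concrete, here is a sketch of the UTF-16 surrogate arithmetic. The function name `combine_surrogates` is illustrative; the formula is the standard one for UTF-16 surrogate pairs:

```python
def combine_surrogates(high, low):
    """Map a (high, low) surrogate pair to the supplementary code point it denotes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid high/low surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = '\ud800\udc01'                 # two code points
scalar = chr(combine_surrogates(ord(pair[0]), ord(pair[1])))

print(hex(ord(scalar)))               # 0x10001
print(pair == scalar)                 # False: different code point sequences
print(scalar.encode('utf-8'))         # b'\xf0\x90\x80\x81'
```

If the strict UTF-8 codec also mapped `pair` to those four bytes, two different code point sequences would share one encoding, which is the situation the one-to-one requirement rules out.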

-- 
Terry Jan Reedy

