Groups > comp.lang.python > #56397 > unrolled thread
| Started by | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| First post | 2013-10-08 13:52 +0000 |
| Last post | 2013-10-08 18:17 -0400 |
| Articles | 15 — 8 participants |
Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 13:52 +0000
Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-08 15:14 +0000
Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-08 15:54 +0000
Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 22:30 +0000
Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 21:28 -0400
Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve@pearwood.info> - 2013-10-09 06:20 +0000
Re: Encoding of surrogate code points to UTF-8 wxjmfauth@gmail.com - 2013-10-09 01:22 -0700
Re: Encoding of surrogate code points to UTF-8 Ned Batchelder <ned@nedbatchelder.com> - 2013-10-09 06:22 -0400
Re: Encoding of surrogate code points to UTF-8 Neil Cerutti <neilc@norwich.edu> - 2013-10-09 12:55 +0000
Re: Encoding of surrogate code points to UTF-8 Pete Forman <petef4+usenet@gmail.com> - 2013-10-08 16:23 +0100
Re: Encoding of surrogate code points to UTF-8 MRAB <python@mrabarnett.plus.com> - 2013-10-08 18:00 +0100
Re: Encoding of surrogate code points to UTF-8 wxjmfauth@gmail.com - 2013-10-08 11:24 -0700
Re: Encoding of surrogate code points to UTF-8 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-08 22:20 +0000
Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 17:47 -0400
Re: Encoding of surrogate code points to UTF-8 Terry Reedy <tjreedy@udel.edu> - 2013-10-08 18:17 -0400
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-10-08 13:52 +0000 |
| Subject | Encoding of surrogate code points to UTF-8 |
| Message-ID | <52540e03$0$29984$c3e8da3$5496439d@news.astraweb.com> |
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
py> surr = '\udc80'
py> surr.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
position 0: surrogates not allowed
But reading the previous entry in the FAQs:
http://www.unicode.org/faq/utf_bom.html#utf8-4
I interpret this as meaning that I should be able to encode valid pairs
of surrogates. So if I find a code point that encodes to a surrogate pair
in UTF-16:
py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'
and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
point. But I can't:
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points. It should only reject lone surrogates,
or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
or is this a bug in Python's handling of UTF-8?
--
Steven
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-10-08 15:14 +0000 |
| Message-ID | <bbilqpF6ep5U1@mid.individual.net> |
| In reply to | #56397 |
On 2013-10-08, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to
> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
> point. But I can't:
>
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0: surrogates not allowed
>
> Have I misunderstood? I think that Python is being too strict
> about rejecting surrogate code points. It should only reject
> lone surrogates, or invalid pairs, not valid pairs. Have I
> misunderstood the Unicode FAQs, or is this a bug in Python's
> handling of UTF-8?
From RFC 3629:
The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters. When encoding in UTF-8 from UTF-16 data,
it is necessary to first decode the UTF-16 data to obtain
character numbers, which are then encoded in UTF-8 as described
above. This contrasts with CESU-8 [CESU-8], which is a
UTF-8-like encoding that is not meant for use on the Internet.
CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code
values (16-bit quantities) instead of the character number
(code point). This leads to different results for character
numbers above 0xFFFF; the CESU-8 encoding of those characters
is NOT valid UTF-8.
The Wikipedia article points out:
Whether an actual application should [refuse to encode these
character numbers] is debatable, as it makes it impossible to
store invalid UTF-16 (that is, UTF-16 with unpaired surrogate
halves) in a UTF-8 string. This is necessary to store unchecked
UTF-16 such as Windows filenames as UTF-8. It is also
incompatible with CESU encoding (described below).
So Python's interpretation is conformant, though not without some
disadvantages.
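The distinction drawn in the RFC can be seen directly in Python 3. The following is a minimal sketch, not part of the original post: it compares the genuine UTF-8 encoding of U+10001, the strict rejection of the two surrogate code points, and the CESU-8-style bytes produced by the standard `surrogatepass` error handler. The variable names are illustrative only.

```python
# The supplementary character from the original post, as a single code point.
c = '\N{LINEAR B SYLLABLE B038 E}'   # U+10001
real_utf8 = c.encode('utf-8')        # one four-byte UTF-8 sequence

# The same value written as two surrogate code points.
s = '\ud800\udc01'

# Strict UTF-8 refuses surrogate code points, paired or not.
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes each surrogate as its own three-byte sequence.
# For a valid pair this matches what CESU-8 would produce; it is not valid UTF-8.
cesu_like = s.encode('utf-8', 'surrogatepass')

print(real_utf8)   # b'\xf0\x90\x80\x81'
print(strict_ok)   # False
print(cesu_like)   # b'\xed\xa0\x80\xed\xb0\x81'
```

The six-byte result is the kind of output the RFC describes as "NOT valid UTF-8".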
In any case, "\ud800\udc01" isn't a valid unicode string. In a
perfect world it would automatically get converted to
'\u00010001' without intervention.
--
Neil Cerutti
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-10-08 15:54 +0000 |
| Message-ID | <bbio5mF6u6rU1@mid.individual.net> |
| In reply to | #56414 |
On 2013-10-08, Neil Cerutti <neilc@norwich.edu> wrote:
> In any case, "\ud800\udc01" isn't a valid unicode string. In a
> perfect world it would automatically get converted to
> '\u00010001' without intervention.

This last paragraph is erroneous. I must have had a typo in my
testing.

--
Neil Cerutti
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-10-08 22:30 +0000 |
| Message-ID | <52548791$0$29984$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #56414 |
On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:

> In any case, "\ud800\udc01" isn't a valid unicode string.

I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a
critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code
points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
legal string (subject to somebody pointing me to a definitive source that
proves it is not). However, it *may or may not* be encodable to bytes
using UTF-8, -16 or -32.

Just as there are byte sequences that cannot be generated by the UTFs,
possibly there are code point sequences that cannot be converted to bytes
using the UTFs.

> In a perfect
> world it would automatically get converted to '\u00010001' without
> intervention.

I certainly hope not, because Unicode string != UTF-16. This is
equivalent to saying:

    When encoding the sequence of code points '\ud800\udc01' to UTF-8
    bytes, you should get the same result as if you treated the sequence
    of code points as if it were bytes, decoded it using UTF-16, and then
    encoded using UTF-8.

That would be a horrible, horrible design, since it privileges UTF-16 in
a completely inappropriate way. I *really* hope I am wrong, but I fear
that is my interpretation of the FAQ.

[1] Sequences of Unicode code points.

--
Steven
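Python does not perform that reinterpretation implicitly, but it can be requested explicitly. The sketch below is an illustration rather than anything proposed in the thread: `combine_surrogates` is a hypothetical helper name, and it assumes Python 3.4 or later, where the `surrogatepass` error handler is accepted by the UTF-16 codecs.

```python
def combine_surrogates(s):
    """Reinterpret surrogate code points in s as UTF-16 code units.

    Valid high/low pairs become single supplementary code points.
    A lone surrogate makes the decode step raise UnicodeDecodeError.
    """
    return s.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')

s = '\ud800\udc01'
fixed = combine_surrogates(s)

print(len(s), len(fixed))      # 2 1
print(hex(ord(fixed)))         # 0x10001
print(fixed.encode('utf-8'))   # b'\xf0\x90\x80\x81'
```

Because the caller has to ask for it, UTF-16 is not privileged in the default code path, which is the design concern raised above.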
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-10-08 21:28 -0400 |
| Message-ID | <mailman.883.1381282120.18130.python-list@python.org> |
| In reply to | #56450 |
On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
> On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
>
>> In any case, "\ud800\udc01" isn't a valid unicode string.
>
> I don't think this is correct. Can you show me where the standard says
> that Unicode strings[1] may not contain surrogates? I think that is a

see below.

> critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
> one of the UTFs) with *Unicode strings*.
>
> The string you give above is a Unicode string containing two code
> points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
> legal string (subject to somebody pointing me to a definitive source that
> proves it is not). However, it *may or may not* be encodable to bytes
> using UTF-8, -16 or -32.

From chapter two of the standard.

"Plain text is a pure sequence of character codes; plain Unicode-encoded
text is therefore a sequence of Unicode character codes."

http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708

"All three encoding forms can be used to represent the full range of
encoded characters in the Unicode Standard; ... Each of the three Unicode
encoding forms can be efficiently transformed into either of the other
two without any loss of data."

"Surrogates Area. The Surrogates Area contains only surrogate code points
and no encoded characters. See Section 16.6, Surrogates Area, for more
detail."

Before utf-16, the surrogates area was, I believe, part of the Private
Use Area (which now starts where surrogates end). I think it would have
been better if they were no longer called code points, but simply utf-16
code units.

> Just as there are byte sequences that cannot be generated by the UTFs,
> possibly there are code point sequences that cannot be converted to bytes
> using the UTFs.

True, but not to the point. You switched from sequences of characters
(unicode text), which is what both I and Neil are talking about, to
sequences of codepoints which is a larger set when you include the
non-character surrogate 'code points' that are not allowed in unicode
text.

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404

"The Unicode Standard supports three character encoding forms: UTF-32,
UTF-16, and UTF-8. Each encoding form maps the Unicode code points
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

> [1] Sequences of Unicode code points.

This is not the Standard's definition of 'unicode text'. It is also not
its definition of 'unicode string'.

"D80 Unicode string: A code unit sequence containing code units of a
particular Unicode encoding form."

In other words, a Unicode string is a utf encoding of unicode text. The
FSR adaptively uses a subset of possible sequences from all three, though
only one utf is used for any particular string.

--
D79 says what I claimed before: "The mapping of the set of Unicode scalar
values to the set of code unit sequences for a Unicode encoding form is
one-to-one."

--
Terry Jan Reedy
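The ranges quoted from chapter 3 can be checked against Python's strict codecs. This is a rough sketch under two assumptions: the helper names `is_scalar_value` and `encodable` are made up for illustration, and the interpreter is Python 3.4 or later (the UTF-16 and UTF-32 encoders in 3.3 still accepted surrogates).

```python
def is_scalar_value(cp):
    """True if cp is a Unicode scalar value (any code point except a surrogate)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

def encodable(cp, codec):
    """True if the single code point cp can be encoded strictly with codec."""
    try:
        chr(cp).encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Probe the edges of the surrogate range against all three encoding forms.
boundaries = [0xD7FF, 0xD800, 0xDBFF, 0xDC00, 0xDFFF, 0xE000, 0x10FFFF]
results = {
    cp: all(encodable(cp, codec) for codec in ('utf-8', 'utf-16-le', 'utf-32-le'))
    for cp in boundaries
}
for cp in boundaries:
    print(hex(cp), is_scalar_value(cp), results[cp])
```

On such an interpreter the two boolean columns should agree on every row: only scalar values are encodable by all three forms.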
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2013-10-09 06:20 +0000 |
| Message-ID | <5254f594$0$29976$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #56461 |
On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote:

> On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
>> On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
>>
>>> In any case, "\ud800\udc01" isn't a valid unicode string.
>>
>> I don't think this is correct. Can you show me where the standard says
>> that Unicode strings[1] may not contain surrogates? I think that is a
>
> see below.
>
>> critical point, and the FAQ conflates *encoded strings* (i.e. bytes
>> using one of the UTFs) with *Unicode strings*.
>>
>> The string you give above is a Unicode string containing two code
>> points, the surrogates U+D800 U+DC01, which as far as I am concerned is
>> a legal string (subject to somebody pointing me to a definitive source
>> that proves it is not). However, it *may or may not* be encodable to
>> bytes using UTF-8, -16 or -32.
>
> From chapter two of the standard.
>
> "Plain text is a pure sequence of character codes; plain Unicode-encoded
> text is therefore a sequence of Unicode character codes."

Also there are many valid non-characters in Unicode, including 66
explicitly defined non-characters, plus the many surrogates. So defining
Unicode strings in terms of characters is less than helpful, since it
excludes a whole bunch of strings which aren't "text" since they include
non-characters.

Also, "character" in the context of Unicode is ambiguous, due to
normalization and decomposition: a single character can have up to four
distinct forms.

http://www.macchiato.com/unicode/nfc-faq

*Code points* are rigorously defined, not characters, which is why I have
tried very hard to only refer to code points and bytes, not characters.

> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
> encoding forms can be used to represent the full range of encoded
> characters in the Unicode Standard; ... Each of the three Unicode
> encoding forms can be efficiently transformed into either of the other
> two without any loss of data."

This merely says "encodings encode characters". We know that encodings
can also encode non-characters, at least *some* non-characters. The
question is, can they encode surrogates?

> "Surrogates Area. The Surrogates Area contains only surrogate code
> points and no encoded characters. See Section 16.6, Surrogates Area, for
> more detail."
>
> Before utf-16, the surrogates area was, I believe, part of the Private
> Use Area (which now starts where surrogates end). I think it would have
> been better if they were no longer called code points, but simply utf-16
> code units.

Private Use is irrelevant, since strings certainly can contain Private
Use code-points, and UTF encodings can encode them.

>> Just as there are byte sequences that cannot be generated by the UTFs,
>> possibly there are code point sequences that cannot be converted to
>> bytes using the UTFs.
>
> True, but not to the point. You switched from sequences of characters
> (unicode text), which is what both I and Neil are talking about, to
> sequences of codepoints which is a larger set when you include the
> non-character surrogate 'code points' that are not allowed in unicode
> text.

I never mentioned sequences of characters. I've always talked about code
points.

> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404
>
> "The Unicode Standard supports three character encoding forms: UTF-32,
> UTF-16, and UTF-8. Each encoding form maps the Unicode code points
> U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

Ah! Now we're getting somewhere! I think you've hit the nail on the head:
the three UTF forms explicitly exclude the surrogates.

So I think we now have an answer:

Surrogate code points can exist in Unicode strings, but cannot be encoded
to bytes using the standard UTF-8, UTF-16 and UTF-32 encodings. There may
be other encodings, or error handlers, which are capable of handling
surrogates, but they aren't UTF-8.

So I think this answers my question. (I reserve the right to change my
mind after reading more of the standard.)

Thank you to everyone who replied.

--
Steven
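For reference, the error handlers alluded to above do exist in the standard library. The following sketch, with an illustrative `outcomes` dictionary, shows what each standard handler does with the lone surrogate from the first post when the codec is UTF-8. None of the results is strict UTF-8 output for the original string.

```python
surr = '\udc80'

# Same codec each time; only the error handler changes.
outcomes = {
    'ignore': surr.encode('utf-8', 'ignore'),                      # drops it
    'replace': surr.encode('utf-8', 'replace'),                    # substitutes '?'
    'backslashreplace': surr.encode('utf-8', 'backslashreplace'),  # escape text
    'surrogateescape': surr.encode('utf-8', 'surrogateescape'),    # raw byte 0x80
    'surrogatepass': surr.encode('utf-8', 'surrogatepass'),        # three-byte form
}
for name, data in outcomes.items():
    print(name, data)
```

Only `surrogateescape` and `surrogatepass` are reversible: decoding with the same handler gives back `'\udc80'`.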
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-09 01:22 -0700 |
| Message-ID | <4b728b1a-cc37-4541-80a2-68335f1d5e5f@googlegroups.com> |
| In reply to | #56471 |
On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
>> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
>> encoding forms can be used to represent the full range of encoded
>> characters in the Unicode Standard; ... Each of the three Unicode
>> encoding forms can be efficiently transformed into either of the other
>> two without any loss of data."

Yes,

and what Unicode.org does not say is that these coding
schemes (like any coding scheme) should be used in an
exclusive way. Probably, because it is too obvious to understand.

jmf
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-10-09 06:22 -0400 |
| Message-ID | <mailman.892.1381314149.18130.python-list@python.org> |
| In reply to | #56478 |
On 10/9/13 4:22 AM, wxjmfauth@gmail.com wrote:
> On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
>>
>>> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
>>> encoding forms can be used to represent the full range of encoded
>>> characters in the Unicode Standard; ... Each of the three Unicode
>>> encoding forms can be efficiently transformed into either of the other
>>> two without any loss of data."
>>
> Yes,
>
> and what Unicode.org does not say is that these coding
> schemes (like any coding scheme) should be used in an
> exclusive way.

Can you clarify what you mean by "in an exclusive way"?

--Ned.

> Probably, because it is too obvious to understand.
>
> jmf
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-10-09 12:55 +0000 |
| Message-ID | <bbl22jFli8pU1@mid.individual.net> |
| In reply to | #56482 |
On 2013-10-09, Ned Batchelder <ned@nedbatchelder.com> wrote:
> On 10/9/13 4:22 AM, wxjmfauth@gmail.com wrote:
>> and what Unicode.org does not say is that these coding schemes
>> (like any coding scheme) should be used in an exclusive way.
>
> Can you clarify what you mean by "in an exclusive way"?

Ned, pay no attention to the person walloping that dead horse.

--
Neil Cerutti
| From | Pete Forman <petef4+usenet@gmail.com> |
|---|---|
| Date | 2013-10-08 16:23 +0100 |
| Message-ID | <86mwmjlraq.fsf@gmail.com> |
| In reply to | #56397 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0: surrogates not allowed
>
>
> Have I misunderstood? I think that Python is being too strict about
> rejecting surrogate code points. It should only reject lone surrogates,
> or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
> or is this a bug in Python's handling of UTF-8?
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
D75 Surrogate pair: A representation for a single abstract character
that consists of a sequence of two 16-bit code units, where the first
value of the pair is a high-surrogate code unit and the second value
is a low-surrogate code unit.
* Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
Encoding Forms.)
* Isolated surrogate code units have no interpretation on their own.
Certain other isolated code units in other encoding forms also have no
interpretation on their own. For example, the isolated byte [\x80] has
no interpretation in UTF-8; it can be used only as part of a multibyte
sequence. (See Table 3-7). It could be argued that this line by itself
should raise an error.
That first bullet indicates that it is indeed illegal to use surrogate
pairs in UTF-8 or UTF-32.
--
Pete Forman
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-10-08 18:00 +0100 |
| Message-ID | <mailman.867.1381251660.18130.python-list@python.org> |
| In reply to | #56415 |
On 08/10/2013 16:23, Pete Forman wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>
>> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
> [snip]
>> py> s = '\ud800\udc01'
>> py> s.encode('utf-8')
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
>> position 0: surrogates not allowed
>>
>>
>> Have I misunderstood? I think that Python is being too strict about
>> rejecting surrogate code points. It should only reject lone surrogates,
>> or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
>> or is this a bug in Python's handling of UTF-8?
>
> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
>
> D75 Surrogate pair: A representation for a single abstract character
> that consists of a sequence of two 16-bit code units, where the first
> value of the pair is a high-surrogate code unit and the second value
> is a low-surrogate code unit.
>
> * Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
> Encoding Forms.)
>
> * Isolated surrogate code units have no interpretation on their own.
> Certain other isolated code units in other encoding forms also have no
> interpretation on their own. For example, the isolated byte [\x80] has
> no interpretation in UTF-8; it can be used only as part of a multibyte
> sequence. (See Table 3-7). It could be argued that this line by itself
> should raise an error.
>
>
> That first bullet indicates that it is indeed illegal to use surrogate
> pairs in UTF-8 or UTF-32.
>
The only time you should get a surrogate pair in a Unicode string is in
a narrow build, which doesn't exist in Python 3.3 and later.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-08 11:24 -0700 |
| Message-ID | <f292b9d8-3d63-4848-ba3e-48839b0071e4@googlegroups.com> |
| In reply to | #56431 |
--------
>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0:
surrogates not allowed
>>> '\ud800'.encode('utf-32-be')
b'\x00\x00\xd8\x00'
>>> '\ud800'.encode('utf-32-le')
b'\x00\xd8\x00\x00'
>>> '\ud800'.encode('utf-32')
b'\xff\xfe\x00\x00\x00\xd8\x00\x00'
jmf
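The session above is from Python 3.3.2. As far as the `codecs` documentation states, the UTF-16 and UTF-32 encoders stopped accepting surrogate code points in Python 3.4, so the same calls are expected to fail on later versions. A small sketch for checking this on whatever interpreter is at hand; `try_encode` is a hypothetical helper, not a library function.

```python
import sys

def try_encode(text, codec):
    """Return the encoded bytes, or the exception class name if strict encoding fails."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return type(exc).__name__

print(sys.version_info[:2])
for codec in ('utf-8', 'utf-16-le', 'utf-32-le'):
    print(codec, try_encode('\ud800', codec))
```

On 3.4 and later all three lines should report `UnicodeEncodeError`; on 3.3 only the UTF-8 line does.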
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-10-08 22:20 +0000 |
| Message-ID | <52548549$0$29984$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #56431 |
On Tue, 08 Oct 2013 18:00:58 +0100, MRAB wrote:

> The only time you should get a surrogate pair in a Unicode string is in
> a narrow build, which doesn't exist in Python 3.3 and later.

Incorrect.

py> import sys
py> sys.version
'3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)]'
py> s = '\ud800\udc01'
py> print(len(s))
2
py> import unicodedata as ud
py> for c in s:
...     print(ud.category(c))
...
Cs
Cs

s is a string containing two code points making up a surrogate pair.

It is very frustrating that the Unicode FAQs don't always clearly
distinguish between when they are talking about bytes and when they are
talking about code points. This area about surrogates is one of the
places where they conflate the two.

--
Steven
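The same category check can be wrapped into a small scanner for locating surrogate code points in arbitrary strings. This is only a sketch; `find_surrogates` is a made-up name, and it relies on `unicodedata.category` returning `'Cs'` for every code point in U+D800..U+DFFF, as the session above shows for two of them.

```python
import unicodedata

def find_surrogates(s):
    """Return (index, hex code point) for every surrogate code point in s."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(s)
            if unicodedata.category(ch) == 'Cs']

print(find_surrogates('\ud800\udc01'))                  # [(0, '0xd800'), (1, '0xdc01')]
print(find_surrogates('\N{LINEAR B SYLLABLE B038 E}'))  # []
```

A supplementary character stored as a single code point reports no surrogates, which is the difference between the two strings discussed in this thread.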
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-10-08 17:47 -0400 |
| Message-ID | <mailman.876.1381268848.18130.python-list@python.org> |
| In reply to | #56397 |
On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
>
> If I've read the Unicode FAQs correctly, you cannot encode *lone*
> surrogate code points into UTF-8:
>
> http://www.unicode.org/faq/utf_bom.html#utf8-5
>
> Sure enough, using Python 3.3:
>
> py> surr = '\udc80'
I am pretty sure that if Python were being strict, that would raise an
error, as the result is not a valid unicode string. Allowing the above
or not was debated and laxness was allowed for at least the following
practical reasons.
1. Python itself uses the invalid surrogate codepoints for
surrogateescape error-handling.
http://www.python.org/dev/peps/pep-0383/
2. Invalid strings are needed for tests ;-)
-- like the one you do next.
3. Invalid strings may be needed for interfacing with other C APIs.
> py> surr.encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
> position 0: surrogates not allowed
Default strict encoding (utf-8 or otherwise) will only encode valid
unicode strings. Encode invalid strings with surrogate codepoints with
surrogateescape error handling.
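As a rough illustration of the surrogateescape mechanism from PEP 383: undecodable bytes are smuggled into a string as lone low surrogates and restored on encoding. The byte string `raw` is an invented example standing in for, say, a filename in a legacy encoding.

```python
# Bytes that are not valid UTF-8 (0xE9 is Latin-1 'e acute', not a UTF-8 sequence).
raw = b'caf\xe9.txt'

# Decoding with surrogateescape maps each undecodable byte 0xNN to U+DCNN.
name = raw.decode('utf-8', 'surrogateescape')
print(ascii(name))          # 'caf\udce9.txt'

# Strict encoding rejects the resulting string...
try:
    name.encode('utf-8')
    strict_reason = None
except UnicodeEncodeError as exc:
    strict_reason = exc.reason
print(strict_reason)        # surrogates not allowed

# ...but encoding with the same handler restores the original bytes.
restored = name.encode('utf-8', 'surrogateescape')
print(restored == raw)      # True
```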
> But reading the previous entry in the FAQs:
>
> http://www.unicode.org/faq/utf_bom.html#utf8-4
>
> I interpret this as meaning that I should be able to encode valid pairs
> of surrogates.
It says you should be able to 'convert' them, and that the result for
utf-8 encoding must be a single 4-byte code for the corresponding
supplementary codepoint.
> So if I find a code point that encodes to a surrogate pair
> in UTF-16:
>
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to
> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
> point. But I can't:
>
> py> s = '\ud800\udc01'
This is now a string with two invalid codepoints instead of one ;-).
As above, it would be rejected if Python were being strict.
> py> s.encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0: surrogates not allowed
>
>
> Have I misunderstood? I think that Python is being too strict about
> rejecting surrogate code points.
No, it is being too lax about allowing them at all.
I believe there is an issue on the tracker (maybe closed) about the doc
for unicode escapes in string literals. Perhaps it should say more
clearly that inserting surrogates is allowed but results in an invalid
string that cannot be normally encoded.
--
Terry Jan Reedy
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-10-08 18:17 -0400 |
| Message-ID | <mailman.877.1381270636.18130.python-list@python.org> |
| In reply to | #56397 |
On 10/8/2013 5:47 PM, Terry Reedy wrote:
> On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
>> But reading the previous entry in the FAQs:
>>
>> http://www.unicode.org/faq/utf_bom.html#utf8-4
>>
>> I interpret this as meaning that I should be able to encode valid pairs
>> of surrogates.
>
> It says you should be able to 'convert' them, and that the result for
> utf-8 encoding must be a single 4-byte code for the corresponding
> supplementary codepoint.
To expand on this: The FAQ question is "How do I convert a UTF-16
surrogate pair such as <D800 DC00> to UTF-8?" utf-16 and utf-8 are both
byte (or double byte) encodings of codepoints. Direct conversion would
be 'transcoding', not encoding. Python has a few bytes transcoders and
one string transcoder (rot_13), listed at the end of
http://docs.python.org/3/library/codecs.html#python-specific-encodings
But in general, one must decode bytes to string and encode back to bytes.
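For the FAQ's own example, the decode-then-encode route looks like this in Python 3. It is a minimal sketch; the variable names are illustrative.

```python
# The FAQ's surrogate pair <D800 DC00>, as big-endian UTF-16 bytes.
utf16_bytes = b'\xd8\x00\xdc\x00'

# Step 1: decode the bytes to a string. The pair becomes one code point.
text = utf16_bytes.decode('utf-16-be')

# Step 2: encode that string to UTF-8 bytes.
utf8_bytes = text.encode('utf-8')

print(len(text), hex(ord(text)))   # 1 0x10000
print(utf8_bytes)                  # b'\xf0\x90\x80\x80'
```

The surrogates exist only in the UTF-16 bytes; the intermediate string holds the single supplementary code point.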
>> So if I find a code point that encodes to a surrogate pair
>> in UTF-16:
>>
>> py> c = '\N{LINEAR B SYLLABLE B038 E}'
>> py> surr_pair = c.encode('utf-16be')
>> py> print(surr_pair)
>> b'\xd8\x00\xdc\x01'
>>
>> and then use those same values as the code points, I ought to be able to
>> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
>> point.
I believe the utf encodings are defined as 1 to 1. If the above worked,
utf-8 would not be.
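One way to see the one-to-one property from the decoding side, sketched under the assumption of a Python 3 strict UTF-8 decoder: the four-byte sequence decodes to U+10001, while the six-byte sequence that encodes the two surrogates separately is rejected, so each scalar value has exactly one byte representation.

```python
good = b'\xf0\x90\x80\x81'            # UTF-8 for U+10001
bad = b'\xed\xa0\x80\xed\xb0\x81'     # D800 and DC01 each encoded as three bytes

print(ascii(good.decode('utf-8')))    # '\U00010001'

# The strict decoder refuses the surrogate byte sequences.
try:
    bad.decode('utf-8')
    bad_is_valid = True
except UnicodeDecodeError:
    bad_is_valid = False
print(bad_is_valid)                   # False

# The two strings are also distinct objects at the code point level.
print('\ud800\udc01' == '\U00010001')  # False
```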
--
Terry Jan Reedy