| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Subject | Re: Encoding of surrogate code points to UTF-8 |
| Date | 2013-10-08 18:17 -0400 |
| References | <52540e03$0$29984$c3e8da3$5496439d@news.astraweb.com> <l31ugr$fgh$1@ger.gmane.org> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.877.1381270636.18130.python-list@python.org> |
On 10/8/2013 5:47 PM, Terry Reedy wrote:
> On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
>> But reading the previous entry in the FAQs:
>>
>> http://www.unicode.org/faq/utf_bom.html#utf8-4
>>
>> I interpret this as meaning that I should be able to encode valid pairs
>> of surrogates.
>
> It says you should be able to 'convert' them, and that the result for
> utf-8 encoding must be a single 4-bytes code for the corresponding
> supplementary codepoint.
To expand on this: The FAQ question is "How do I convert a UTF-16
surrogate pair such as <D800 DC00> to UTF-8?" UTF-16 and UTF-8 are both
byte (or double-byte) encodings of code points. Direct conversion would
be 'transcoding', not encoding. Python has a few bytes-to-bytes transcoders
and one string-to-string transcoder (rot_13), listed at the end of
http://docs.python.org/3/library/codecs.html#python-specific-encodings
But in general, one must decode bytes to a string and then encode back to
bytes.
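A minimal sketch of that decode-then-encode route, assuming Python 3.3 or later. The variable names are only illustrative; the input bytes are the <D800 DC00> pair from the FAQ question, read as big-endian UTF-16.

```python
import codecs

# UTF-16BE bytes for the surrogate pair <D800 DC00>, i.e. code point U+10000.
utf16_bytes = b'\xd8\x00\xdc\x00'

# Step 1: decode bytes to a string. The pair becomes one code point.
text = utf16_bytes.decode('utf-16be')

# Step 2: encode the string back to bytes, this time as UTF-8.
utf8_bytes = text.encode('utf-8')

print(len(text), hex(ord(text)))  # 1 0x10000
print(utf8_bytes)                 # b'\xf0\x90\x80\x80'

# By contrast, rot_13 is a str-to-str transcoder and needs codecs.encode().
rot = codecs.encode('hello', 'rot_13')
print(rot)                        # uryyb
```

The result is a single 4-byte UTF-8 sequence, which is what the FAQ entry asks for.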
>> So if I find a code point that encodes to a surrogate pair
>> in UTF-16:
>>
>> py> c = '\N{LINEAR B SYLLABLE B038 E}'
>> py> surr_pair = c.encode('utf-16be')
>> py> print(surr_pair)
>> b'\xd8\x00\xdc\x01'
>>
>> and then use those same values as the code points, I ought to be able to
>> encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
>> point.
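I believe the UTF encodings are defined as one-to-one: each code point has
exactly one valid encoding. If the above worked, UTF-8 would not be, since
U+10001 would then have both its 4-byte form and a second, 6-byte form
built from the two surrogates.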
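A short sketch of how this plays out in Python 3, assuming the default 'strict' error handler. The 'surrogatepass' handler is used here only to show which bytes would result if surrogates were allowed through; the names are illustrative.

```python
# U+10001 is LINEAR B SYLLABLE B038 E, the character from the quoted example.
c = '\U00010001'
proper = c.encode('utf-8')            # the one valid UTF-8 form, 4 bytes

# The same two 16-bit values used as (surrogate) code points in a str.
surrogates = '\ud800\udc01'

# Strict UTF-8 encoding refuses surrogate code points.
try:
    surrogates.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' lets them through, giving a different, 6-byte sequence.
lenient = surrogates.encode('utf-8', 'surrogatepass')

print(proper)     # b'\xf0\x90\x80\x81'
print(strict_ok)  # False
print(lenient)    # b'\xed\xa0\x80\xed\xb0\x81'
```

If the 6-byte sequence were also accepted as an encoding of U+10001, that code point would have two UTF-8 representations.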
--
Terry Jan Reedy