Re: Unicode surrogate pairs (Python 3.4)

Date	2015-05-03 16:53 +0100
From	MRAB <python@mrabarnett.plus.com>
Subject	Re: Unicode surrogate pairs (Python 3.4)
References	<slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk> <mailman.67.1430665534.12865.python-list@python.org> <slrnmkcftt.230.jon+usenet@frosty.unequivocal.co.uk>
Newsgroups	comp.lang.python
Message-ID	<mailman.69.1430668429.12865.python-list@python.org>


On 2015-05-03 16:32, Jon Ribbens wrote:
> On 2015-05-03, Chris Angelico <rosuav@gmail.com> wrote:
>> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens <jon+usenet@unequivocal.co.uk> wrote:
>>> If I have a string containing surrogate pairs like this in Python 3.4:
>>>
>>>   "\udb40\udd9d"
>>>
>>> How do I convert it into the proper form:
>>>
>>>   "\U000E019D"
>>>
>>> ? The answer appears not to be "unicodedata.normalize".
>>
>> No, it's not, because Unicode normalization is a very specific thing.
>> You're looking for a fix for some kind of encoding issue; Unicode
>> normalization translates between combining characters and combined
>> characters.
>>
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>
That document looks like it's encoded in UTF-8.

>> But if you absolutely have to convert the surrogates, it ought to be
>> possible to do a sloppy UCS-2 conversion to bytes, then a proper
>> UTF-16 decode on the result.
>
> Python doesn't appear to have UCS-2 support, so I guess what you're
> saying is that I have to write my own surrogate-decoder? This seems
> a little surprising.
>
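
The conversion Chris describes (surrogates to UTF-16 bytes, then a normal
UTF-16 decode) can probably be done with the standard codecs alone. What
follows is only a minimal sketch: the helper name join_surrogates is
hypothetical, and it assumes Python 3.4 or later, where the 'surrogatepass'
error handler works with the utf-16 codecs.

```python
def join_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into single code points.

    Hypothetical helper for illustration; not part of the standard library.
    """
    # 'surrogatepass' lets the encoder write surrogate code points as raw
    # UTF-16 code units instead of raising UnicodeEncodeError.
    raw = text.encode('utf-16-le', 'surrogatepass')
    # A strict UTF-16 decode then reads each high/low pair as one character.
    return raw.decode('utf-16-le')


fixed = join_surrogates("\udb40\udd9d")
print(ascii(fixed))
```

Because the decode step is strict, a string with an unpaired surrogate should
raise UnicodeDecodeError rather than pass through silently. Text without
surrogates comes back unchanged.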

