Unicode surrogate pairs (Python 3.4)

Started by	Jon Ribbens <jon+usenet@unequivocal.co.uk>
First post	2015-05-03 14:40 +0000
Last post	2015-05-03 19:20 +0000
Articles	11 — 4 participants

  Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 14:40 +0000
    Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 01:05 +1000
      Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 15:32 +0000
        Re: Unicode surrogate pairs (Python 3.4) Marko Rauhamaa <marko@pacujo.net> - 2015-05-03 18:35 +0300
        Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 01:48 +1000
          Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 16:30 +0000
            Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 02:47 +1000
        Re: Unicode surrogate pairs (Python 3.4) MRAB <python@mrabarnett.plus.com> - 2015-05-03 16:53 +0100
          Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 16:26 +0000
            Re: Unicode surrogate pairs (Python 3.4) MRAB <python@mrabarnett.plus.com> - 2015-05-03 18:09 +0100
              Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 19:20 +0000

#89869 — Unicode surrogate pairs (Python 3.4)

From	Jon Ribbens <jon+usenet@unequivocal.co.uk>
Date	2015-05-03 14:40 +0000
Subject	Unicode surrogate pairs (Python 3.4)
Message-ID	<slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk>

If I have a string containing surrogate pairs like this in Python 3.4:

  "\udb40\udd9d"

How do I convert it into the proper form:

  "\U000E019D"

? The answer appears not to be "unicodedata.normalize".

#89870

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-04 01:05 +1000
Message-ID	<mailman.67.1430665534.12865.python-list@python.org>
In reply to	#89869

On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
> If I have a string containing surrogate pairs like this in Python 3.4:
>
>   "\udb40\udd9d"
>
> How do I convert it into the proper form:
>
>   "\U000E019D"
>
> ? The answer appears not to be "unicodedata.normalize".

No, it's not, because Unicode normalization is a very specific thing.
You're looking for a fix for some kind of encoding issue; Unicode
normalization translates between combining characters and combined
characters.

You shouldn't even actually _have_ those in your string in the first
place. How did you construct/receive that data? Ideally, catch it at
that point, and deal with it there. But if you absolutely have to
convert the surrogates, it ought to be possible to do a sloppy UCS-2
conversion to bytes, then a proper UTF-16 decode on the result.

ChrisA
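
[Editor's note] The distinction drawn above can be checked directly. The following is a minimal sketch, not part of the original post; the variable names are illustrative. It shows that NFC normalization composes a base letter with a combining accent, while a string of two surrogate code points should come back unchanged.

```python
import unicodedata

# Normalization composes "e" + COMBINING ACUTE ACCENT into one character...
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# ...but it does not pair up surrogate code points; the string keeps
# its two separate code points.
surrogates = "\udb40\udd9d"
still_surrogates = unicodedata.normalize("NFC", surrogates)
print(len(still_surrogates))
```

Only the lengths are printed, because writing unpaired surrogates to a UTF-8 terminal would itself raise an encoding error.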

#89873

From	Jon Ribbens <jon+usenet@unequivocal.co.uk>
Date	2015-05-03 15:32 +0000
Message-ID	<slrnmkcftt.230.jon+usenet@frosty.unequivocal.co.uk>
In reply to	#89870

On 2015-05-03, Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
><jon+usenet@unequivocal.co.uk> wrote:
>> If I have a string containing surrogate pairs like this in Python 3.4:
>>
>>   "\udb40\udd9d"
>>
>> How do I convert it into the proper form:
>>
>>   "\U000E019D"
>>
>> ? The answer appears not to be "unicodedata.normalize".
>
> No, it's not, because Unicode normalization is a very specific thing.
> You're looking for a fix for some kind of encoding issue; Unicode
> normalization translates between combining characters and combined
> characters.
>
> You shouldn't even actually _have_ those in your string in the first
> place. How did you construct/receive that data? Ideally, catch it at
> that point, and deal with it there.

That would, unfortunately, be "tell the Unicode Consortium to format
their documents differently", which seems unlikely to happen. I'm
trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

> But if you absolutely have to convert the surrogates, it ought to be
> possible to do a sloppy UCS-2 conversion to bytes, then a proper
> UTF-16 decode on the result.

Python doesn't appear to have UCS-2 support, so I guess what you're
saying is that I have to write my own surrogate-decoder? This seems
a little surprising.

#89874

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-05-03 18:35 +0300
Message-ID	<87fv7disi2.fsf@elektro.pacujo.net>
In reply to	#89873

Jon Ribbens <jon+usenet@unequivocal.co.uk>:

> Python doesn't appear to have UCS-2 support, so I guess what you're
> saying is that I have to write my own surrogate-decoder? This seems a
> little surprising.

Try UTF-16.


Marko

#89876

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-04 01:48 +1000
Message-ID	<mailman.68.1430668130.12865.python-list@python.org>
In reply to	#89873

On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
are in your input. I'm not sure what the best way to deal with that
is... it's a bit of a mess. You may find yourself needing to do
something manually, unless there's a way to ask Python to encode to
pseudo-UCS-2 that allows surrogates. Some languages may have sloppy
conversions available, but Python's seems to be quite strict (which is
correct). Is there an errors handler that can do this?

ChrisA

#89881

From	Jon Ribbens <jon+usenet@unequivocal.co.uk>
Date	2015-05-03 16:30 +0000
Message-ID	<slrnmkcjbf.230.jon+usenet@frosty.unequivocal.co.uk>
In reply to	#89876

On 2015-05-03, Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
><jon+usenet@unequivocal.co.uk> wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>
> Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
> are in your input.

Well, they were, but I already wrote code to convert them into the
strings I showed in my original post.

> I'm not sure what the best way to deal with that is... it's a bit of
> a mess. You may find yourself needing to do something manually,
> unless there's a way to ask Python to encode to pseudo-UCS-2 that
> allows surrogates. Some languages may have sloppy conversions
> available, but Python's seems to be quite strict (which is correct).
> Is there an errors handler that can do this?

I did some experimentation, and it looks like the answer is:

  "\udb40\udd9d".encode("utf16", "surrogatepass").decode("utf16")

Thanks for your help!
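
[Editor's note] The one-liner above can be wrapped in a small helper. This is a sketch only; the function name `join_surrogates` is hypothetical and does not come from the thread. It relies on the "surrogatepass" error handler, which the UTF-16 codecs accept from Python 3.4 onwards.

```python
def join_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into single code points."""
    # "surrogatepass" lets the encoder write surrogate code points out
    # as raw UTF-16 code units instead of raising UnicodeEncodeError.
    raw = text.encode("utf-16", "surrogatepass")
    # A normal (strict) UTF-16 decode then reads each high/low pair
    # as one supplementary code point.
    return raw.decode("utf-16")


fixed = join_surrogates("\udb40\udd9d")
print(len(fixed), hex(ord(fixed)))  # 1 0xe019d
```

Because the decode step is strict, a string containing an unpaired surrogate raises UnicodeDecodeError rather than being passed through silently.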

#89883

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-04 02:47 +1000
Message-ID	<mailman.71.1430671665.12865.python-list@python.org>
In reply to	#89881

On Mon, May 4, 2015 at 2:30 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
> I did some experimentation, and it looks like the answer is:
>
>   "\udb40\udd9d".encode("utf16", "surrogatepass").decode("utf16")
>
> Thanks for your help!

Ha! That's the one. I went poking around but couldn't find the name
for it. That's exactly the sloppy encoding that I was talking about.

ChrisA

#89877

From	MRAB <python@mrabarnett.plus.com>
Date	2015-05-03 16:53 +0100
Message-ID	<mailman.69.1430668429.12865.python-list@python.org>
In reply to	#89873

On 2015-05-03 16:32, Jon Ribbens wrote:
> On 2015-05-03, Chris Angelico <rosuav@gmail.com> wrote:
>> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
>><jon+usenet@unequivocal.co.uk> wrote:
>>> If I have a string containing surrogate pairs like this in Python 3.4:
>>>
>>>   "\udb40\udd9d"
>>>
>>> How do I convert it into the proper form:
>>>
>>>   "\U000E019D"
>>>
>>> ? The answer appears not to be "unicodedata.normalize".
>>
>> No, it's not, because Unicode normalization is a very specific thing.
>> You're looking for a fix for some kind of encoding issue; Unicode
>> normalization translates between combining characters and combined
>> characters.
>>
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>
That document looks like it's encoded in UTF-8.

>> But if you absolutely have to convert the surrogates, it ought to be
>> possible to do a sloppy UCS-2 conversion to bytes, then a proper
>> UTF-16 decode on the result.
>
> Python doesn't appear to have UCS-2 support, so I guess what you're
> saying is that I have to write my own surrogate-decoder? This seems
> a little surprising.
>

#89880

From	Jon Ribbens <jon+usenet@unequivocal.co.uk>
Date	2015-05-03 16:26 +0000
Message-ID	<slrnmkcj3j.230.jon+usenet@frosty.unequivocal.co.uk>
In reply to	#89877

On 2015-05-03, MRAB <python@mrabarnett.plus.com> wrote:
> On 2015-05-03 16:32, Jon Ribbens wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>
> That document looks like it's encoded in UTF-8.

It is. But it also, for reasons best known to the Unicode Consortium,
contains strings of the form \uXXXX which need to be parsed into the
appropriate character, and some of *those* are then surrogate pairs,
which need to be further converted.
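
[Editor's note] The two steps described above (expand the literal \uXXXX escapes, then join any surrogate pairs) might look like the sketch below. The poster's actual parsing code is not shown in the thread, so the regular expression and the name `unescape` are assumptions made for illustration.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(field):
    """Expand \\uXXXX escapes in a field, then join surrogate pairs."""
    # Step 1: replace each escape with the code point it names.
    # Escapes in the range D800-DFFF become surrogate code points.
    expanded = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), field)
    # Step 2: round-trip through UTF-16 so each high/low surrogate
    # pair is read back as one supplementary code point.
    return expanded.encode("utf-16", "surrogatepass").decode("utf-16")


print(ascii(unescape(r"\udb40\udd9d")))  # '\U000e019d'
```

Text without escapes, including non-ASCII text already decoded from the UTF-8 file, passes through both steps unchanged.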

#89885

From	MRAB <python@mrabarnett.plus.com>
Date	2015-05-03 18:09 +0100
Message-ID	<mailman.73.1430672962.12865.python-list@python.org>
In reply to	#89880

On 2015-05-03 17:26, Jon Ribbens wrote:
> On 2015-05-03, MRAB <python@mrabarnett.plus.com> wrote:
>> On 2015-05-03 16:32, Jon Ribbens wrote:
>>> That would, unfortunately, be "tell the Unicode Consortium to format
>>> their documents differently", which seems unlikely to happen. I'm
>>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>>
>> That document looks like it's encoded in UTF-8.
>
> It is. But it also, for reasons best known to the Unicode Consortium,
> contains strings of the form \uXXXX which need to be parsed into the
> appropriate character, and some of *those* are then surrogate pairs,
> which need to be further converted.
>
Ah, so it's r"\udb40\udd9d". :-)

There's also a mistake in this bit:

"""
# Note that according to the \uXXXX escaping convention, a supplemental character (> 0x10FFFF) is represented
# by a sequence of two surrogate characters: the first between D800 and DBFF, and the second between DC00 and DFFF.
"""

#89890

From	Jon Ribbens <jon+usenet@unequivocal.co.uk>
Date	2015-05-03 19:20 +0000
Message-ID	<slrnmkct94.230.jon+usenet@frosty.unequivocal.co.uk>
In reply to	#89885

On 2015-05-03, MRAB <python@mrabarnett.plus.com> wrote:
> There's also a mistake in this bit:
>
> """
> # Note that according to the \uXXXX escaping convention, a supplemental character (> 0x10FFFF) is represented
> # by a sequence of two surrogate characters: the first between D800 and DBFF, and the second between DC00 and DFFF.
> """

Do you mean that it should say "(> 0xFFFF)" ? Far be it from me to
correct the Unicode Consortium on the subject of Unicode ;-)
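
[Editor's note] The arithmetic behind the correction can be sketched as follows. Supplementary code points lie above 0xFFFF, in the range 0x10000 to 0x10FFFF, and UTF-16 splits each one into a high and a low surrogate. The function names here are hypothetical, chosen for illustration.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits: D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits: DC00-DFFF
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0xE019D)
print(hex(high), hex(low))                   # 0xdb40 0xdd9d
print(hex(from_surrogates(0xDB40, 0xDD9D)))  # 0xe019d
```

This reproduces the pair from the original question: U+E019D corresponds to the surrogates DB40 and DD9D.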
