| Started by | Jon Ribbens <jon+usenet@unequivocal.co.uk> |
|---|---|
| First post | 2015-05-03 14:40 +0000 |
| Last post | 2015-05-03 19:20 +0000 |
| Articles | 11 — 4 participants |
Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 14:40 +0000
Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 01:05 +1000
Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 15:32 +0000
Re: Unicode surrogate pairs (Python 3.4) Marko Rauhamaa <marko@pacujo.net> - 2015-05-03 18:35 +0300
Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 01:48 +1000
Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 16:30 +0000
Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 02:47 +1000
Re: Unicode surrogate pairs (Python 3.4) MRAB <python@mrabarnett.plus.com> - 2015-05-03 16:53 +0100
Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 16:26 +0000
Re: Unicode surrogate pairs (Python 3.4) MRAB <python@mrabarnett.plus.com> - 2015-05-03 18:09 +0100
Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 19:20 +0000
| From | Jon Ribbens <jon+usenet@unequivocal.co.uk> |
|---|---|
| Date | 2015-05-03 14:40 +0000 |
| Subject | Unicode surrogate pairs (Python 3.4) |
| Message-ID | <slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk> |
If I have a string containing surrogate pairs like this in Python 3.4:

"\udb40\udd9d"

How do I convert it into the proper form:

"\U000E019D"

? The answer appears not to be "unicodedata.normalize".
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-04 01:05 +1000 |
| Message-ID | <mailman.67.1430665534.12865.python-list@python.org> |
| In reply to | #89869 |
On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
> If I have a string containing surrogate pairs like this in Python 3.4:
>
> "\udb40\udd9d"
>
> How do I convert it into the proper form:
>
> "\U000E019D"
>
> ? The answer appears not to be "unicodedata.normalize".

No, it's not, because Unicode normalization is a very specific thing.
You're looking for a fix for some kind of encoding issue; Unicode
normalization translates between combining characters and combined
characters.

You shouldn't even actually _have_ those in your string in the first
place. How did you construct/receive that data? Ideally, catch it at
that point, and deal with it there.

But if you absolutely have to convert the surrogates, it ought to be
possible to do a sloppy UCS-2 conversion to bytes, then a proper
UTF-16 decode on the result.

ChrisA
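The distinction drawn above, between normalization and an encoding fix, can be sketched roughly as follows. This is only an illustration assuming CPython 3.x; the variable names are made up for the example. NFC composes a base letter with a combining mark, but it has no reason to touch surrogate code points, which carry no decomposition data.

```python
import unicodedata

# Normalization composes a base letter and a combining mark...
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# ...but it does not interpret a surrogate pair as an encoded character.
# The two surrogate code points come back exactly as they went in.
pair = "\udb40\udd9d"
unchanged = unicodedata.normalize("NFC", pair)
print(unchanged == pair)
```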
| From | Jon Ribbens <jon+usenet@unequivocal.co.uk> |
|---|---|
| Date | 2015-05-03 15:32 +0000 |
| Message-ID | <slrnmkcftt.230.jon+usenet@frosty.unequivocal.co.uk> |
| In reply to | #89870 |
On 2015-05-03, Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
><jon+usenet@unequivocal.co.uk> wrote:
>> If I have a string containing surrogate pairs like this in Python 3.4:
>>
>> "\udb40\udd9d"
>>
>> How do I convert it into the proper form:
>>
>> "\U000E019D"
>>
>> ? The answer appears not to be "unicodedata.normalize".
>
> No, it's not, because Unicode normalization is a very specific thing.
> You're looking for a fix for some kind of encoding issue; Unicode
> normalization translates between combining characters and combined
> characters.
>
> You shouldn't even actually _have_ those in your string in the first
> place. How did you construct/receive that data? Ideally, catch it at
> that point, and deal with it there.

That would, unfortunately, be "tell the Unicode Consortium to format
their documents differently", which seems unlikely to happen. I'm
trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

> But if you absolutely have to convert the surrogates, it ought to be
> possible to do a sloppy UCS-2 conversion to bytes, then a proper
> UTF-16 decode on the result.

Python doesn't appear to have UCS-2 support, so I guess what you're
saying is that I have to write my own surrogate-decoder? This seems a
little surprising.
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-05-03 18:35 +0300 |
| Message-ID | <87fv7disi2.fsf@elektro.pacujo.net> |
| In reply to | #89873 |
Jon Ribbens <jon+usenet@unequivocal.co.uk>:

> Python doesn't appear to have UCS-2 support, so I guess what you're
> saying is that I have to write my own surrogate-decoder? This seems a
> little surprising.

Try UTF-16.

Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-04 01:48 +1000 |
| Message-ID | <mailman.68.1430668130.12865.python-list@python.org> |
| In reply to | #89873 |
On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
are in your input. I'm not sure what the best way to deal with that
is... it's a bit of a mess. You may find yourself needing to do
something manually, unless there's a way to ask Python to encode to
pseudo-UCS-2 that allows surrogates. Some languages may have sloppy
conversions available, but Python's seems to be quite strict (which is
correct). Is there an errors handler that can do this?

ChrisA
| From | Jon Ribbens <jon+usenet@unequivocal.co.uk> |
|---|---|
| Date | 2015-05-03 16:30 +0000 |
| Message-ID | <slrnmkcjbf.230.jon+usenet@frosty.unequivocal.co.uk> |
| In reply to | #89876 |
On 2015-05-03, Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
><jon+usenet@unequivocal.co.uk> wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>
> Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
> are in your input.
Well, they were, but I already wrote code to convert them into the
strings I showed in my original post.
> I'm not sure what the best way to deal with that is... it's a bit of
> a mess. You may find yourself needing to do something manually,
> unless there's a way to ask Python to encode to pseudo-UCS-2 that
> allows surrogates. Some languages may have sloppy conversions
> available, but Python's seems to be quite strict (which is correct).
> Is there an errors handler that can do this?
I did some experimentation, and it looks like the answer is:
"\udb40\udd9d".encode("utf16", "surrogatepass").decode("utf16")
Thanks for your help!
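The one-liner above can be wrapped up as a small helper. This is a minimal sketch, assuming Python 3.4 or later; the function name `combine_surrogates` is invented for the example and is not from any library.

```python
def combine_surrogates(text):
    """Merge UTF-16 surrogate pairs held in a str into single code points."""
    # "surrogatepass" lets the encoder write surrogate code points out as
    # ordinary 16-bit units instead of raising UnicodeEncodeError. The plain
    # UTF-16 decoder then reads each high/low pair as one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


fixed = combine_surrogates("\udb40\udd9d")
print(len(fixed), hex(ord(fixed)))  # 1 0xe019d
```

Note that the decode step is strict, so a string containing an unpaired surrogate raises UnicodeDecodeError rather than passing through silently. Whether that is the desired behaviour depends on the input.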
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-04 02:47 +1000 |
| Message-ID | <mailman.71.1430671665.12865.python-list@python.org> |
| In reply to | #89881 |
On Mon, May 4, 2015 at 2:30 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
> I did some experimentation, and it looks like the answer is:
>
> "\udb40\udd9d".encode("utf16", "surrogatepass").decode("utf16")
>
> Thanks for your help!
Ha! That's the one. I went poking around but couldn't find the name
for it. That's exactly the sloppy encoding that I was talking about.
ChrisA
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2015-05-03 16:53 +0100 |
| Message-ID | <mailman.69.1430668429.12865.python-list@python.org> |
| In reply to | #89873 |
On 2015-05-03 16:32, Jon Ribbens wrote:
> On 2015-05-03, Chris Angelico <rosuav@gmail.com> wrote:
>> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
>><jon+usenet@unequivocal.co.uk> wrote:
>>> If I have a string containing surrogate pairs like this in Python 3.4:
>>>
>>> "\udb40\udd9d"
>>>
>>> How do I convert it into the proper form:
>>>
>>> "\U000E019D"
>>>
>>> ? The answer appears not to be "unicodedata.normalize".
>>
>> No, it's not, because Unicode normalization is a very specific thing.
>> You're looking for a fix for some kind of encoding issue; Unicode
>> normalization translates between combining characters and combined
>> characters.
>>
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>

That document looks like it's encoded in UTF-8.

>> But if you absolutely have to convert the surrogates, it ought to be
>> possible to do a sloppy UCS-2 conversion to bytes, then a proper
>> UTF-16 decode on the result.
>
> Python doesn't appear to have UCS-2 support, so I guess what you're
> saying is that I have to write my own surrogate-decoder? This seems
> a little surprising.
| From | Jon Ribbens <jon+usenet@unequivocal.co.uk> |
|---|---|
| Date | 2015-05-03 16:26 +0000 |
| Message-ID | <slrnmkcj3j.230.jon+usenet@frosty.unequivocal.co.uk> |
| In reply to | #89877 |
On 2015-05-03, MRAB <python@mrabarnett.plus.com> wrote:
> On 2015-05-03 16:32, Jon Ribbens wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>
> That document looks like it's encoded in UTF-8.

It is. But it also, for reasons best known to the Unicode Consortium,
contains strings of the form \uXXXX which need to be parsed into the
appropriate character, and some of *those* are then surrogate pairs,
which need to be further converted.
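The two steps described here (parse the literal \uXXXX escapes, then merge any surrogate pairs that result) might look roughly like this. It is a sketch only: the thread does not show the actual parsing code, the helper name `unescape` and the regular expression are hypothetical, and it assumes the escapes always have exactly four hex digits.

```python
import re

# A literal backslash, "u", and exactly four hex digits (assumed format).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(text):
    """Expand literal \\uXXXX escapes, then merge surrogate pairs."""
    # Step 1: replace each escape with the 16-bit code unit it names.
    # Surrogates are still separate code points at this stage.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so each high/low surrogate pair
    # becomes a single supplementary character.
    return units.encode("utf-16", "surrogatepass").decode("utf-16")


# The raw string holds backslashes, as a line read from the file would.
result = unescape(r"x\udb40\udd9dy")
print(ascii(result))
```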
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2015-05-03 18:09 +0100 |
| Message-ID | <mailman.73.1430672962.12865.python-list@python.org> |
| In reply to | #89880 |
On 2015-05-03 17:26, Jon Ribbens wrote:
> On 2015-05-03, MRAB <python@mrabarnett.plus.com> wrote:
>> On 2015-05-03 16:32, Jon Ribbens wrote:
>>> That would, unfortunately, be "tell the Unicode Consortium to format
>>> their documents differently", which seems unlikely to happen. I'm
>>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>>
>> That document looks like it's encoded in UTF-8.
>
> It is. But it also, for reasons best known to the Unicode Consortium,
> contains strings of the form \uXXXX which need to be parsed into the
> appropriate character, and some of *those* are then surrogate pairs,
> which need to be further converted.
>

Ah, so it's r"\udb40\udd9d". :-)

There's also a mistake in this bit:

"""
# Note that according to the \uXXXX escaping convention, a supplemental character (> 0x10FFFF) is represented
# by a sequence of two surrogate characters: the first between D800 and DBFF, and the second between DC00 and DFFF.
"""
| From | Jon Ribbens <jon+usenet@unequivocal.co.uk> |
|---|---|
| Date | 2015-05-03 19:20 +0000 |
| Message-ID | <slrnmkct94.230.jon+usenet@frosty.unequivocal.co.uk> |
| In reply to | #89885 |
On 2015-05-03, MRAB <python@mrabarnett.plus.com> wrote:
> There's also a mistake in this bit:
>
> """
> # Note that according to the \uXXXX escaping convention, a supplemental
> character (> 0x10FFFF) is represented
> # by a sequence of two surrogate characters: the first between D800 and
> DBFF, and the second between DC00 and DFFF.
> """

Do you mean that it should say "(> 0xFFFF)" ? Far be it from me to
correct the Unicode Consortium on the subject of Unicode ;-)
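The "(> 0xFFFF)" reading is consistent with the standard UTF-16 arithmetic, which only applies to code points from U+10000 to U+10FFFF. A small sketch, with function names chosen just for this example:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    # Only code points above 0xFFFF need a pair; 0x10FFFF is the upper
    # limit of Unicode, not the threshold.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits, D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits, DC00..DFFF
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0xE019D)
print(hex(high), hex(low))  # 0xdb40 0xdd9d
```

This reproduces the pair from the original question, and `from_surrogate_pair(0xDB40, 0xDD9D)` gives back 0xE019D.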