| Date | 2015-05-03 16:53 +0100 |
|---|---|
| From | MRAB <python@mrabarnett.plus.com> |
| Subject | Re: Unicode surrogate pairs (Python 3.4) |
| References | <slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk> <mailman.67.1430665534.12865.python-list@python.org> <slrnmkcftt.230.jon+usenet@frosty.unequivocal.co.uk> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.69.1430668429.12865.python-list@python.org> |

On 2015-05-03 16:32, Jon Ribbens wrote:
> On 2015-05-03, Chris Angelico <rosuav@gmail.com> wrote:
>> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
>> <jon+usenet@unequivocal.co.uk> wrote:
>>> If I have a string containing surrogate pairs like this in Python 3.4:
>>>
>>> "\udb40\udd9d"
>>>
>>> How do I convert it into the proper form:
>>>
>>> "\U000E019D"
>>>
>>> ? The answer appears not to be "unicodedata.normalize".
>>
>> No, it's not, because Unicode normalization is a very specific thing.
>> You're looking for a fix for some kind of encoding issue; Unicode
>> normalization translates between combining characters and combined
>> characters.
>>
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>
That document looks like it's encoded in UTF-8.

>> But if you absolutely have to convert the surrogates, it ought to be
>> possible to do a sloppy UCS-2 conversion to bytes, then a proper
>> UTF-16 decode on the result.
>
> Python doesn't appear to have UCS-2 support, so I guess what you're
> saying is that I have to write my own surrogate-decoder? This seems
> a little surprising.
>
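
For reference, the "sloppy conversion to bytes, then a proper UTF-16 decode" idea quoted above can probably be done without a hand-written surrogate decoder. The following is only a minimal sketch, assuming Python 3.4 or later, where the `surrogatepass` error handler is accepted by the UTF-16 codecs. The helper name `join_surrogates` is made up for illustration and is not from the thread.

```python
def join_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into the code points they encode.

    Hypothetical helper name, used here for illustration only.
    """
    # 'surrogatepass' lets the encoder write surrogate code points out as
    # plain 16-bit code units instead of raising UnicodeEncodeError.
    raw = text.encode("utf-16-le", "surrogatepass")
    # A normal (strict) UTF-16 decode then reads each high/low pair
    # as a single code point.
    return raw.decode("utf-16-le")


s = "\udb40\udd9d"
fixed = join_surrogates(s)
print(len(s), len(fixed))        # 2 1
print(fixed == "\U000E019D")     # True
```

Strings that contain no surrogates should pass through unchanged, so it ought to be safe to apply after whatever step turns the escapes in the file into characters.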
Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 14:40 +0000
Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 01:05 +1000
Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 15:32 +0000
Re: Unicode surrogate pairs (Python 3.4) Marko Rauhamaa <marko@pacujo.net> - 2015-05-03 18:35 +0300
Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 01:48 +1000
Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 16:30 +0000
Re: Unicode surrogate pairs (Python 3.4) Chris Angelico <rosuav@gmail.com> - 2015-05-04 02:47 +1000
Re: Unicode surrogate pairs (Python 3.4) MRAB <python@mrabarnett.plus.com> - 2015-05-03 16:53 +0100
Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 16:26 +0000
Re: Unicode surrogate pairs (Python 3.4) MRAB <python@mrabarnett.plus.com> - 2015-05-03 18:09 +0100
Re: Unicode surrogate pairs (Python 3.4) Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-03 19:20 +0000