Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!1.eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk>
References: <slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk>
Date: Mon, 4 May 2015 01:05:31 +1000
Subject: Re: Unicode surrogate pairs (Python 3.4)
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.67.1430665534.12865.python-list@python.org>
Lines: 24
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:89870

On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
> If I have a string containing surrogate pairs like this in Python 3.4:
>
>   "\udb40\udd9d"
>
> How do I convert it into the proper form:
>
>   "\U000E019D"
>
> ? The answer appears not to be "unicodedata.normalize".

No, it's not, because Unicode normalization is a very specific thing.
You're looking for a fix for some kind of encoding issue; Unicode
normalization translates between decomposed sequences (a base
character followed by combining characters) and their precomposed
equivalents. It doesn't do anything about surrogates.
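
To illustrate what normalization is actually for, here's a quick
sketch. The sample strings are just ones I picked for the example;
they have nothing to do with your data.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# NFC composes base + combining mark into the single precomposed character.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way, splitting the precomposed character apart.
split = unicodedata.normalize("NFD", precomposed)

print(composed == precomposed, split == decomposed)
```

Neither form has any notion of pairing up surrogates, which is why it
didn't help you.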

You shouldn't actually _have_ those in your string in the first
place. How did you construct/receive that data? Ideally, catch it at
that point, and deal with it there. But if you absolutely have to
convert the surrogates, it ought to be possible to do a sloppy UCS-2
style conversion to bytes (each surrogate written out as a plain
16-bit code unit), then a proper UTF-16 decode on the result, which
will combine each pair into a single code point.
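
Something along these lines might do it. This is a sketch rather than
a polished solution: join_surrogates is just a name I made up, and
I'm assuming the "surrogatepass" error handler is acceptable as the
"sloppy" encode step (it works with the UTF-16 codecs as of 3.4).

```python
def join_surrogates(s):
    """Recombine surrogate pairs in a str into single code points."""
    # "Sloppy" step: surrogatepass writes each surrogate out as a raw
    # 16-bit code unit instead of raising UnicodeEncodeError.
    raw = s.encode("utf-16-le", "surrogatepass")
    # Proper step: a strict UTF-16 decode pairs the code units back up.
    # An unpaired surrogate should make this raise UnicodeDecodeError.
    return raw.decode("utf-16-le")


fixed = join_surrogates("\udb40\udd9d")
print(ascii(fixed))
```

Using the explicit "utf-16-le" codec on both sides avoids having a
byte order mark in the intermediate bytes.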

ChrisA