Date: Sun, 03 May 2015 18:09:20 +0100
From: MRAB <python@mrabarnett.plus.com>
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: Unicode surrogate pairs (Python 3.4)
References: <slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk> <mailman.67.1430665534.12865.python-list@python.org> <slrnmkcftt.230.jon+usenet@frosty.unequivocal.co.uk> <mailman.69.1430668429.12865.python-list@python.org> <slrnmkcj3j.230.jon+usenet@frosty.unequivocal.co.uk>
In-Reply-To: <slrnmkcj3j.230.jon+usenet@frosty.unequivocal.co.uk>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.73.1430672962.12865.python-list@python.org>
Lines: 25
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:89885

On 2015-05-03 17:26, Jon Ribbens wrote:
> On 2015-05-03, MRAB <python@mrabarnett.plus.com> wrote:
>> On 2015-05-03 16:32, Jon Ribbens wrote:
>>> That would, unfortunately, be "tell the Unicode Consortium to format
>>> their documents differently", which seems unlikely to happen. I'm
>>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>>
>> That document looks like it's encoded in UTF-8.
>
> It is. But it also, for reasons best known to the Unicode Consortium,
> contains strings of the form \uXXXX which need to be parsed into the
> appropriate character, and some of *those* are then surrogate pairs,
> which need to be further converted.
>
Ah, so it's r"\udb40\udd9d". :-)
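One way to handle that, as a rough sketch only: substitute each \uXXXX escape with the code unit it names, then round-trip the result through UTF-16 so that any surrogate pairs get joined up. The helper name decode_u_escapes is made up for illustration, and the sketch assumes the only escapes in the text are of the 4-hex-digit \uXXXX form.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text):
    """Expand \\uXXXX escapes and join surrogate pairs (hypothetical helper)."""
    # Step 1: turn each escape into the single code unit it names.
    # The str may now hold lone surrogates (U+D800..U+DFFF).
    units = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: 'surrogatepass' lets the surrogates through the encoder as
    # ordinary UTF-16 code units; decoding then combines each high/low pair.
    return units.encode("utf-16", "surrogatepass").decode("utf-16")


print(ascii(decode_u_escapes(r"\udb40\udd9d")))
```

An unpaired surrogate will make the final decode raise UnicodeDecodeError, which is probably what you want for a test-data file.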

There's also a mistake in this bit:

"""
# Note that according to the \uXXXX escaping convention, a supplemental character (> 0x10FFFF) is represented
# by a sequence of two surrogate characters: the first between D800 and DBFF, and the second between DC00 and DFFF.
"""
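
For reference, 0x10FFFF is the highest code point there is, so nothing can be greater than it; the supplementary characters are the ones above 0xFFFF (U+10000 to U+10FFFF). A small sketch of the surrogate arithmetic shows both ends of that range (combine_surrogates is a name invented here, not anything from the standard library):

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The lowest and highest pairs map to the ends of the supplementary range.
assert combine_surrogates(0xD800, 0xDC00) == 0x10000
assert combine_surrogates(0xDBFF, 0xDFFF) == 0x10FFFF

print(hex(combine_surrogates(0xDB40, 0xDD9D)))  # 0xe019d
```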