Re: Unicode surrogate pairs (Python 3.4)

Date 2015-05-03 18:09 +0100
From MRAB <python@mrabarnett.plus.com>
Subject Re: Unicode surrogate pairs (Python 3.4)
References <slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk> <mailman.67.1430665534.12865.python-list@python.org> <slrnmkcftt.230.jon+usenet@frosty.unequivocal.co.uk> <mailman.69.1430668429.12865.python-list@python.org> <slrnmkcj3j.230.jon+usenet@frosty.unequivocal.co.uk>
Newsgroups comp.lang.python
Message-ID <mailman.73.1430672962.12865.python-list@python.org>


On 2015-05-03 17:26, Jon Ribbens wrote:
> On 2015-05-03, MRAB <python@mrabarnett.plus.com> wrote:
>> On 2015-05-03 16:32, Jon Ribbens wrote:
>>> That would, unfortunately, be "tell the Unicode Consortium to format
>>> their documents differently", which seems unlikely to happen. I'm
>>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>>
>> That document looks like it's encoded in UTF-8.
>
> It is. But it also, for reasons best known to the Unicode Consortium,
> contains strings of the form \uXXXX which need to be parsed into the
> appropriate character, and some of *those* are then surrogate pairs,
> which need to be further converted.
>
Ah, so it's r"\udb40\udd9d". :-)
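
For what it's worth, here is a minimal sketch of one way to turn such escapes into real characters. It assumes the input only uses 4-digit \uXXXX escapes, and the name decode_u_escapes is made up for illustration. It substitutes each escape with the code unit it names, then round-trips through UTF-16 with the 'surrogatepass' error handler (which the utf-16 codecs accept as of Python 3.4) so that each high/low pair collapses into a single code point.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text):
    """Replace \\uXXXX escapes and merge any surrogate pairs they form."""
    # Step 1: each escape becomes a single code unit; this may leave
    # surrogates (U+D800..U+DFFF) sitting in the string as separate items.
    with_units = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: 'surrogatepass' lets the surrogates through the encoder, and
    # the strict UTF-16 decoder then joins each valid pair into one code point.
    return with_units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


if __name__ == "__main__":
    result = decode_u_escapes(r"\udb40\udd9d")
    print(len(result), hex(ord(result)))  # 1 0xe019d
```

An unpaired surrogate makes the final decode raise UnicodeDecodeError, which is probably what you want for malformed input.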

There's also a mistake in this bit:

"""
# Note that according to the \uXXXX escaping convention, a supplemental character (> 0x10FFFF) is represented
# by a sequence of two surrogate characters: the first between D800 and DBFF, and the second between DC00 and DFFF.
"""
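
The slip is presumably the "> 0x10FFFF": supplementary characters are the ones above 0xFFFF, and 0x10FFFF is the highest code point there is. A quick sketch of the standard UTF-16 arithmetic shows the range a surrogate pair can reach (the function name combine_surrogates is just for illustration):

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 high/low surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate must be in D800..DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate must be in DC00..DFFF")
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000, lowest pair
    print(hex(combine_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff, highest pair
    print(hex(combine_surrogates(0xDB40, 0xDD9D)))  # 0xe019d
```

So the pairs cover exactly 0x10000 to 0x10FFFF, and nothing above 0x10FFFF exists to be represented.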
