


Re: Unicode surrogate pairs (Python 3.4)

References <slrnmkccs4.apd.jon+usenet@frosty.unequivocal.co.uk> <mailman.67.1430665534.12865.python-list@python.org> <slrnmkcftt.230.jon+usenet@frosty.unequivocal.co.uk>
Date 2015-05-04 01:48 +1000
Subject Re: Unicode surrogate pairs (Python 3.4)
From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.68.1430668130.12865.python-list@python.org>



On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
are in your input. I'm not sure what the best way to deal with that
is... it's a bit of a mess. You may find yourself needing to do
something manually, unless there's a way to ask Python to encode to
pseudo-UCS-2 that allows surrogates. Some languages may have sloppy
conversions available, but Python's seems to be quite strict (which is
correct). Is there an errors handler that can do this?
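
One possible answer to that last question, as a rough sketch rather than a definitive fix: Python 3.4's "surrogatepass" error handler works with the utf-16 codecs, so a round trip through UTF-16 can join a surrogate pair into the single code point it stands for. The helper name decode_u_escapes and the regex are made up for this illustration, and it assumes the input only uses the literal \uXXXX form shown above.

```python
import re

# Hypothetical helper, not part of any library: matches a literal
# backslash, a "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text):
    """Expand literal \\uXXXX sequences and join surrogate pairs."""
    # Step 1: replace each literal \uXXXX with the code unit it names.
    # At this point surrogates are still separate (lone) characters.
    with_units = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: "surrogatepass" lets the encoder write the surrogates out
    # as raw UTF-16 code units; the (strict) decoder then reads each
    # high/low pair back as one supplementary-plane character.
    # An unpaired surrogate will raise UnicodeDecodeError here.
    return with_units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


raw = "\\udb40\\udd9d"           # 12 characters, backslashes included
fixed = decode_u_escapes(raw)    # one character, U+E019D
print(len(raw), len(fixed), hex(ord(fixed)))
```

The regex is used instead of the unicode_escape codec because that codec treats the rest of the input as Latin-1, which would mangle any non-ASCII text already present in the line.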

ChrisA



