Re: Unicode surrogate pairs (Python 3.4)

Date Mon, 4 May 2015 01:48:47 +1000
Subject Re: Unicode surrogate pairs (Python 3.4)
From Chris Angelico <rosuav@gmail.com>

On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
<jon+usenet@unequivocal.co.uk> wrote:
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
are in your input. I'm not sure what the best way to deal with that
is... it's a bit of a mess. You may find yourself needing to do
something manually, unless there's a way to ask Python to encode to
pseudo-UCS-2 that allows surrogates. Some languages may have sloppy
conversions available, but Python's seems to be quite strict (which is
correct). Is there an error handler that can do this?
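
One possible answer, offered as a sketch rather than a definitive fix: the
"surrogatepass" error handler (which works with the UTF-16 codecs as of
Python 3.4) lets the surrogates be written out as UTF-16 code units and then
decoded back as a single code point. The helper name
decode_escaped_surrogates and the regex are illustrative assumptions, and
this assumes the input contains literal \uXXXX escapes as in the example
above.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_escaped_surrogates(text):
    """Hypothetical helper: expand literal \\uXXXX escapes, then merge
    any surrogate pairs into the code points they represent."""
    # Step 1: replace each escape with the character for that code unit.
    # This can leave surrogate code points (U+D800..U+DFFF) in the string.
    unescaped = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16. 'surrogatepass' lets the
    # surrogates be encoded; on decoding, an adjacent high/low pair comes
    # back as one code point, while a lone surrogate is kept as-is.
    raw = unescaped.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


sample = r"\udb40\udd9d"          # twelve characters: backslashes included
result = decode_escaped_surrogates(sample)
print(len(result), hex(ord(result)))   # 1 0xe019d
```

Passing "surrogatepass" on the decode side as well is a design choice: an
unpaired surrogate in the input survives instead of raising
UnicodeDecodeError, which may or may not be what you want for test data.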

ChrisA
