Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!selfless.tophat.at!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=lKoKdHujkcnFHRG6hTw9ijvcum/4VGjcV5qnqgDoIYIhpKwHD6OE0LWTU1Cyp2gEey VxZA32RD6Yf6mbNFO+pBxVMBjoIYLjOVLEkf6cR3UndQ/iJdXIIRVDT8FmSRZjLbFyVh 6kPK0gKW9G9UrEU8EfAMlAndwtoAx44t5Ci9Q=
MIME-Version: 1.0
In-Reply-To: <bec262ea-9690-4efa-8ae0-1657b46af58d@glegroupsg2000goo.googlegroups.com>
References: <mailman.267.1308715226.1164.python-list@python.org> <bec262ea-9690-4efa-8ae0-1657b46af58d@glegroupsg2000goo.googlegroups.com>
Date: Thu, 23 Jun 2011 00:15:19 +0200
Subject: Re: Unicode codepoints
From: Vlastimil Brom <vlastimil.brom@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.302.1308780921.1164.python-list@python.org>
Lines: 39
NNTP-Posting-Host: 82.94.164.166
Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:8246

2011/6/22 Saul Spatz <saul.spatz@gmail.com>:
> Thanks.  I agree with you about the generator.  Using your first
> suggestion, code points above U+FFFF get separated into two "surrogate
> pair" characters from UTF-16.  So instead of U+10FFFF I get U+DBFF and
> U+DFFF.
Hi,
If you really need wide-Unicode behaviour on a narrow Unicode build of
Python, and only need string indices in which a surrogate pair counts
as one item, you can build a list of single characters or surrogate
pairs, e.g.:

>>> surrog_txt = u"a𐌰 𐌱 𐌲 𐌳"
>>> surrog_txt
u'a\U00010330 \U00010331 \U00010332 \U00010333'
>>> print surrog_txt
a𐌰 𐌱 𐌲 𐌳
>>> list(surrog_txt)
[u'a', u'\ud800', u'\udf30', u' ', u'\ud800', u'\udf31', u' ',
u'\ud800', u'\udf32', u' ', u'\ud800', u'\udf33']
>>> import re
>>> re.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.", surrog_txt)
[u'a', u'\U00010330', u' ', u'\U00010331', u' ', u'\U00010332', u' ',
u'\U00010333']
>>>

This way, indices, slices and len() work on the supplementary list as
you would expect for a normal string; however, building the list
probably won't be very efficient for longer texts.
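
If it helps, the same grouping can be written without a regex. The
following is only a sketch and rests on some assumptions: it targets
Python 3, where str is already indexed by code point, so the
narrow-build situation is imitated by splitting the text into UTF-16
code units first; utf16_units and group_code_points are helper names
made up for this illustration.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Assumption: this stands in for what indexing a unicode string
    yields on a narrow build.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def group_code_points(units):
    """Join each high/low surrogate pair into a single code point."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 decoding: 10 bits from each surrogate.
            low = units[i + 1]
            out.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character (or an unpaired surrogate, kept as is).
            out.append(unit)
            i += 1
    return out


surrog_txt = "a\U00010330 \U00010331 \U00010332 \U00010333"
units = utf16_units(surrog_txt)
chars = [chr(cp) for cp in group_code_points(units)]

print(len(units), len(chars))
print([hex(u) for u in utf16_units("\U0010FFFF")])
```

Here units has 12 items, like the list(surrog_txt) output above, while
chars has 8, so indexing and slicing chars works per code point. The
last line shows U+10FFFF splitting into 0xdbff and 0xdfff.
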
Note that surrogates are not the only mismatch between code points,
characters and glyphs (the visual representation of those characters):
there are also combining diacritical marks, which can appear in
various combinations with precomposed accented characters, multiple
normalisation forms, etc.
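
To illustrate the point about combining marks, here is a small sketch
using the standard unicodedata module (again assuming Python 3).
group_combining is a made-up helper and only a crude approximation: it
attaches combining marks to the preceding character and is not a full
grapheme-cluster segmentation.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# Same visible character, different code point sequences and lengths.
print(composed == decomposed, len(composed), len(decomposed))

# Normalisation converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)


def group_combining(text):
    """Attach each combining mark to the character before it."""
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(len(group_combining("e\u0301a")))
```

Normalising to NFC before counting reduces such differences, but not
every base-plus-mark sequence has a precomposed form, so some grouping
of this kind is still needed if one item per visible character is the
goal.
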

regards,
   vbr