Re: Unicode codepoints

References	<mailman.267.1308715226.1164.python-list@python.org> <bec262ea-9690-4efa-8ae0-1657b46af58d@glegroupsg2000goo.googlegroups.com>
Date	2011-06-23 00:15 +0200
Subject	Re: Unicode codepoints
From	Vlastimil Brom <vlastimil.brom@gmail.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.302.1308780921.1164.python-list@python.org>

2011/6/22 Saul Spatz <saul.spatz@gmail.com>:
> Thanks.  I agree with you about the generator.  Using your first suggestion, code points above U+FFFF get separated into two "surrogate pair" characters from UTF-16.  So instead of U+10FFFF I get U+DBFF and U+DFFF.
>
Hi,
If you really need wide-Unicode behaviour on a narrow Unicode Python
build, and all you need is string indexing in which each surrogate
pair counts as one item, you can build a list of single characters
and surrogate pairs, e.g.:

>>> surrog_txt=u"a𐌰 𐌱 𐌲 𐌳"
>>> surrog_txt
u'a\U00010330 \U00010331 \U00010332 \U00010333'
>>> print surrog_txt
a𐌰 𐌱 𐌲 𐌳
>>> list(surrog_txt)
[u'a', u'\ud800', u'\udf30', u' ', u'\ud800', u'\udf31', u' ',
u'\ud800', u'\udf32', u' ', u'\ud800', u'\udf33']
>>> import re
>>> re.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.", surrog_txt)
[u'a', u'\U00010330', u' ', u'\U00010331', u' ', u'\U00010332', u' ',
u'\U00010333']
>>>

This way, indexing, slicing and len() work on the supplementary list
as they would on a normal string; however, building the list probably
won't be very efficient for longer texts.
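
A minimal sketch of the same idea as a generator, which avoids building
the whole list up front. It is written for Python 3 (an assumption; the
session above is Python 2), so the sample string uses explicit surrogate
escapes to imitate what a narrow build stores. The names iter_chars and
to_code_point are made up for illustration; to_code_point shows the
arithmetic that maps a surrogate pair back to its code point.

```python
import re

# A high surrogate followed by a low surrogate, or else any single code unit.
_CHAR_RE = re.compile("[\ud800-\udbff][\udc00-\udfff]|.", re.DOTALL)


def iter_chars(text):
    """Yield single characters, keeping each surrogate pair as one item."""
    for match in _CHAR_RE.finditer(text):
        yield match.group()


def to_code_point(item):
    """Return the code point of a single character or of a surrogate pair."""
    if len(item) == 2:
        high, low = ord(item[0]), ord(item[1])
        # Each surrogate carries 10 bits; the pair encodes an offset from U+10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return ord(item)


# Explicit surrogate escapes imitate how a narrow build stores U+10330, U+10331.
sample = "a\ud800\udf30 \ud800\udf31"
chars = list(iter_chars(sample))
print(len(sample), len(chars))                  # 6 4
print([hex(to_code_point(c)) for c in chars])   # ['0x61', '0x10330', '0x20', '0x10331']
```

The same arithmetic explains the pair quoted above: U+DBFF and U+DFFF
combine to 0x10000 + (0x3FF << 10) + 0x3FF = 0x10FFFF.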
Note that surrogates are not the only mismatch between code points,
characters and glyphs (the visual representation of characters):
there are also combining diacritical marks, which can occur in
various combinations with precomposed accented characters, multiple
normalisation forms, etc.
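
A small illustration of the combining-mark point, using the standard
unicodedata module. It is a sketch in Python 3 syntax (an assumption, as
the session above is Python 2), and "é" is just a convenient example:
the same visible character can be one code point or two, and the two
spellings only compare equal after normalisation.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE: one code point
decomposed = "e\u0301"    # "e" plus COMBINING ACUTE ACCENT: two code points

print(len(precomposed), len(decomposed))                         # 1 2
print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

# A non-zero combining class marks a code point that attaches to a base character.
print(unicodedata.combining("\u0301") != 0)                      # True
```

So even on a wide build, len() counts code points, not the characters a
reader would see.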

regards,
   vbr

Thread

Re: Unicode codepoints Saul Spatz <saul.spatz@gmail.com> - 2011-06-22 06:43 -0700
  Re: Unicode codepoints Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-06-23 00:15 +0200