Re: Unicode codepoints

Started by	Saul Spatz <saul.spatz@gmail.com>
First post	2011-06-22 06:43 -0700
Last post	2011-06-23 00:15 +0200
Articles	2 — 2 participants

  Re: Unicode codepoints Saul Spatz <saul.spatz@gmail.com> - 2011-06-22 06:43 -0700
    Re: Unicode codepoints Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-06-23 00:15 +0200

#8207 — Re: Unicode codepoints

From	Saul Spatz <saul.spatz@gmail.com>
Date	2011-06-22 06:43 -0700
Subject	Re: Unicode codepoints
Message-ID	<bec262ea-9690-4efa-8ae0-1657b46af58d@glegroupsg2000goo.googlegroups.com>

Thanks.  I agree with you about the generator.  Using your first suggestion, code points above U+FFFF get separated into the two "surrogate pair" characters from UTF-16.  So instead of U+10FFFF I get U+DBFF and U+DFFF.
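
For readers following along, the split described above can be reproduced with plain integer arithmetic. This is only a sketch of the standard UTF-16 surrogate calculation; the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for illustration and do not come from the thread.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10FFFF)
print("U+%04X U+%04X" % (high, low))   # prints "U+DBFF U+DFFF"
```

Running the pair back through `from_surrogate_pair` gives the original code point again, which is what a narrow build has to do implicitly whenever it displays such a character.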

#8246

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2011-06-23 00:15 +0200
Message-ID	<mailman.302.1308780921.1164.python-list@python.org>
In reply to	#8207

2011/6/22 Saul Spatz <saul.spatz@gmail.com>:
> Thanks.  I agree with you about the generator.  Using your first suggestion, code points above U+FFFF get separated into the two "surrogate pair" characters from UTF-16.  So instead of U+10FFFF I get U+DBFF and U+DFFF.
Hi,
If you really need the wide unicode functionality on a narrow unicode
python build, and only need string indices in which a surrogate pair
counts as one item, you can build a list of single characters and
surrogate pairs, e.g.:

>>> surrog_txt=u"a𐌰 𐌱 𐌲 𐌳"
>>> surrog_txt
u'a\U00010330 \U00010331 \U00010332 \U00010333'
>>> print surrog_txt
a𐌰 𐌱 𐌲 𐌳
>>> list(surrog_txt)
[u'a', u'\ud800', u'\udf30', u' ', u'\ud800', u'\udf31', u' ',
u'\ud800', u'\udf32', u' ', u'\ud800', u'\udf33']
>>> import re
>>> re.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.", surrog_txt)
[u'a', u'\U00010330', u' ', u'\U00010331', u' ', u'\U00010332', u' ',
u'\U00010333']
>>>

This way, indices, slices and len() work on the supplementary list
as they would on a normal string; however, it probably won't be very
efficient for longer texts.
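
The same grouping can be sketched without a regex, working directly on UTF-16 code units. The code below is written for Python 3, where (since 3.3) strings are already indexed by code point, so it is illustrative rather than something you would normally need there. The helper names `utf16_code_units` and `group_code_units` are hypothetical.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def group_code_units(units):
    """Group code units so that each surrogate pair becomes one item."""
    items = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            items.append((unit, units[i + 1]))   # pair counts as one item
            i += 2
        else:
            items.append((unit,))                # ordinary BMP code unit
            i += 1
    return items


units = utf16_code_units("a\U00010330 \U00010331")
items = group_code_units(units)
print(len(units), len(items))   # prints "6 4"
```

Like the list built with re.findall, the grouped list costs one pass over the whole text plus extra memory, so it suits short strings better than long ones.
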
Note that surrogates are not the only asymmetry between code points,
characters (and glyphs - to take the visual representation of those
into account) - there are combining diacritical marks, in various
combinations with precomposed diacritical characters, multiple
normalisation modes etc.
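
To make the point about combining marks concrete, here is a small sketch using the standard unicodedata module (Python 3 syntax; the variable names are only for illustration). The same visible character can be one code point or two, and normalisation converts between the forms.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Both render as the same character, but the lengths differ
# and a plain comparison treats them as different strings.
print(len(precomposed), len(decomposed))   # prints "1 2"
print(precomposed == decomposed)           # prints "False"

# NFC composes where possible, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)   # prints "True True"
```

So even with surrogate pairs handled, len() and indexing still count code points, not what a reader would call characters.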

regards,
   vbr
