| References | <mailman.267.1308715226.1164.python-list@python.org> <bec262ea-9690-4efa-8ae0-1657b46af58d@glegroupsg2000goo.googlegroups.com> |
|---|---|
| Date | 2011-06-23 00:15 +0200 |
| Subject | Re: Unicode codepoints |
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.302.1308780921.1164.python-list@python.org> (permalink) |
2011/6/22 Saul Spatz <saul.spatz@gmail.com>:
> Thanks. I agree with you about the generator. Using your first suggestion, code points above U+FFFF get separated into two "surrogate pair" characters from UTF-16. So instead of U+10FFFF I get U+DBFF and U+DFFF.
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Hi,
If you really need wide-Unicode functionality on a narrow Unicode Python build, and only need string indices in which a surrogate pair counts as one item, you can build a list of single characters or surrogate pairs, e.g.:

>>> surrog_txt = u"a𐌰 𐌱 𐌲 𐌳"
>>> surrog_txt
u'a\U00010330 \U00010331 \U00010332 \U00010333'
>>> print surrog_txt
a𐌰 𐌱 𐌲 𐌳
>>> list(surrog_txt)
[u'a', u'\ud800', u'\udf30', u' ', u'\ud800', u'\udf31', u' ', u'\ud800', u'\udf32', u' ', u'\ud800', u'\udf33']
>>> import re
>>> re.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.", surrog_txt)
[u'a', u'\U00010330', u' ', u'\U00010331', u' ', u'\U00010332', u' ', u'\U00010333']
>>>

This way, indices, slices and len() work on the supplementary list as they would on a normal string; however, it probably won't be very efficient for longer texts.

Note that surrogates are not the only asymmetry between code points, characters (and glyphs - to take the visual representation of those into account): there are also combining diacritical marks, in various combinations with precomposed diacritical characters, multiple normalisation modes, etc.

regards,
vbr
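For readers trying this on Python 3, where the narrow/wide build distinction is gone, here is a minimal sketch of the same pairing idea. It assumes Python 3 and emulates a narrow build by spelling a string out as UTF-16 code units; the names `to_utf16_units`, `split_code_points` and `join_pair` are made up for illustration and do not come from the post or from any library. The regular expression is the one from the session above.

```python
import re

# Same pattern as in the post: a high surrogate followed by a low surrogate,
# or otherwise any single item.
PAIR_OR_UNIT = re.compile("(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.")


def to_utf16_units(text):
    """Emulate a narrow build: return a str with one item per UTF-16 code unit."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little"))
        for i in range(0, len(data), 2)
    )


def split_code_points(units):
    """Return a list in which each surrogate pair counts as a single item."""
    return PAIR_OR_UNIT.findall(units)


def join_pair(item):
    """Turn a two-item surrogate pair into the code point it encodes."""
    if len(item) == 2:
        high, low = ord(item[0]), ord(item[1])
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    return item


text = "a\U00010330 \U00010331"
units = to_utf16_units(text)      # 6 code units: a, pair, space, pair
items = split_code_points(units)  # 4 items, one per code point
restored = "".join(join_pair(item) for item in items)
```

As in the original suggestion, indexing and `len()` then operate on the `items` list rather than on the string, so the efficiency caveat for longer texts applies here as well.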