| References | <ae8fd9c1-88af-41ef-abb5-3a1883634d0e@glegroupsg2000goo.googlegroups.com> |
|---|---|
| Date | 2011-06-22 14:00 +1000 |
| Subject | Re: Unicode codepoints |
| From | Chris Angelico <rosuav@gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.267.1308715226.1164.python-list@python.org> |
On Wed, Jun 22, 2011 at 1:37 PM, Saul Spatz <saul.spatz@gmail.com> wrote:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Have you a better way?
Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?
    for c in s:
        print('U+'+hex(ord(c))[2:])
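A small variation on that loop, sketched on the assumption that the usual zero-padded, uppercase U+XXXX notation is wanted. The name format_codepoints is hypothetical and not part of the original script.

```python
def format_codepoints(s):
    """Return the codepoints of s as 'U+XXXX' strings (hypothetical helper)."""
    # {:04X} pads to at least four hex digits and uses uppercase letters.
    return ['U+{:04X}'.format(ord(c)) for c in s]

# 'a', e-acute and the euro sign, written as escapes to stay ASCII-only.
for label in format_codepoints('a\u00e9\u20ac'):
    print(label)
```

Characters outside the BMP simply get more than four digits, since the format width is only a minimum.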
But if you do need the codePoints() function, I'd do it as a generator.
> def codePoints(s):
>     ''' return a list of the Unicode codepoints in the string s '''
>     skip = False
>     for k, c in enumerate(s):
>         if skip:
>             skip = False
>             yield ord(s[k-1:k+1])
>             continue
>         if not 0xd800 <= ord(c) <= 0xdfff:
>             yield ord(c)
>         else:
>             skip = True
Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.
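If the string really can contain surrogate pairs, one way to avoid calling ord() on a two-character slice (which not every interpreter accepts) is to combine the two halves arithmetically. This is only a sketch under that assumption; code_points is a hypothetical name, and an unpaired surrogate is passed through unchanged rather than treated as an error.

```python
def code_points(s):
    """Yield the codepoints in s, joining UTF-16 surrogate pairs (sketch)."""
    k = 0
    n = len(s)
    while k < n:
        hi = ord(s[k])
        # A high surrogate followed by a low surrogate encodes one codepoint.
        if (0xD800 <= hi <= 0xDBFF and k + 1 < n
                and 0xDC00 <= ord(s[k + 1]) <= 0xDFFF):
            lo = ord(s[k + 1])
            yield 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
            k += 2
        else:
            # Ordinary character, or a lone surrogate passed through as-is.
            yield hi
            k += 1

for cp in code_points('a\ud83d\ude00'):
    print('U+{:04X}'.format(cp))
```

Because it is a generator, a loop that iterated over the old list can iterate over code_points(s) without other changes.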
But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?
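The distinction matters because iterating over bytes and iterating over a decoded string give different things. A minimal illustration, assuming the file contents arrive as UTF-8 bytes (the sample data here is made up):

```python
# Three characters encoded as UTF-8: 'h', e-acute, euro sign.
data = 'h\u00e9\u20ac'.encode('utf-8')

# Iterating over bytes gives one int per byte, not per character.
print(list(data))

# Decoding first gives a str, so each element is one character.
text = data.decode('utf-8')
print([hex(ord(c)) for c in text])
```

Opening the file in text mode with encoding='utf-8' does the decoding step for you.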
Chris Angelico