Re: Unicode codepoints

References <ae8fd9c1-88af-41ef-abb5-3a1883634d0e@glegroupsg2000goo.googlegroups.com>
Date 2011-06-22 14:00 +1000
Subject Re: Unicode codepoints
From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.267.1308715226.1164.python-list@python.org>


On Wed, Jun 22, 2011 at 1:37 PM, Saul Spatz <saul.spatz@gmail.com> wrote:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes.  After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic.  Have you a better way?

Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?

for c in s:
  print('U+'+hex(ord(c))[2:])
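
If you want the conventional zero-padded, upper-case form (U+0041 rather than U+41), a format spec does the trimming and padding in one step. This is only a sketch; the helper name format_codepoints is made up for illustration and is not from the original script.

```python
def format_codepoints(s):
    """Return a 'U+XXXX' label for each character of the str s.

    Hypothetical helper: pads to at least four hex digits, upper case.
    """
    return ['U+{:04X}'.format(ord(c)) for c in s]


for label in format_codepoints('a\u00e9\u20ac'):
    print(label)
```

Characters outside the Basic Multilingual Plane simply get more digits, since the 04 in the format spec is a minimum width.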

But if you do need the codePoints() function, I'd do it as a generator.

> def codePoints(s):
>    ''' yield the Unicode codepoints in the string s '''
>    skip = False
>    for k, c in enumerate(s):
>        if skip:
>            skip = False
>            yield ord(s[k-1:k+1])
>            continue
>        if not 0xd800 <= ord(c) <= 0xdfff:
>            yield ord(c)
>        else:
>            skip = True

Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.
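
One caveat on the generator above: it leans on ord() accepting a two-character surrogate pair, which as far as I know only worked on narrow builds; from Python 3.3 on, ord() raises TypeError for any string longer than one character. Below is a rough sketch of the same idea that does the pair arithmetic by hand. The name code_points and the handling of lone surrogates (passed through unchanged) are my assumptions, not part of the original script.

```python
def code_points(s):
    """Yield the code points in s, combining UTF-16 surrogate pairs.

    Assumption: a lone (unpaired) surrogate is yielded as-is.
    """
    high = None  # pending high surrogate, if any
    for c in s:
        cp = ord(c)
        if high is not None:
            if 0xDC00 <= cp <= 0xDFFF:
                # Valid pair: combine the high and low halves.
                yield 0x10000 + ((high - 0xD800) << 10) + (cp - 0xDC00)
                high = None
                continue
            # Previous high surrogate had no partner.
            yield high
            high = None
        if 0xD800 <= cp <= 0xDBFF:
            high = cp  # wait to see whether a low surrogate follows
        else:
            yield cp
    if high is not None:
        yield high  # string ended on a lone high surrogate


for cp in code_points('a\ud83d\ude00b'):
    print('U+{:04X}'.format(cp))
```

On a current Python a properly decoded str has no surrogate pairs to begin with, so this matters mainly for data that arrived as UTF-16 code units.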

But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?
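
The distinction matters because the two types iterate differently in Python 3. A small sketch, using a two-character sample string chosen purely for illustration:

```python
text = '\u00e9\u20ac'          # a str: two Unicode characters
data = text.encode('utf-8')    # a bytes object: five UTF-8 bytes

# Iterating over bytes yields integers, one per encoded byte.
byte_values = list(data)

# Iterating over str yields characters; ord() gives each code point.
codepoint_values = [ord(c) for c in data.decode('utf-8')]

print(byte_values)
print(codepoint_values)
```

So a function that calls ord() on each element wants a str; given bytes, ord() would be handed an int and raise TypeError.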

Chris Angelico


