Unicode codepoints

Started by	Saul Spatz <saul.spatz@gmail.com>
First post	2011-06-21 20:37 -0700
Last post	2011-06-22 03:00 -0700
Articles	5 — 5 participants

  Unicode codepoints Saul Spatz <saul.spatz@gmail.com> - 2011-06-21 20:37 -0700
    Re: Unicode codepoints Chris Angelico <rosuav@gmail.com> - 2011-06-22 14:00 +1000
    Re: Unicode codepoints Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-06-22 10:42 +0200
    Re: Unicode codepoints Peter Otten <__peter__@web.de> - 2011-06-22 11:00 +0200
      Re: Unicode codepoints jmfauth <wxjmfauth@gmail.com> - 2011-06-22 03:00 -0700

#8174 — Unicode codepoints

From	Saul Spatz <saul.spatz@gmail.com>
Date	2011-06-21 20:37 -0700
Subject	Unicode codepoints
Message-ID	<ae8fd9c1-88af-41ef-abb5-3a1883634d0e@glegroupsg2000goo.googlegroups.com>

Hi,

I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes.  After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic.  Have you a better way?

def codePoints(s):
    ''' return a list of the Unicode codepoints in the string s '''
    answer = []
    skip = False
    for k, c in enumerate(s):
        if skip:
            skip = False
            answer.append(ord(s[k-1:k+1]))
            continue
        if not 0xd800 <= ord(c) <= 0xdfff:
            answer.append(ord(c))
        else:
            skip = True
    return answer
            
if __name__ == '__main__':
    s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
    code = codePoints(s)
    for c in code:
        print('U+'+hex(c)[2:])

Thanks for any help you can give me.

Saul

#8179

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-22 14:00 +1000
Message-ID	<mailman.267.1308715226.1164.python-list@python.org>
In reply to	#8174

On Wed, Jun 22, 2011 at 1:37 PM, Saul Spatz <saul.spatz@gmail.com> wrote:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes.  After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic.  Have you a better way?

Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?

for c in s:
  print('U+'+hex(ord(c))[2:])

But if you do need the codePoints() function, I'd do it as a generator.

> def codePoints(s):
>    ''' yield the Unicode codepoints in the string s '''
>    skip = False
>    for k, c in enumerate(s):
>        if skip:
>            skip = False
>            yield ord(s[k-1:k+1])
>            continue
>        if not 0xd800 <= ord(c) <= 0xdfff:
>            yield ord(c)
>        else:
>            skip = True

Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.

But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?
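
To make the distinction concrete, here is a minimal sketch (assuming Python 3; the sample text is purely illustrative and not from the original post). Iterating over a str yields one-character strings, while iterating over a bytes object yields the raw byte values as integers:

```python
# Assumes Python 3. The sample text is an illustrative assumption.
text = 'h\u00e9'                 # 'h' followed by U+00E9, as a Unicode string (str)
data = text.encode('utf-8')      # the same text encoded as UTF-8 bytes

# Iterating a str yields one-character strings; ord() gives each code point.
str_values = [ord(c) for c in text]

# Iterating a bytes object yields integers (the raw byte values).
byte_values = list(data)

print(str_values)    # [104, 233]
print(byte_values)   # [104, 195, 169]
```

So a function that calls ord() on each element only makes sense for a str; the file opened with encoding='utf8' in the original script already gives one.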

Chris Angelico

#8191

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2011-06-22 10:42 +0200
Message-ID	<mailman.275.1308732140.1164.python-list@python.org>
In reply to	#8174

2011/6/22 Saul Spatz <saul.spatz@gmail.com>:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes.  After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic.  Have you a better way?
>
> def codePoints(s):
>    ''' return a list of the Unicode codepoints in the string s '''
>    answer = []
>    skip = False
>    for k, c in enumerate(s):
>        if skip:
>            skip = False
>            answer.append(ord(s[k-1:k+1]))
>            continue
>        if not 0xd800 <= ord(c) <= 0xdfff:
>            answer.append(ord(c))
>        else:
>            skip = True
>    return answer
>
> if __name__ == '__main__':
>    s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
>    code = codePoints(s)
>    for c in code:
>        print('U+'+hex(c)[2:])
>
> Thanks for any help you can give me.
>
> Saul

Hi,
what functionality should codePoints(...) add over just iterating
through the characters in the unicode string directly (besides
filtering out the surrogates)?

It seems that you can just use

    s = open(r'C:\install\filter-utf-8.txt', encoding = 'utf8', errors = 'replace').read()
    for c in s:
        print('U+'+hex(ord(c))[2:])

or, if needed, add the condition before the print:
    if not 0xd800 <= ord(c) <= 0xdfff:

you can also use string formatting to do the hex conversion and a more
usual zero padding; the print(...) calls would be:

"older style formatting"
        print("U+%04x"%(ord(c),))

or the newer, potentially more powerful way using format(...)
        print("U+{:04x}".format(ord(c)))
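
Putting the formatting suggestion together, a minimal self-contained sketch might look like this. The helper name format_codepoints and the sample string are assumptions for illustration only, and it uses uppercase hex digits (the usual U+XXXX convention) rather than the lowercase "x" shown above:

```python
def format_codepoints(s):
    """Return a list of 'U+XXXX' labels, one per character in the str s.

    Hypothetical helper for illustration; pads to at least four hex digits.
    """
    return ['U+{:04X}'.format(ord(c)) for c in s]


# Sample text: 'a', U+00E9 (e with acute accent), U+20AC (euro sign).
for label in format_codepoints('a\u00e9\u20ac'):
    print(label)
# U+0061
# U+00E9
# U+20AC
```

In the original script, the string read from the file would be passed in place of the sample text.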

hth,
   vbr

#8193

From	Peter Otten <__peter__@web.de>
Date	2011-06-22 11:00 +0200
Message-ID	<itsav4$a97$1@solani.org>
In reply to	#8174

Saul Spatz wrote:

> Hi,
> 
> I'm just starting to learn a bit about Unicode. I want to be able to read
> a utf-8 encoded file, and print out the codepoints it encodes.  After many
> false starts, here's a script that seems to work, but it strikes me as
> awfully awkward and unpythonic.  Have you a better way?
> 
> def codePoints(s):
>     ''' return a list of the Unicode codepoints in the string s '''
>     answer = []
>     skip = False
>     for k, c in enumerate(s):
>         if skip:
>             skip = False
>             answer.append(ord(s[k-1:k+1]))
>             continue
>         if not 0xd800 <= ord(c) <= 0xdfff:
>             answer.append(ord(c))
>         else:
>             skip = True
>     return answer
>             
> if __name__ == '__main__':
>     s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
>     code = codePoints(s)
>     for c in code:
>         print('U+'+hex(c)[2:])
> 
> Thanks for any help you can give me.
> 
> Saul

Here's an alternative implementation that follows Chris' suggestion to use a 
generator:

def codepoints(s):
    s = iter(s)
    for c in s:
        if 0xd800 <= ord(c) <= 0xdfff:
            c += next(s, "")
        yield ord(c)
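
Note that the generator above depends on ord() accepting a two-character surrogate pair, which as far as I understand is specific to "narrow" builds of Python 3 before 3.3. From Python 3.3 on, a str is indexed by code point and ord() on a two-character string raises TypeError. If surrogate pairs still need to be combined (for example, in a string that already contains them), a hedged sketch using the standard UTF-16 decoding arithmetic could look like this; it is an alternative illustration, not the code posted above:

```python
def codepoints(s):
    """Yield the code points in s, combining surrogate pairs arithmetically.

    Illustrative sketch: a high surrogate followed by a low surrogate is
    merged into one code point; lone surrogates are yielded unchanged.
    """
    i, n = 0, len(s)
    while i < n:
        cp = ord(s[i])
        # High (lead) surrogates are U+D800..U+DBFF.
        if 0xD800 <= cp <= 0xDBFF and i + 1 < n:
            low = ord(s[i + 1])
            # Low (trail) surrogates are U+DC00..U+DFFF.
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1  # consume the low surrogate as well
        yield cp
        i += 1


# U+1F600 written explicitly as its surrogate pair D83D DE00.
print([hex(cp) for cp in codepoints('a\ud83d\ude00')])  # ['0x61', '0x1f600']
```

On Python 3.3 or later, text decoded from valid UTF-8 contains no surrogates, so a plain "for c in s: ord(c)" loop already gives the code points.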

#8194

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-06-22 03:00 -0700
Message-ID	<071ac539-429b-4731-bcd2-4b5487be6a85@x12g2000yql.googlegroups.com>
In reply to	#8193

That seems correct to me.

>>> '\\u{:04x}'.format(ord(u'é'))
\u00e9
>>> '\\U{:08x}'.format(ord(u'é'))
\U000000e9
>>>

because

>>> u'\U00e9'
  File "<eta last command>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes
in position 0-5: end of string in escape sequence
>>> u'\U000000e9'
é
>>> u'\u00e9'
é
>>>

from this:

>>> u'éléphant\N{EURO SIGN}'
éléphant€
>>> u = u'éléphant\N{EURO SIGN}'
>>> ''.join(['\\u{:04x}'.format(ord(c)) for c in u])
\u00e9\u006c\u00e9\u0070\u0068\u0061\u006e\u0074\u20ac
>>>

Skipping surrogate pairs makes little sense,
because the purpose is to display code points!
