Groups > comp.lang.python > #8174 > unrolled thread
| Started by | Saul Spatz <saul.spatz@gmail.com> |
|---|---|
| First post | 2011-06-21 20:37 -0700 |
| Last post | 2011-06-22 03:00 -0700 |
| Articles | 5 — 5 participants |
Unicode codepoints Saul Spatz <saul.spatz@gmail.com> - 2011-06-21 20:37 -0700
Re: Unicode codepoints Chris Angelico <rosuav@gmail.com> - 2011-06-22 14:00 +1000
Re: Unicode codepoints Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-06-22 10:42 +0200
Re: Unicode codepoints Peter Otten <__peter__@web.de> - 2011-06-22 11:00 +0200
Re: Unicode codepoints jmfauth <wxjmfauth@gmail.com> - 2011-06-22 03:00 -0700
| From | Saul Spatz <saul.spatz@gmail.com> |
|---|---|
| Date | 2011-06-21 20:37 -0700 |
| Subject | Unicode codepoints |
| Message-ID | <ae8fd9c1-88af-41ef-abb5-3a1883634d0e@glegroupsg2000goo.googlegroups.com> |
Hi,
I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Have you a better way?
def codePoints(s):
    ''' return a list of the Unicode codepoints in the string s '''
    answer = []
    skip = False
    for k, c in enumerate(s):
        if skip:
            skip = False
            answer.append(ord(s[k-1:k+1]))
            continue
        if not 0xd800 <= ord(c) <= 0xdfff:
            answer.append(ord(c))
        else:
            skip = True
    return answer

if __name__ == '__main__':
    s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
    code = codePoints(s)
    for c in code:
        print('U+'+hex(c)[2:])
Thanks for any help you can give me.
Saul
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-22 14:00 +1000 |
| Message-ID | <mailman.267.1308715226.1164.python-list@python.org> |
| In reply to | #8174 |
On Wed, Jun 22, 2011 at 1:37 PM, Saul Spatz <saul.spatz@gmail.com> wrote:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Have you a better way?
Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?
for c in s:
    print('U+'+hex(ord(c))[2:])
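A minimal sketch of that approach, assuming Python 3.3 or later, where iterating over a str yields one whole code point at a time. The helper name code_point_labels and the temporary file are hypothetical stand-ins (the thread reads test.txt), and the labels use zero-padded uppercase hex rather than the hex()[2:] form shown above.

```python
import os
import tempfile


def code_point_labels(text):
    """Return a 'U+XXXX' label for every character in text."""
    return ['U+{:04X}'.format(ord(ch)) for ch in text]


# Write a small UTF-8 sample to a throwaway file (stand-in for test.txt).
with tempfile.NamedTemporaryFile('w', encoding='utf8', suffix='.txt',
                                 delete=False) as f:
    f.write('a\u00e9\u20ac')
    path = f.name

try:
    # Decode the file as UTF-8, then label each character.
    with open(path, encoding='utf8', errors='replace') as f:
        labels = code_point_labels(f.read())
finally:
    os.remove(path)

for label in labels:
    print(label)
```

With errors='replace', any undecodable bytes show up as U+FFFD instead of raising an exception.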
But if you do need the codePoints() function, I'd do it as a generator.
> def codePoints(s):
>     ''' return a list of the Unicode codepoints in the string s '''
>     skip = False
>     for k, c in enumerate(s):
>         if skip:
>             skip = False
>             yield ord(s[k-1:k+1])
>             continue
>         if not 0xd800 <= ord(c) <= 0xdfff:
>             yield ord(c)
>         else:
>             skip = True
Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.
But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?
Chris Angelico
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2011-06-22 10:42 +0200 |
| Message-ID | <mailman.275.1308732140.1164.python-list@python.org> |
| In reply to | #8174 |
Hi,
what functionality should codePoints(...) add over just iterating
through the characters in the unicode string directly (besides
filtering out the surrogates)?
It seems that you can just use
s = open(r'C:\install\filter-utf-8.txt', encoding = 'utf8', errors = 'replace').read()
for c in s:
    print('U+'+hex(ord(c))[2:])
or, if needed, add this condition before the print:
    if not 0xd800 <= ord(c) <= 0xdfff:
You can also use string formatting to do the hex conversion with the more usual zero padding; the print(...) calls would be:
"older style formatting"
print("U+%04x"%(ord(c),))
or the newer, potentially more powerful way using format(...)
print("U+{:04x}".format(ord(c)))
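To make the difference between these spellings concrete, here is a small side-by-side sketch. The variable names are illustrative only; the uppercase variant is an extra, shown because U+00E9 is the conventional way of writing a code point.

```python
cp = ord('\u00e9')  # LATIN SMALL LETTER E WITH ACUTE, 0xe9

plain = 'U+' + hex(cp)[2:]         # strips the '0x' prefix, no padding
old_style = 'U+%04x' % (cp,)       # %-formatting, padded to 4 hex digits
new_style = 'U+{:04x}'.format(cp)  # str.format, same padding
upper = 'U+{:04X}'.format(cp)      # uppercase hex digits

print(plain, old_style, new_style, upper)
```

The 04 width is only a minimum, so code points above U+FFFF still print all of their digits.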
hth,
vbr
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2011-06-22 11:00 +0200 |
| Message-ID | <itsav4$a97$1@solani.org> |
| In reply to | #8174 |
Here's an alternative implementation that follows Chris' suggestion to use a
generator:
def codepoints(s):
    s = iter(s)
    for c in s:
        if 0xd800 <= ord(c) <= 0xdfff:
            c += next(s, "")
        yield ord(c)
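The generator above relies on ord() accepting a two-character surrogate pair, as narrow builds of that era did. On Python 3.3 and later, a decoded str is already a sequence of whole code points and ord() raises TypeError for a two-character string, so the pairing only matters when working with raw UTF-16 code units. The following is a rough sketch of the arithmetic under that assumption; utf16_units and combine_surrogates are hypothetical helper names, not anything posted in the thread.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))


def combine_surrogates(units):
    """Yield code points from a sequence of UTF-16 code units.

    A high surrogate (0xD800-0xDBFF) followed by a low surrogate
    (0xDC00-0xDFFF) is merged into one code point above U+FFFF.
    Unpaired surrogates are passed through unchanged.
    """
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


text = 'a\U0001F600'
units = utf16_units(text)
points = list(combine_surrogates(units))
print([hex(u) for u in units])
print([hex(p) for p in points])
```

For well-formed text the result matches [ord(c) for c in text] on Python 3.3+, which is why the plain loop is enough there.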
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2011-06-22 03:00 -0700 |
| Message-ID | <071ac539-429b-4731-bcd2-4b5487be6a85@x12g2000yql.googlegroups.com> |
| In reply to | #8193 |
That seems correct to me.
>>> '\\u{:04x}'.format(ord(u'é'))
\u00e9
>>> '\\U{:08x}'.format(ord(u'é'))
\U000000e9
>>>
because
>>> u'\U00e9'
  File "<eta last command>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: end of string in escape sequence
>>> u'\U000000e9'
é
>>> u'\u00e9'
é
>>>
from this:
>>> u'éléphant\N{EURO SIGN}'
éléphant€
>>> u = u'éléphant\N{EURO SIGN}'
>>> ''.join(['\\u{:04x}'.format(ord(c)) for c in u])
\u00e9\u006c\u00e9\u0070\u0068\u0061\u006e\u0074\u20ac
>>>
Skipping surrogate pairs makes little sense,
because the purpose is to display code points!
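Building on the escape formatting shown in the session above, one possible sketch picks the four-digit \u form for code points up to U+FFFF and the eight-digit \U form for anything larger, so nothing has to be skipped. It assumes Python 3.3 or later (one str item per code point); to_escapes is a hypothetical name.

```python
def to_escapes(text):
    """Return text as a string of \\uXXXX / \\UXXXXXXXX escapes."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            parts.append('\\u{:04x}'.format(cp))  # fits in 4 hex digits
        else:
            parts.append('\\U{:08x}'.format(cp))  # needs the long form
    return ''.join(parts)


escaped = to_escapes('\u00e9l\u00e9phant\N{EURO SIGN}')
print(escaped)

# Round trip: the escapes are plain ASCII and decode back to the text.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored)
```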