MIME-Version: 1.0
In-Reply-To: <ae8fd9c1-88af-41ef-abb5-3a1883634d0e@glegroupsg2000goo.googlegroups.com>
References: <ae8fd9c1-88af-41ef-abb5-3a1883634d0e@glegroupsg2000goo.googlegroups.com>
Date: Wed, 22 Jun 2011 10:42:17 +0200
Subject: Re: Unicode codepoints
From: Vlastimil Brom <vlastimil.brom@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.275.1308732140.1164.python-list@python.org>
Lines: 65
NNTP-Posting-Host: 82.94.164.166
Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:8191

2011/6/22 Saul Spatz <saul.spatz@gmail.com>:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read
> a utf-8 encoded file, and print out the codepoints it encodes. After many
> false starts, here's a script that seems to work, but it strikes me as
> awfully awkward and unpythonic. Have you a better way?
>
> def codePoints(s):
>     ''' return a list of the Unicode codepoints in the string s '''
>     answer = []
>     skip = False
>     for k, c in enumerate(s):
>         if skip:
>             skip = False
>             answer.append(ord(s[k-1:k+1]))
>             continue
>         if not 0xd800 <= ord(c) <= 0xdfff:
>             answer.append(ord(c))
>         else:
>             skip = True
>     return answer
>
> if __name__ == '__main__':
>     s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
>     code = codePoints(s)
>     for c in code:
>         print('U+'+hex(c)[2:])
>
> Thanks for any help you can give me.
>
> Saul

Hi,
what functionality should codePoints(...) add over just iterating
through the characters in the unicode string directly (besides the
special handling of the surrogates)?

It seems that you can just use

    s = open(r'C:\install\filter-utf-8.txt', encoding='utf8',
             errors='replace').read()
    for c in s:
        print('U+' + hex(ord(c))[2:])

or, if needed, add this condition before the print:
        if not 0xd800 <= ord(c) <= 0xdfff:
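
Put together, the loop and the condition might look like the following rough sketch. The helper name code_points and the in-memory sample string are only assumptions for illustration (the sample stands in for the file contents), and it assumes Python 3.3 or later, where iterating over a str yields whole code points:

```python
def code_points(text):
    """Return the code points of text, leaving out any surrogates."""
    # code_points is a made-up helper name, not part of any library.
    return [ord(c) for c in text if not 0xd800 <= ord(c) <= 0xdfff]


# An in-memory sample stands in for the text read from the file:
# 'a', e-acute, the euro sign, and one lone surrogate.
sample = 'a\u00e9\u20ac\ud800'
for cp in code_points(sample):
    print('U+' + hex(cp)[2:])
```

The lone surrogate at the end of the sample is dropped, so only three lines are printed.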

You can also use string formatting to do the hex conversion, with the
more usual zero padding; the print(...) calls would be:

older %-style formatting:
        print("U+%04x" % (ord(c),))

or the newer, potentially more powerful way using format(...):
        print("U+{:04x}".format(ord(c)))
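
As a small self-contained check that both styles give the same text, here is a sketch with a hypothetical helper u_plus (my own name, not anything from the standard library). It uses an uppercase X, since U+ notation is conventionally written with uppercase hex digits, and it assumes Python 3.3 or later so that the last sample character counts as a single code point:

```python
def u_plus(cp):
    """Format an integer code point in U+XXXX notation, zero-padded to 4 digits."""
    # u_plus is a hypothetical helper name used only for this sketch.
    return "U+{:04X}".format(cp)


# 'a', e-acute, and a character outside the Basic Multilingual Plane.
for c in 'a\u00e9\U0001d11e':
    # The %-style and format() style produce identical strings.
    assert "U+%04X" % (ord(c),) == u_plus(ord(c))
    print(u_plus(ord(c)))
```

The padding is only a minimum width, so code points above U+FFFF simply come out with five or six digits.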

hth,
   vbr