From: Peter Otten <__peter__@web.de>
Newsgroups: comp.lang.python
Subject: Re: Unicode codepoints
Date: Wed, 22 Jun 2011 11:00:24 +0200

Saul Spatz wrote:

> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read
> a utf-8 encoded file, and print out the codepoints it encodes. After many
> false starts, here's a script that seems to work, but it strikes me as
> awfully awkward and unpythonic. Have you a better way?
>
> def codePoints(s):
>     ''' return a list of the Unicode codepoints in the string s '''
>     answer = []
>     skip = False
>     for k, c in enumerate(s):
>         if skip:
>             skip = False
>             answer.append(ord(s[k-1:k+1]))
>             continue
>         if not 0xd800 <= ord(c) <= 0xdfff:
>             answer.append(ord(c))
>         else:
>             skip = True
>     return answer
>
> if __name__ == '__main__':
>     s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
>     code = codePoints(s)
>     for c in code:
>         print('U+'+hex(c)[2:])
>
> Thanks for any help you can give me.
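>
> Saul

Here's an alternative implementation that follows Chris' suggestion to use a
generator:

def codepoints(s):
    # Yield the code points in s, joining each surrogate pair into one value.
    s = iter(s)
    for c in s:
        if 0xd800 <= ord(c) <= 0xdfff:
            # Surrogate: append the following code unit, if there is one.
            c += next(s, "")
        # Like the original script, this relies on ord() accepting a
        # two-unit surrogate pair, as it does on a narrow build.
        yield ord(c)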
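
If you would rather not depend on ord() accepting a two-character string, the pair can be combined by hand. What follows is only a sketch under that assumption, not a replacement for the generator above; the name codepoints_portable is made up for this example. It uses the standard UTF-16 arithmetic (0x10000 plus the ten payload bits of each surrogate) and passes unpaired surrogates through unchanged. On Python 3.3 and later, where str is indexed by code point, text decoded from UTF-8 normally contains no surrogates at all, so the loop reduces to yielding ord(c) for each character.

```python
def codepoints_portable(s):
    """Yield the Unicode code points in s, combining surrogate pairs by hand.

    Hypothetical helper for illustration; it never calls ord() on more
    than one character, so it does not depend on how the build stores str.
    """
    i, n = 0, len(s)
    while i < n:
        cp = ord(s[i])
        # A high (lead) surrogate followed by a low (trail) surrogate
        # encodes a single code point above U+FFFF.
        if 0xD800 <= cp <= 0xDBFF and i + 1 < n:
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1  # skip the low surrogate we just consumed
        yield cp
        i += 1


if __name__ == "__main__":
    for cp in codepoints_portable("a\u00e9\U0001F600"):
        print("U+{:04X}".format(cp))
```

Formatting with "U+{:04X}" pads short values to four hex digits, which matches the usual U+XXXX notation more closely than hex(c)[2:].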