

Newsgroups: comp.lang.python

Read utf-8 file

Started by: moonhkt <moonhkt@gmail.com>
First post: 2013-03-18 02:34 -0700
Last post: 2013-03-18 11:37 +0100
Articles: 2 (2 participants)




#41407: Read utf-8 file

From: moonhkt <moonhkt@gmail.com>
Date: 2013-03-18 02:34 -0700
Subject: Read utf-8 file
Message-ID: <1f54b5e0-efac-4636-921d-aa087d05df44@a14g2000vbm.googlegroups.com>

The file contains the text "China made":
中國 製

http://www.fileformat.info/info/unicode/char/4e2d/index.htm
UTF-16 (hex) 	0x4E2D (4e2d)
UTF-8 (hex) 	0xE4 0xB8 0xAD (e4b8ad)


Output of od -cx utf_a.text:
0000000   中  **  **  國  **  **  製  **  **  \n
            e4b8    ade5    9c8b    e8a3    bd0a
0000012
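
The dump above can be cross-checked in Python. This is only a minimal sketch, assuming Python 3.5+ and that the file holds exactly the three characters plus a newline; the helper name `hex_words` is hypothetical and not from the thread. It groups the UTF-8 bytes into 2-byte words in file order, which matches the dump shown here. Note that `od -x` prints words in the host's byte order, so on a little-endian machine each pair would appear swapped.

```python
def hex_words(data, width=2):
    """Split bytes into hex strings of `width` bytes each, in file order."""
    return [data[i:i + width].hex() for i in range(0, len(data), width)]


raw = "中國製\n".encode("utf-8")  # the same 10 bytes od reads from the file
print(len(raw))        # 10 bytes, i.e. octal 12, the final offset od prints
print(hex_words(raw))  # ['e4b8', 'ade5', '9c8b', 'e8a3', 'bd0a']
```

Each of the three characters occupies three bytes in UTF-8, so the characters straddle the 2-byte words; 中 is the first three bytes, e4 b8 ad.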

When the file is read by Python, why does it display as below?

中國製

u'\u4e2d\u570b\u88fd\n'  <--- Value 中國製
<-- UTF-8 value
u'\u4e2d' 中      CJK UNIFIED IDEOGRAPH-4E2D
u'\u570b' 國      CJK UNIFIED IDEOGRAPH-570B
u'\u88fd' 製      CJK UNIFIED IDEOGRAPH-88FD

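import unicodedata
import codecs         # for reading the file as Unicode
....

file = codecs.open(options.filename, 'r', 'utf-8')
try:
    for line in file:
        #print repr(line)
        #print "========="
        print line.encode("utf-8")
        for keys in line.split(","):
            print repr(keys), " <--- Value", keys.encode("utf-8"), "<-- UTF-8 value"
            for key in keys:
                try:
                    name = unicodedata.name(unicode(key))
                    print "%-9s %-8s %-30s" % (repr(key), key.encode("utf-8"), name)
                except ValueError:
                    # Handler not shown in the original post; unicodedata.name()
                    # raises ValueError for characters without a name, e.g. "\n".
                    pass
finally:
    # Cleanup not shown in the original post.
    file.close()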


How do I display
e4b8ad for 中 in Python?



#41408

From: Peter Otten <__peter__@web.de>
Date: 2013-03-18 11:37 +0100
Message-ID: <mailman.3421.1363603041.2939.python-list@python.org>
In reply to: #41407

moonhkt wrote:

> How do I display
> e4b8ad for 中 in Python?

Python 2

>>> print u"中".encode("utf-8").encode("hex")
e4b8ad


Python 3

>>> import binascii
>>> print(binascii.b2a_hex("中".encode("utf-8")).decode("ascii"))
e4b8ad
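
On Python 3.5 and later, `bytes.hex()` gives the same result without `binascii`. As a minimal sketch, the per-character loop from the original post could be ported to Python 3 roughly like this; the helper name `describe` and the `"<no name>"` fallback are hypothetical and not from the thread.

```python
import unicodedata


def describe(text):
    """Return (character, code point, UTF-8 hex, Unicode name) for each character."""
    rows = []
    for ch in text:
        code_point = "U+%04X" % ord(ch)
        utf8_hex = ch.encode("utf-8").hex()  # bytes.hex() requires Python 3.5+
        # Without a default, unicodedata.name() raises ValueError for unnamed
        # characters such as "\n", so pass a fallback string instead.
        name = unicodedata.name(ch, "<no name>")
        rows.append((ch, code_point, utf8_hex, name))
    return rows


for ch, code_point, utf8_hex, name in describe("中國製"):
    print("%s  %-7s %-8s %s" % (ch, code_point, utf8_hex, name))
```

For 中 this reports the code point U+4E2D next to the UTF-8 bytes e4b8ad, which makes the difference between the `\u4e2d` escape Python displays and the bytes `od` shows explicit.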
