Re: lxml can't output right unicode result

Started by: MRAB <python@mrabarnett.plus.com>
First post: 2012-09-07 02:14 +0100
Last post: 2012-09-07 02:14 +0100
Articles: 1 (1 participant)

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


#28656 — Re: lxml can't output right unicode result

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-09-07 02:14 +0100
Subject: Re: lxml can't output right unicode result
Message-ID: <mailman.338.1346980441.27098.python-list@python.org>

On 07/09/2012 01:21, contro opinion wrote:
> I edited a file named test and saved it in GBK encoding. My system is
> Debian, locale en.utf-8; Python 2.6, locale utf-8.
>
> <html>
> <p>你</p>
> </html>
>
> In a terminal I run:
>
> xxd  test
>
> 0000000: 3c68 746d 6c3e 0a3c 703e c4e3 3c2f 703e  <html>.<p>..</p>
> 0000010: 0a3c 2f68 746d 6c3e 0a                   .</html>.
>
> 你 means "you" in English.
> "\xc4\xe3" is its GBK encoding.
> "\xe4\xbd\xa0" is its UTF-8 encoding.
> u"\u4f60" is its Unicode code point.
> Now I parse it with lxml.
>
>  >>> "你"
> '\xe4\xbd\xa0'
>  >>> "你".decode("utf-8")
> u'\u4f60'
>  >>> "你".decode("utf-8").encode("gbk")
> '\xc4\xe3'
>  >>>
>
> code1:
>
>  >>> import lxml.html
>  >>> root=lxml.html.parse("test")
>  >>> d=root.xpath("//p")
>  >>> d[0].text_content()
> u'\xc4\xe3'
>
> From what I have read, lxml parses a file and outputs Unicode.
> Why doesn't d[0].text_content() return u'\u4f60'?
>
> code2:
>
> import codecs
> import lxml.html
> f = codecs.open('test', 'r', 'gbk')
> root=lxml.html.parse(f)
> d=root.xpath("//p")
> d[0].text_content()
> u'\xe4\xbd\xa0'
>
> Why doesn't d[0].text_content() return u'\u4f60'?
>
> I have been confused by this problem for two days.
>
You can't just put some text into a file and expect it to know
"magically" what the encoding is. You have to specify that the encoding
is GBK, something like this (in a file actually encoded as GBK, of
course):

<html>
<meta http-equiv="content-type" content="text/html; charset=gbk">
<p>你</p>
</html>
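
To see why the declaration matters, here is a minimal sketch using only the standard library. It is written for Python 3 (the quoted session is Python 2.6), and it assumes that a parser given no charset reads each byte as one Latin-1 character, which is consistent with the u'\xc4\xe3' shown in code1. The variable names are illustrative.

```python
# "\u4f60" is the character 你, written as an escape so the source stays ASCII.
char = "\u4f60"

gbk_bytes = char.encode("gbk")      # the two bytes c4 e3 seen in the xxd dump
utf8_bytes = char.encode("utf-8")   # the three bytes e4 bd a0

# Reading the GBK bytes one byte per character (Latin-1) gives two
# unrelated characters, U+00C4 and U+00E3, as in code1.
wrong = gbk_bytes.decode("latin-1")

# Decoding with the encoding the file was actually saved in recovers 你.
right = gbk_bytes.decode("gbk")

# The same byte-per-character misreading applied to the UTF-8 bytes
# gives three characters, U+00E4 U+00BD U+00A0, as in code2.
wrong_utf8 = utf8_bytes.decode("latin-1")

print(ascii(wrong), ascii(right), ascii(wrong_utf8))
```

The last step is only one possible reading of code2's u'\xe4\xbd\xa0': text that had already been decoded appears to have been re-encoded as UTF-8 and then read byte by byte. That is an inference from the output, not something the thread confirms.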

I hope there's a good reason why you're using that encoding and not
UTF-8.
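
If editing the file to add the meta tag is not an option, another approach is to tell the parser the encoding directly. This is a minimal sketch, assuming Python 3 and an installed lxml; it parses the bytes from memory rather than from the file named test so that it is self-contained.

```python
import lxml.html

# The original poster's document as GBK-encoded bytes, with no charset
# declaration. "\u4f60" is the character 你.
gbk_bytes = "<html>\n<p>\u4f60</p>\n</html>\n".encode("gbk")

# Give the parser the encoding explicitly instead of letting it guess.
parser = lxml.html.HTMLParser(encoding="gbk")
root = lxml.html.document_fromstring(gbk_bytes, parser=parser)

text = root.xpath("//p")[0].text_content()
print(ascii(text))
```

The same parser object can be passed to lxml.html.parse("test", parser=parser) to read from the file instead.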
