Date: Mon, 31 Dec 2012 01:51:47 -0500
From: Dave Angel <d@davea.name>
To: python-list@python.org
Subject: Re: how to get the source of html in lxml?
Newsgroups: comp.lang.python
Message-ID: <mailman.1490.1356936724.29569.python-list@python.org>

On 12/31/2012 01:32 AM, contro opinion wrote:
> import urllibimport lxml.html
> down='http://blog.sina.com.cn/s/blog_71f3890901017hof.html'
> file=urllib.urlopen(down).read()
> root=lxml.html.document_fromstring(file)
> body=root.xpath('//div[@class="articalContent  "]')[0]print body.text_content()
>
> When I run the code, what I get is the text content. How can I get the html
> source code of it?
>
>

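That's got several syntax errors (statements are run together on the
import line and on the last line), but if you remove the parts with
errors, you'll find the html source of the whole page in the misnamed
variable 'file'. It holds a string, not a file object, because the
read() method returns the contents as a string.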
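
To make that concrete, here is a minimal sketch of the fetch step with the
syntax problems removed. It assumes Python 3, where urlopen lives in
urllib.request and read() returns bytes that have to be decoded (in the
Python 2 code quoted above it returns a str directly). The names
fetch_source, sample_markup and demo_url are made up for illustration, and a
data: URL stands in for the blog address so the example runs without network
access; the utf-8 default is also just an assumption about the page.

```python
import urllib.parse
import urllib.request


def fetch_source(url, encoding="utf-8"):
    """Return the markup served at `url` as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3, a plain str in Python 2
    # Decode explicitly; the encoding is an assumption about the page.
    return raw.decode(encoding, errors="replace")


# Stand-in for the real page, so the sketch works offline.
sample_markup = "<div class='articalContent'><p>hello</p></div>"
demo_url = "data:text/html," + urllib.parse.quote(sample_markup)

# Named page_source rather than 'file', since it holds a string.
page_source = fetch_source(demo_url)
print(page_source)
```

If what you want is the markup of just the matched div rather than the whole
page, lxml.html.tostring(body) serializes an element back to html; that may
be closer to what the original question was after.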



-- 

DaveA