Re: Trying to understand html.parser.HTMLParser

Started by	Karim <karim.liateni@free.fr>
First post	2011-05-16 09:26 +0200
Last post	2011-05-16 09:26 +0200
Articles	1 — 1 participant

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

#5494 — Re: Trying to understand html.parser.HTMLParser

From	Karim <karim.liateni@free.fr>
Date	2011-05-16 09:26 +0200
Subject	Re: Trying to understand html.parser.HTMLParser
Message-ID	<mailman.1630.1305530799.9059.python-list@python.org>

On 05/16/2011 03:06 AM, David Robinow wrote:
> On Sun, May 15, 2011 at 4:45 PM, Andrew Berg<bahamutzero8825@gmail.com>  wrote:
>> I'm trying to understand why HTMLParser.feed() isn't returning the whole
>> page. My test script is this:
>>
>> import urllib.request
>> import html.parser
>> class MyHTMLParser(html.parser.HTMLParser):
>>     def handle_starttag(self, tag, attrs):
>>         if tag == 'a' and attrs:
>>             print(tag,'-',attrs)
>>
>> url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
>> page = urllib.request.urlopen(url).read()
>> parser = MyHTMLParser()
>> parser.feed(str(page))
>>
>> I can do print(page) and get the entire HTML source, but
>> parser.feed(str(page)) only spits out the information for the top links
>> and none of the "revisionxxxx" links. Ultimately, I just want to find
>> the name of the first "revisionxxxx" link (right now it's
>> "revision1995", when a new build is uploaded it will be "revision2000"
>> or whatever). I figure this is a relatively simple page; once I
>> understand all of this, I can move on to more complicated pages.
> You've got bad HTML. Look closely and you'll see there's no space
> between the "revisionxxxx" strings and the style tag following.
> The parser doesn't like this. I don't know a solution other than
> fixing the html.
> (I created a local copy, edited it and it worked.)
Hello,

Use a regular expression for bad HTML, or BeautifulSoup (google it). Below is
an example that extracts all HTML links:

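import re

# htmlSource must be the page source as a str (decode the bytes from urlopen().read() first)
linksList = re.findall(r'<a href=(.*?)>.*?</a>', htmlSource)
for link in linksList:
    print(link)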
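
To make this concrete for the "revisionxxxx" case, here is a minimal sketch. It is only an illustration under some assumptions: the sample markup is made up to imitate the missing space before the style attribute (the real page may look different), UTF-8 is a guess at the page encoding, and the names extract_hrefs and first_revision are hypothetical helpers, not part of any library. It also uses a slightly different pattern from the findall above, so that only the href value is captured. Note that in Python 3 urlopen(...).read() returns bytes, and str(page) gives the "b'...'" representation rather than the decoded text, so the sketch decodes first.

```python
import re

# Hypothetical sample standing in for urllib.request.urlopen(url).read();
# the real page markup may differ.
page = (
    b'<a href="?dir=./64bit">64bit</a>\n'
    b'<a href="./revision1995/x264.exe"style="color:red">revision1995</a>\n'
    b'<a href="./revision1990/x264.exe"style="color:red">revision1990</a>\n'
)

# Decode the bytes instead of calling str(page); UTF-8 is an assumption here.
html_source = page.decode('utf-8', errors='replace')


def extract_hrefs(source):
    """Return the href value of every <a href=...> tag, quoted or not."""
    pattern = r'<a\s+href\s*=\s*["\']?([^"\'\s>]+)'
    return re.findall(pattern, source, flags=re.IGNORECASE)


def first_revision(source):
    """Return the first 'revisionNNNN' name found in a link, or None."""
    for href in extract_hrefs(source):
        match = re.search(r'revision\d+', href)
        if match:
            return match.group(0)
    return None


links = extract_hrefs(html_source)
for link in links:
    print(link)
print(first_revision(html_source))
```

A regex like this is fine for one simple, known page. It assumes href is the first attribute of the tag, so for anything more complicated a real parser such as BeautifulSoup is the safer choice.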

Cheers
Karim
