Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!feeder.news-service.com!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!newsgate.cistron.nl!newsgate.news.xs4all.nl!194.109.133.85.MISMATCH!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Mon, 16 May 2011 09:26:26 +0200
From: Karim <karim.liateni@free.fr>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110424 Thunderbird/3.1.10
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: Trying to understand html.parser.HTMLParser
References: <4DD03B69.6050301@gmail.com> <BANLkTikcc6wVX+aO7KATa8AK1BJJKN5kMw@mail.gmail.com>
In-Reply-To: <BANLkTikcc6wVX+aO7KATa8AK1BJJKN5kMw@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1630.1305530799.9059.python-list@python.org>
Lines: 40
NNTP-Posting-Host: 82.94.164.166
Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:5494

On 05/16/2011 03:06 AM, David Robinow wrote:
> On Sun, May 15, 2011 at 4:45 PM, Andrew Berg <bahamutzero8825@gmail.com> wrote:
>> I'm trying to understand why HTMLParser.feed() isn't returning the whole
>> page. My test script is this:
>>
>> import urllib.request
>> import html.parser
>> class MyHTMLParser(html.parser.HTMLParser):
>>     def handle_starttag(self, tag, attrs):
>>         if tag == 'a' and attrs:
>>             print(tag,'-',attrs)
>>
>> url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
>> page = urllib.request.urlopen(url).read()
>> parser = MyHTMLParser()
>> parser.feed(str(page))
>>
>> I can do print(page) and get the entire HTML source, but
>> parser.feed(str(page)) only spits out the information for the top links
>> and none of the "revisionxxxx" links. Ultimately, I just want to find
>> the name of the first "revisionxxxx" link (right now it's
>> "revision1995", when a new build is uploaded it will be "revision2000"
>> or whatever). I figure this is a relatively simple page; once I
>> understand all of this, I can move on to more complicated pages.
> You've got bad HTML. Look closely and you'll see that there's no space
> between the "revisionxxxx" strings and the style tag following.
> The parser doesn't like this. I don't know a solution other than
> fixing the html.
> (I created a local copy, edited it and it worked.)
Hello,

For bad HTML, use a regular expression or BeautifulSoup (google it). Below
is an example that extracts all HTML links:

import re

# htmlSource is assumed to hold the page as text (str), not bytes
linksList = re.findall(r'<a href=(.*?)>.*?</a>', htmlSource)
for link in linksList:
    print(link)
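
For the original goal (the name of the first "revisionxxxx" link), here is a minimal sketch along the same lines. The sample HTML is made up by me to imitate the missing space before the style attribute; I have not copied it from the real page, and the function name first_revision is only for illustration.

```python
import re

# Made-up sample standing in for the downloaded page. Note the missing
# space before style=, the kind of bad HTML described above.
html_source = (
    '<a href="?dir=./64bit">64bit</a>\n'
    '<a href="?dir=./64bit/8bit_depth/revision1995"style="x">revision1995</a>\n'
    '<a href="?dir=./64bit/8bit_depth/revision1992"style="x">revision1992</a>\n'
)

def first_revision(text):
    """Return the first 'revisionNNNN' found inside an href, or None."""
    # [^"]*? stays inside the quoted href value, so links without
    # "revision" in them are skipped.
    match = re.search(r'<a\s+href="[^"]*?(revision\d+)"', text)
    return match.group(1) if match else None

print(first_revision(html_source))
```

With the real page you would pass decoded text, for example page.decode('utf-8', errors='replace'), rather than str(page), because str() on a bytes object gives its repr (b'...') and not the page text. The utf-8 encoding is my assumption about that site.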

Cheers
Karim