| Started by | Karim <karim.liateni@free.fr> |
|---|---|
| First post | 2011-05-16 09:26 +0200 |
| Last post | 2011-05-16 09:26 +0200 |
| Articles | 1 — 1 participant |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
| From | Karim <karim.liateni@free.fr> |
|---|---|
| Date | 2011-05-16 09:26 +0200 |
| Subject | Re: Trying to understand html.parser.HTMLParser |
| Message-ID | <mailman.1630.1305530799.9059.python-list@python.org> |
On 05/16/2011 03:06 AM, David Robinow wrote:
> On Sun, May 15, 2011 at 4:45 PM, Andrew Berg<bahamutzero8825@gmail.com> wrote:
>> I'm trying to understand why HTMLParser.feed() isn't returning the whole
>> page. My test script is this:
>>
>> import urllib.request
>> import html.parser
>> class MyHTMLParser(html.parser.HTMLParser):
>>     def handle_starttag(self, tag, attrs):
>>         if tag == 'a' and attrs:
>>             print(tag,'-',attrs)
>>
>> url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
>> page = urllib.request.urlopen(url).read()
>> parser = MyHTMLParser()
>> parser.feed(str(page))
>>
>> I can do print(page) and get the entire HTML source, but
>> parser.feed(str(page)) only spits out the information for the top links
>> and none of the "revisionxxxx" links. Ultimately, I just want to find
>> the name of the first "revisionxxxx" link (right now it's
>> "revision1995", when a new build is uploaded it will be "revision2000"
>> or whatever). I figure this is a relatively simple page; once I
>> understand all of this, I can move on to more complicated pages.
> You've got bad HTML. Look closely and you'll see that there's no space
> between the "revisionxxxx" strings and the style tag following.
> The parser doesn't like this. I don't know a solution other than
> fixing the html.
> (I created a local copy, edited it and it worked.)
Hello,
Use a regular expression for bad HTML, or BeautifulSoup (google it). Below
is an example that extracts all HTML links:
    import re
    # htmlSource is the page source as a string
    linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
    for link in linksList:
        print(link)
Cheers
Karim
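
The regular-expression approach suggested above could be sketched roughly as follows for the original goal of finding the first "revisionxxxx" link. This is only an illustration under stated assumptions: the sample markup in html_source is hypothetical (it imitates the missing space before the style attribute that was described, not the real page), and find_links and first_revision are made-up helper names. On a real page, the bytes returned by urlopen().read() would first need decoding, for example page.decode('utf-8') if the page is UTF-8, because str(page) on a bytes object yields its "b'...'" representation rather than the text.

```python
import re

# Hypothetical sample imitating the malformed markup described in the thread:
# no space between the quoted href value and the style attribute that follows.
html_source = (
    '<a href="?dir=./64bit">64bit</a>\n'
    '<a href="./revision1995"style="color:red">revision1995</a>\n'
    '<a href="./revision1990"style="color:red">revision1990</a>\n'
)


def find_links(source):
    """Return the href value of every <a href=...> tag, quoted or unquoted."""
    # Stop the capture at a quote, whitespace, or '>' so that an attribute
    # glued onto the href value does not end up in the result.
    pattern = r'<a\s+href\s*=\s*["\']?([^"\'\s>]+)'
    return re.findall(pattern, source, flags=re.IGNORECASE)


def first_revision(source):
    """Return the first 'revisionNNNN' name found in a link target, or None."""
    for href in find_links(source):
        match = re.search(r'revision\d+', href)
        if match:
            return match.group(0)
    return None


links = find_links(html_source)
newest = first_revision(html_source)
print(links)
print(newest)
```

A pattern like this only handles anchors whose first attribute is href; for anything less regular, a tolerant parser such as BeautifulSoup is likely the safer choice.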