

Groups > comp.lang.python > #5454 > unrolled thread

Re: Trying to understand html.parser.HTMLParser

Started by: David Robinow <drobinow@gmail.com>
First post: 2011-05-15 21:06 -0400
Last post: 2011-05-15 21:06 -0400
Articles 1 — 1 participant


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.



#5454 — Re: Trying to understand html.parser.HTMLParser

From: David Robinow <drobinow@gmail.com>
Date: 2011-05-15 21:06 -0400
Subject: Re: Trying to understand html.parser.HTMLParser
Message-ID: <mailman.1607.1305508012.9059.python-list@python.org>

On Sun, May 15, 2011 at 4:45 PM, Andrew Berg <bahamutzero8825@gmail.com> wrote:
> I'm trying to understand why HTMLParser.feed() isn't returning the whole
> page. My test script is this:
>
> import urllib.request
> import html.parser
> class MyHTMLParser(html.parser.HTMLParser):
>    def handle_starttag(self, tag, attrs):
>        if tag == 'a' and attrs:
>            print(tag,'-',attrs)
>
> url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
> page = urllib.request.urlopen(url).read()
> parser = MyHTMLParser()
> parser.feed(str(page))
>
> I can do print(page) and get the entire HTML source, but
> parser.feed(str(page)) only spits out the information for the top links
> and none of the "revisionxxxx" links. Ultimately, I just want to find
> the name of the first "revisionxxxx" link (right now it's
> "revision1995", when a new build is uploaded it will be "revision2000"
> or whatever). I figure this is a relatively simple page; once I
> understand all of this, I can move on to more complicated pages.
You've got bad HTML. Look closely and you'll see that there's no space
between the "revisionxxxx" strings and the style tag following.
The parser doesn't like this. I don't know of a solution other than
fixing the HTML.
(I created a local copy, edited it, and it worked.)
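
If editing the page by hand isn't practical, the same fix could probably be
applied in code before feeding the parser. What follows is only a rough
sketch under several assumptions: the sample markup is made up (the real page
source isn't reproduced in this thread), the names add_missing_attr_spaces and
LinkCollector are hypothetical, and the regex is a crude heuristic rather than
anything html.parser provides.

```python
import re
from html.parser import HTMLParser

# Heuristic (an assumption, not part of html.parser): find a double-quoted
# attribute value that is immediately followed by a letter, i.e. the next
# attribute name started with no space in between.
_MISSING_SPACE = re.compile(r'(="[^"<>]*")(?=[A-Za-z])')


def add_missing_attr_spaces(markup):
    """Insert the missing space after such attribute values."""
    return _MISSING_SPACE.sub(r'\1 ', markup)


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


# Hypothetical markup imitating the problem described above.
# For a real download, the bytes from urlopen(url).read() would need decoding
# first, e.g. page.decode('utf-8', errors='replace') if the page is UTF-8
# (an assumption); str(page) on bytes yields the "b'...'" repr instead.
sample = (
    '<a href="./">top</a>\n'
    '<a href="revision1995"style="color:red">revision1995</a>\n'
    '<a href="revision1990"style="color:red">revision1990</a>\n'
)

fixed = add_missing_attr_spaces(sample)

collector = LinkCollector()
collector.feed(fixed)
collector.close()

# First link whose href mentions "revision", or None if there is none.
first_revision = next((h for h in collector.hrefs if 'revision' in h), None)
print(first_revision)
```

The regex can also touch text outside tags that happens to look like
attr="value" followed by a letter, so it is worth checking the repaired
markup before relying on it.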




