


Re: Trying to understand html.parser.HTMLParser

Date Sun, 15 May 2011 21:06:49 -0400
Subject Re: Trying to understand html.parser.HTMLParser
From David Robinow <drobinow@gmail.com>
To python-list@python.org
Newsgroups comp.lang.python



On Sun, May 15, 2011 at 4:45 PM, Andrew Berg <bahamutzero8825@gmail.com> wrote:
> I'm trying to understand why HTMLParser.feed() isn't returning the whole
> page. My test script is this:
>
> import urllib.request
> import html.parser
> class MyHTMLParser(html.parser.HTMLParser):
>    def handle_starttag(self, tag, attrs):
>        if tag == 'a' and attrs:
>            print(tag,'-',attrs)
>
> url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
> page = urllib.request.urlopen(url).read()
> parser = MyHTMLParser()
> parser.feed(str(page))
>
> I can do print(page) and get the entire HTML source, but
> parser.feed(str(page)) only spits out the information for the top links
> and none of the "revisionxxxx" links. Ultimately, I just want to find
> the name of the first "revisionxxxx" link (right now it's
> "revision1995", when a new build is uploaded it will be "revision2000"
> or whatever). I figure this is a relatively simple page; once I
> understand all of this, I can move on to more complicated pages.
You've got bad HTML. Look closely and you'll see that there's no space
between the "revisionxxxx" strings and the style tag following.
The parser doesn't like this. I don't know a solution other than
fixing the HTML.
(I created a local copy, edited it and it worked.)
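
If editing the page by hand isn't an option, the same fix can probably be applied in code before feeding the parser. What follows is only a rough sketch, not a tested solution for the real page: SAMPLE is made-up markup (the actual page source isn't shown in this thread), and add_missing_spaces, LinkCollector and first_revision are names invented for the illustration. It assumes the problem is a double-quoted attribute value running straight into the next attribute name.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup standing in for the real page, which is not shown here.
SAMPLE = (
    '<a href="?dir=./64bit">64bit</a>\n'
    '<a href="revision1995"style="color:#000">revision1995</a>\n'
    '<a href="revision1994"style="color:#000">revision1994</a>\n'
)

# A double-quoted attribute value immediately followed by a letter,
# i.e. the next attribute name with no whitespace in between.
MISSING_SPACE = re.compile(r'(=\s*"[^"]*")(?=[A-Za-z])')


def add_missing_spaces(html):
    """Insert the space that the markup left out between attributes."""
    return MISSING_SPACE.sub(r'\1 ', html)


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def first_revision(html):
    """Return the first href starting with 'revision', or None."""
    parser = LinkCollector()
    parser.feed(add_missing_spaces(html))
    parser.close()
    for href in parser.hrefs:
        if href.startswith('revision'):
            return href
    return None


if __name__ == '__main__':
    print(first_revision(SAMPLE))
```

The regular expression is deliberately crude: it also touches text outside tags that happens to look like an attribute, so treat it as a workaround for this one kind of page rather than a general HTML repair. Separately, urlopen(url).read() returns bytes, and str(page) gives the "b'...'" representation rather than the text itself; decoding first, for example page.decode('utf-8', errors='replace') if the page is UTF-8 (an assumption), is likely what is wanted before calling feed().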


