From: David Robinow
Newsgroups: comp.lang.python
Subject: Re: Trying to understand html.parser.HTMLParser
Date: Sun, 15 May 2011 21:06:49 -0400

On Sun, May 15, 2011 at 4:45 PM, Andrew Berg wrote:
> I'm trying to understand why HTMLParser.feed() isn't returning the whole
> page. My test script is this:
>
> import urllib.request
> import html.parser
> class MyHTMLParser(html.parser.HTMLParser):
>     def handle_starttag(self, tag, attrs):
>         if tag == 'a' and attrs:
>             print(tag,'-',attrs)
>
> url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
> page = urllib.request.urlopen(url).read()
> parser = MyHTMLParser()
> parser.feed(str(page))
>
> I can do print(page) and get the entire HTML source, but
> parser.feed(str(page)) only spits out the information for the top links
> and none of the "revisionxxxx" links. Ultimately, I just want to find
> the name of the first "revisionxxxx" link (right now it's
> "revision1995", when a new build is uploaded it will be "revision2000"
> or whatever).
> I figure this is a relatively simple page; once I
> understand all of this, I can move on to more complicated pages.

You've got bad HTML. Look closely and you'll see that there's no space
between the "revisionxxxx" strings and the style tag following. The
parser doesn't like this. I don't know a solution other than fixing
the HTML. (I created a local copy, edited it, and it worked.)
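
If editing a local copy by hand isn't practical, the same fix could be
applied in the script before the text reaches the parser. What follows is
only a rough sketch under assumptions: SAMPLE is a made-up fragment that
imitates the missing space (the real page may differ), and
repair_attributes, LinkCollector and first_revision are hypothetical
helper names, not part of html.parser. The regular expression is a crude
heuristic and could also touch ordinary text that happens to look like an
attribute.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment imitating the problem: no whitespace between the
# closing quote of the href value and the attribute that follows it.
SAMPLE = (
    '<a href="?dir=./64bit">top link</a>\n'
    '<a href="revision1995"style="color:red">revision1995</a>\n'
    '<a href="revision1990"style="color:red">revision1990</a>\n'
)


def repair_attributes(markup):
    """Insert a space after a quoted value that runs into the next name."""
    # ="..." immediately followed by a letter is treated as a missing space.
    return re.sub(r'(="[^"]*")(?=[A-Za-z])', r'\1 ', markup)


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


def first_revision(markup):
    """Return the first href starting with 'revision', or None."""
    collector = LinkCollector()
    collector.feed(repair_attributes(markup))
    collector.close()
    for href in collector.hrefs:
        if href.startswith('revision'):
            return href
    return None


print(first_revision(SAMPLE))
```

With a live page, note that urlopen(url).read() returns bytes, and
str(page) on bytes yields the "b'...'" repr rather than the text, so the
bytes would need decoding first (for example page.decode('utf-8'),
assuming the page really is UTF-8) before being passed to
first_revision().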