Re: HTMLParser skipping HTML? [newbie]

From	Peter Otten <__peter__@web.de>
Subject	Re: HTMLParser skipping HTML? [newbie]
Date	Wed, 05 Sep 2012 15:54:43 +0200
Newsgroups	comp.lang.python

BobAalsma wrote:

> I'm trying to understand the HTMLParser so I've copied some code from
> http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser
> and tried that on my LinkedIn page.
> No errors, but some of the tags seem to go missing for no apparent reason
> - any advice?
> I have searched extensively for this, but seem to be the only one with
> missing data from HTMLParser :(
> 
> Code:
> import urllib2
> from HTMLParser import HTMLParser
> 
> from GetHttpFileContents import getHttpFileContents
> 
> # create a subclass and override the handler methods
> class MyHTMLParser(HTMLParser):
>         def handle_starttag(self, tag, attrs):
>                 print "Start tag:\n\t", tag
>                 for attr in attrs:
>                         print "\t\tattr:", attr
>                 # end for attr in attrs:
>         #
>         def handle_endtag(self, tag):
>                 print "End tag :\n\t", tag
>         #
>         def handle_data(self, data):
>                 if data != '\n\n':
>                         if data != '\n':
>                                 print "Data :\t\t", data
>                         # end if 1
>                 # end if 2

Please no! A kitten dies every time you write one of those comments ;)

> def removeHtmlFromFileContents():
>         TextOut = ''
> 
>         parser = MyHTMLParser()
>         parser.feed(urllib2.urlopen(
>         'http://nl.linkedin.com/in/bobaalsma').read())
> 
>         return TextOut
> #
> # ---------------------------------------------------------------------
> #
> if __name__ == '__main__':
>         TextOut = removeHtmlFromFileContents()


After removing 

> from GetHttpFileContents import getHttpFileContents

from your script I get the following output (using Python 2.7):

$ python parse_orig.py | grep meta -C2
        script
Start tag:
        meta
                attr: ('http-equiv', 'content-type')
                attr: ('content', 'text/html; charset=UTF-8')
Start tag:
        meta
                attr: ('http-equiv', 'X-UA-Compatible')
                attr: ('content', 'IE=8')
Start tag:
        meta
                attr: ('name', 'description')
                attr: ('content', 'Bekijk het (Nederland) professionele 
profiel van Bob Aalsma  op LinkedIn. LinkedIn is het grootste zakelijke 
netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne 
connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners 
vinden.')
Start tag:
        meta
                attr: ('name', 'pageImpressionID')
                attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
        meta
                attr: ('name', 'pageKey')
                attr: ('content', 'nprofile-public-success')
Start tag:
        meta
                attr: ('name', 'analyticsURL')
                attr: ('content', '/analytics/noauthtracker')
$ 

So there definitely are some meta tags. 
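
If you'd rather check this from Python than with grep, something along these lines should collect the meta tags into a list. This is only a sketch: the class name MetaCollector and the sample string are made up for illustration, and the import fallback is only there so the snippet also runs on Python 3, where the module is called html.parser.

```python
try:
    from HTMLParser import HTMLParser   # Python 2, as in the script above
except ImportError:
    from html.parser import HTMLParser  # Python 3 name of the same module


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "meta":
            self.metas.append(dict(attrs))


# Invented stand-in for the downloaded page, so no network access is needed
sample = ('<html><head>'
          '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
          '<meta name="pageKey" content="nprofile-public-success">'
          '</head><body><p>hello</p></body></html>')

collector = MetaCollector()
collector.feed(sample)
collector.close()
for meta in collector.metas:
    print(meta)
```

In the real script you would feed it the string returned by urllib2.urlopen(...).read() instead of the sample.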

Note that if you're logged in to a site, the HTML the browser is "seeing" 
may differ from the HTML you are retrieving via urllib2.urlopen(...).read(). 
Perhaps that is why you don't get what you expect.
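
One way to test that guess would be to save the page from the browser and compare its tags with what the script downloads. A rough sketch, assuming you have both versions available as strings; TagCounter, tag_counts and the two sample strings are invented for the example, and the import fallback is only there so it runs on Python 3 as well:

```python
from collections import Counter

try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TagCounter(HTMLParser):
    """Count how often each start tag occurs (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_counts(html):
    """Return a Counter mapping tag name to number of start tags in html."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Stand-ins for the HTML saved from the browser and the HTML fetched by the script
from_browser = '<html><body><div><p>a</p><p>b</p></div></body></html>'
from_script = '<html><body><p>a</p></body></html>'

# Counter subtraction keeps only tags that occur more often on the left side
missing = tag_counts(from_browser) - tag_counts(from_script)
for tag, n in sorted(missing.items()):
    print("%s: %d more in the browser version" % (tag, n))
```

If the difference is empty for the real pages, the tags are not going missing on the way in and the problem is somewhere else.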
