


Re: HTMLParser skipping HTML? [newbie]

From Peter Otten <__peter__@web.de>
Subject Re: HTMLParser skipping HTML? [newbie]
Date Wed, 05 Sep 2012 15:54:43 +0200
Newsgroups comp.lang.python
Message-ID <mailman.238.1346853305.27098.python-list@python.org>



BobAalsma wrote:

> I'm trying to understand the HTMLParser so I've copied some code from
> http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser
> and tried that on my LinkedIn page.
> No errors, but some of the tags seem to go missing for no apparent reason
> - any advice?
> I have searched extensively for this, but seem to be the only one with
> missing data from HTMLParser :(
> 
> Code:
> import urllib2
> from HTMLParser import HTMLParser
> 
> from GetHttpFileContents import getHttpFileContents
> 
> # create a subclass and override the handler methods
> class MyHTMLParser(HTMLParser):
>         def handle_starttag(self, tag, attrs):
>                 print "Start tag:\n\t", tag
>                 for attr in attrs:
>                         print "\t\tattr:", attr
>                 # end for attr in attrs:
>         #
>         def handle_endtag(self, tag):
>                 print "End tag :\n\t", tag
>         #
>         def handle_data(self, data):
>                 if data != '\n\n':
>                         if data != '\n':
>                                 print "Data :\t\t", data
>                         # end if 1
>                 # end if 2

Please no! A kitten dies every time you write one of those comments ;)

> def removeHtmlFromFileContents():
>         TextOut = ''
> 
>         parser = MyHTMLParser()
>         parser.feed(urllib2.urlopen(
>         'http://nl.linkedin.com/in/bobaalsma').read())
> 
>         return TextOut
> #
> # ---------------------------------------------------------------------
> #
> if __name__ == '__main__':
>         TextOut = removeHtmlFromFileContents()


After removing 

> from GetHttpFileContents import getHttpFileContents

from your script I get the following output (using Python 2.7):

$ python parse_orig.py | grep meta -C2
        script
Start tag:
        meta
                attr: ('http-equiv', 'content-type')
                attr: ('content', 'text/html; charset=UTF-8')
Start tag:
        meta
                attr: ('http-equiv', 'X-UA-Compatible')
                attr: ('content', 'IE=8')
Start tag:
        meta
                attr: ('name', 'description')
                attr: ('content', 'Bekijk het (Nederland) professionele profiel van Bob Aalsma  op LinkedIn. LinkedIn is het grootste zakelijke netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners vinden.')
Start tag:
        meta
                attr: ('name', 'pageImpressionID')
                attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
        meta
                attr: ('name', 'pageKey')
                attr: ('content', 'nprofile-public-success')
Start tag:
        meta
                attr: ('name', 'analyticsURL')
                attr: ('content', '/analytics/noauthtracker')
$ 

So there definitely are some meta tags. 
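
If grep is not at hand, the same check can be done from inside Python by
counting the start tags the parser reports. This is only a sketch, assuming
Python 3 (where the module is html.parser instead of HTMLParser). TagCounter
and count_tags are made-up names, and the sample page is invented, not the
real LinkedIn markup:

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: count how often each start tag is seen."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names converted to lower case.
        self.counts[tag] += 1


def count_tags(html):
    """Feed an HTML string to the parser and return the tag counts."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Invented sample page (assumption), standing in for the downloaded HTML.
sample = """<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="description" content="example">
<title>Example</title>
</head><body><p>Hello</p></body></html>"""

counts = count_tags(sample)
print(counts["meta"])
```

A count of zero for a tag you can see in the downloaded text would point at
the parser; a tag that is missing from the downloaded text itself would point
at the server sending different HTML.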

Note that if you're logged in to a site, the HTML the browser is "seeing" 
may differ from the HTML you are retrieving via urllib2.urlopen(...).read(). 
Perhaps that is the reason why you don't get what you expect.
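
One way to test that guess is to save the page from the browser and compare
the tags in it with the tags in what the script downloads. A rough sketch,
assuming Python 3; TagCollector and tag_set are made-up names, and the two
HTML strings are invented stand-ins (in practice one would come from a file
saved in the browser and the other from urlopen(...).read()):

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: remember the name of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tag_set(html):
    """Return the set of start tag names found in an HTML string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


# Invented examples (assumption): what a logged-in browser might receive
# versus what an anonymous script might receive.
browser_html = ("<html><body><div><span>Profile</span></div>"
                "<form></form></body></html>")
script_html = "<html><body><div>Please sign in</div></body></html>"

# Tags present in the browser's copy but absent from the script's copy.
only_in_browser = sorted(tag_set(browser_html) - tag_set(script_html))
print(only_in_browser)
```

If that list is not empty for the real pages, the tags are not being skipped
by the parser; they were never in the HTML the script received.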



Thread

HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-05 05:57 -0700
  Re: HTMLParser skipping HTML? [newbie] Peter Otten <__peter__@web.de> - 2012-09-05 15:54 +0200
  Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-05 10:23 -0700
    Re: HTMLParser skipping HTML? [newbie] Peter Otten <__peter__@web.de> - 2012-09-05 20:04 +0200
  Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-06 01:46 -0700
  Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-06 02:01 -0700
