| Started by | BobAalsma <overhaalsgang_24_bob@me.com> |
|---|---|
| First post | 2012-09-05 05:57 -0700 |
| Last post | 2012-09-06 02:01 -0700 |
| Articles | 6 — 2 participants |
HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-05 05:57 -0700
Re: HTMLParser skipping HTML? [newbie] Peter Otten <__peter__@web.de> - 2012-09-05 15:54 +0200
Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-05 10:23 -0700
Re: HTMLParser skipping HTML? [newbie] Peter Otten <__peter__@web.de> - 2012-09-05 20:04 +0200
Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-06 01:46 -0700
Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-06 02:01 -0700
| From | BobAalsma <overhaalsgang_24_bob@me.com> |
|---|---|
| Date | 2012-09-05 05:57 -0700 |
| Subject | HTMLParser skipping HTML? [newbie] |
| Message-ID | <80d8623b-bb08-415c-900b-4a56556435ae@googlegroups.com> |
I'm trying to understand the HTMLParser so I've copied some code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and tried that on my LinkedIn page.
No errors, but some of the tags seem to go missing for no apparent reason - any advice?
I have searched extensively for this, but seem to be the only one with missing data from HTMLParser :(
Code:
import urllib2
from HTMLParser import HTMLParser
from GetHttpFileContents import getHttpFileContents
# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Start tag:\n\t", tag
        for attr in attrs:
            print "\t\tattr:", attr
        # end for attr in attrs:
    #
    def handle_endtag(self, tag):
        print "End tag :\n\t", tag
    #
    def handle_data(self, data):
        if data != '\n\n':
            if data != '\n':
                print "Data :\t\t", data
            # end if 1
        # end if 2
    #
#
# ---------------------------------------------------------------------
#
def removeHtmlFromFileContents():
    TextOut = ''
    parser = MyHTMLParser()
    parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())
    return TextOut
#
# ---------------------------------------------------------------------
#
if __name__ == '__main__':
    TextOut = removeHtmlFromFileContents()
Part of the output:
End tag :
    script
Start tag:
    title
Data :        Bob Aalsma - Nederland | LinkedIn
End tag :
    title
Start tag:
    script
        attr: ('type', 'text/javascript')
        attr: ('src', 'http://www.linkedin.com/uas/authping?url=http%3A%2F%2Fnl%2Elinkedin%2Ecom%2Fin%2Fbobaalsma')
End tag :
    script
Start tag:
    link
        attr: ('rel', 'stylesheet')
        attr: ('type', 'text/css')
        attr: ('href', 'http://s3.licdn.com/scds/concat/common/css?h=5v4lkweptdvona6w56qelodrj-7pfvsr76gzb22ys278pbj80xm-b1io9ndljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
Start tag:
    script
        attr: ('type', 'text/javascript')
        attr: ('src', 'http://s4.licdn.com/scds/concat/common/js?h=7nhn6ycbvnz80dydsu88wbuk-1kjdwxpxv0c3z97afuz9dlr9g-dlsf699o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
End tag :
    script
End tag :
    head
But the source text for this is as follows (and all of the "<meta ...>" tags seem to go missing):
</script>
<title>Bob Aalsma | LinkedIn</title>
<link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
<link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/common/css?h=b1io9ndljf1bvpack85gyxhv4-6qrj4gxbwq8loasfnyfmyuphe-dhog2e5h8scik4whkpqccnzou-dmo1gwj6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
<meta name="UniqueID" content="24198692">
<meta name="SaveURL" content="/profile/view?id=24198692&authType=name&authToken=KhOG">
</head>
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-09-05 15:54 +0200 |
| Message-ID | <mailman.238.1346853305.27098.python-list@python.org> |
| In reply to | #28486 |
BobAalsma wrote:
> I'm trying to understand the HTMLParser so I've copied some code from
> http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and
> tried that on my LinkedIn page.
> No errors, but some of the tags seem to go missing for no apparent reason
> - any advice?
> I have searched extensively for this, but seem to be the only one with
> missing data from HTMLParser :(
>
> Code:
> import urllib2
> from HTMLParser import HTMLParser
>
> from GetHttpFileContents import getHttpFileContents
>
> # create a subclass and override the handler methods
> class MyHTMLParser(HTMLParser):
>     def handle_starttag(self, tag, attrs):
>         print "Start tag:\n\t", tag
>         for attr in attrs:
>             print "\t\tattr:", attr
>         # end for attr in attrs:
>     #
>     def handle_endtag(self, tag):
>         print "End tag :\n\t", tag
>     #
>     def handle_data(self, data):
>         if data != '\n\n':
>             if data != '\n':
>                 print "Data :\t\t", data
>             # end if 1
>         # end if 2
Please no! A kitten dies every time you write one of those comments ;)
> def removeHtmlFromFileContents():
>     TextOut = ''
>
>     parser = MyHTMLParser()
>     parser.feed(urllib2.urlopen(
>         'http://nl.linkedin.com/in/bobaalsma').read())
>
>     return TextOut
> #
> # ---------------------------------------------------------------------
> #
> if __name__ == '__main__':
>     TextOut = removeHtmlFromFileContents()
After removing
> from GetHttpFileContents import getHttpFileContents
from your script I get the following output (using Python 2.7):
$ python parse_orig.py | grep meta -C2
    script
Start tag:
    meta
        attr: ('http-equiv', 'content-type')
        attr: ('content', 'text/html; charset=UTF-8')
Start tag:
    meta
        attr: ('http-equiv', 'X-UA-Compatible')
        attr: ('content', 'IE=8')
Start tag:
    meta
        attr: ('name', 'description')
        attr: ('content', 'Bekijk het (Nederland) professionele profiel van Bob Aalsma op LinkedIn. LinkedIn is het grootste zakelijke netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners vinden.')
Start tag:
    meta
        attr: ('name', 'pageImpressionID')
        attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
    meta
        attr: ('name', 'pageKey')
        attr: ('content', 'nprofile-public-success')
Start tag:
    meta
        attr: ('name', 'analyticsURL')
        attr: ('content', '/analytics/noauthtracker')
$
So there definitely are some meta tags.
Note that if you're logged into a site, the HTML the browser is "seeing"
may differ from the HTML you are retrieving via urllib2.urlopen(...).read().
Perhaps that is the reason why you don't get what you expect.
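One way to separate "the parser skips tags" from "the server sent different HTML" is to feed the parser a fixed string instead of a live page. The following is only a minimal sketch, assuming Python 3's html.parser (the thread itself uses Python 2's HTMLParser module); the class name MetaCollector and the SAMPLE string are made up for illustration, with the markup taken from the source quoted in the first post.

```python
# Python 3 sketch; the thread's own code targets Python 2.
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "meta":
            self.metas.append(dict(attrs))


# Fixed input, based on the page source quoted in the first post.
SAMPLE = """
<title>Bob Aalsma | LinkedIn</title>
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="UniqueID" content="24198692">
</head>
"""

collector = MetaCollector()
collector.feed(SAMPLE)
collector.close()
for meta in collector.metas:
    print(meta["name"], "=", meta["content"])
```

If every meta tag in the fixed string is reported, the parser is probably not the culprit and the difference most likely lies in what the server returned.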
| From | BobAalsma <overhaalsgang_24_bob@me.com> |
|---|---|
| Date | 2012-09-05 10:23 -0700 |
| Message-ID | <0ac33349-4938-41a6-a129-05d676a5819f@googlegroups.com> |
| In reply to | #28486 |
Hmm, OK, Peter, thanks. I didn't consider the effect of logging in, that could certainly be a reason. So how could I have the script log in?
[Didn't understand the bit about the kittens, though. How about that?]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-09-05 20:04 +0200 |
| Message-ID | <mailman.260.1346868255.27098.python-list@python.org> |
| In reply to | #28530 |
BobAalsma wrote:
> [Didn't understand the bit about the kittens, though. How about that?]
>
> Oops, sorry, found that bit about logging in - asked too soon; still
> wonder about the kittens ;)
I just wanted to tell you not to mark the end of an if-suite with an "# end if" comment. As soon as you become familiar with the language that will look like noise that detracts from the actual code.
In an attempt to make this advice appear less patronizing I wrapped it into a lame joke by alluding to
http://en.wikipedia.org/wiki/Every_time_you_masturbate..._God_kills_a_kitten
Sorry for the confusion -- I hope you aren't offended.
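For illustration, here is roughly how the handle_data method from the original post might look without the end-of-block comments. This is a sketch under assumptions, not the original author's code: it uses Python 3's html.parser, the class name DataPrinter and the chunks list are invented for the example, and the two nested ifs are folded into one membership test that keeps the same filter.

```python
from html.parser import HTMLParser


class DataPrinter(HTMLParser):
    """Hypothetical variant of the thread's handler, without '# end if' markers."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Same filter as the two nested ifs: skip chunks that are exactly
        # one or two newlines, print everything else.
        if data not in ("\n", "\n\n"):
            self.chunks.append(data)
            print("Data :\t\t", data)


printer = DataPrinter()
printer.feed("<h1>Bob Aalsma</h1>\n<p>Project Manager</p>\n\n")
printer.close()
```

The indentation alone shows where each suite ends, which is the point of the advice above.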
| From | BobAalsma <overhaalsgang_24_bob@me.com> |
|---|---|
| Date | 2012-09-06 01:46 -0700 |
| Message-ID | <90f4a3c6-ee1d-4d28-a0a1-fdfa6657e944@googlegroups.com> |
| In reply to | #28486 |
No offense and thanks for the reminder.
My background is in 3GL software packages, where different platforms mean different editors, which sometimes makes it difficult to recognize the end of blocks, especially when nested.
No need for that here, no.
I think it also means I'm still not really satisfied with my commenting in Python...
| From | BobAalsma <overhaalsgang_24_bob@me.com> |
|---|---|
| Date | 2012-09-06 02:01 -0700 |
| Message-ID | <63adb7c0-f558-4f21-a49c-729bdb4e536a@googlegroups.com> |
| In reply to | #28486 |
I can see that my Tester is not logging in: the reply from the site reads "<title>Sign In | LinkedIn</title>" rather than "<title>Bob Aalsma | LinkedIn</title>".
How can I tell which part is not correct?
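A small check along these lines could at least confirm which page the script received before any further parsing. It is a minimal sketch, assuming Python 3's html.parser; TitleExtractor, page_title and looks_like_login_page are hypothetical names, and the "Sign In" prefix is taken only from the title quoted above, so it may not hold for other pages or later versions of the site.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Hypothetical helper: record the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def page_title(html):
    """Return the text of the first <title>, or None if there is none."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title


def looks_like_login_page(html):
    # Assumption: the login page title starts with "Sign In", as quoted above.
    title = page_title(html) or ""
    return title.strip().startswith("Sign In")
```

Calling looks_like_login_page on the fetched HTML before feeding it to the main parser would show early whether the request was treated as logged in; it does not by itself say why the login failed.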