HTMLParser skipping HTML? [newbie]

Started by	BobAalsma <overhaalsgang_24_bob@me.com>
First post	2012-09-05 05:57 -0700
Last post	2012-09-06 02:01 -0700
Articles	6 — 2 participants

  HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-05 05:57 -0700
    Re: HTMLParser skipping HTML? [newbie] Peter Otten <__peter__@web.de> - 2012-09-05 15:54 +0200
    Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-05 10:23 -0700
      Re: HTMLParser skipping HTML? [newbie] Peter Otten <__peter__@web.de> - 2012-09-05 20:04 +0200
    Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-06 01:46 -0700
    Re: HTMLParser skipping HTML? [newbie] BobAalsma <overhaalsgang_24_bob@me.com> - 2012-09-06 02:01 -0700

#28486 — HTMLParser skipping HTML? [newbie]

From	BobAalsma <overhaalsgang_24_bob@me.com>
Date	2012-09-05 05:57 -0700
Subject	HTMLParser skipping HTML? [newbie]
Message-ID	<80d8623b-bb08-415c-900b-4a56556435ae@googlegroups.com>

I'm trying to understand the HTMLParser so I've copied some code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and tried that on my LinkedIn page.
No errors, but some of the tags seem to go missing for no apparent reason - any advice?
I have searched extensively for this, but seem to be the only one with missing data from HTMLParser :(

Code:
import urllib2
from HTMLParser import HTMLParser

from GetHttpFileContents import getHttpFileContents

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
	def handle_starttag(self, tag, attrs):
		print "Start tag:\n\t", tag
		for attr in attrs:
			print "\t\tattr:", attr
		# end for attr in attrs:
	#
	def handle_endtag(self, tag):
		print "End tag :\n\t", tag
	#
	def handle_data(self, data):
		if data != '\n\n':
			if data != '\n':
				print "Data :\t\t", data
			# end if 1
		# end if 2
	#
#
# ---------------------------------------------------------------------
#
def removeHtmlFromFileContents():
	TextOut = ''

	parser = MyHTMLParser()
	parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())

	return TextOut
#
# ---------------------------------------------------------------------
#
if __name__ == '__main__':
	TextOut = removeHtmlFromFileContents()





Part of the output:
End tag :
	script
Start tag:
	title
Data :		Bob Aalsma - Nederland | LinkedIn
End tag :
	title
Start tag:
	script
		attr: ('type', 'text/javascript')
		attr: ('src', 'http://www.linkedin.com/uas/authping?url=http%3A%2F%2Fnl%2Elinkedin%2Ecom%2Fin%2Fbobaalsma')
End tag :
	script
Start tag:
	link
		attr: ('rel', 'stylesheet')
		attr: ('type', 'text/css')
		attr: ('href', 'http://s3.licdn.com/scds/concat/common/css?h=5v4lkweptdvona6w56qelodrj-7pfvsr76gzb22ys278pbj80xm-b1io9ndljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
Start tag:
	script
		attr: ('type', 'text/javascript')
		attr: ('src', 'http://s4.licdn.com/scds/concat/common/js?h=7nhn6ycbvnz80dydsu88wbuk-1kjdwxpxv0c3z97afuz9dlr9g-dlsf699o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
End tag :
	script
End tag :
	head



But the source text for this is as follows (and all of the "<meta ...>" tags seem to go missing):
</script>
<title>Bob Aalsma | LinkedIn</title>
<link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
<link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/common/css?h=b1io9ndljf1bvpack85gyxhv4-6qrj4gxbwq8loasfnyfmyuphe-dhog2e5h8scik4whkpqccnzou-dmo1gwj6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
<meta name="UniqueID" content="24198692">
<meta name="SaveURL" content="/profile/view?id=24198692&amp;authType=name&amp;authToken=KhOG">
</head>

#28496

From	Peter Otten <__peter__@web.de>
Date	2012-09-05 15:54 +0200
Message-ID	<mailman.238.1346853305.27098.python-list@python.org>
In reply to	#28486

BobAalsma wrote:

> I'm trying to understand the HTMLParser so I've copied some code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and tried that on my LinkedIn page.
> No errors, but some of the tags seem to go missing for no apparent reason - any advice?
> I have searched extensively for this, but seem to be the only one with missing data from HTMLParser :(
> 
> Code:
> import urllib2
> from HTMLParser import HTMLParser
> 
> from GetHttpFileContents import getHttpFileContents
> 
> # create a subclass and override the handler methods
> class MyHTMLParser(HTMLParser):
>         def handle_starttag(self, tag, attrs):
>                 print "Start tag:\n\t", tag
>                 for attr in attrs:
>                         print "\t\tattr:", attr
>                 # end for attr in attrs:
>         #
>         def handle_endtag(self, tag):
>                 print "End tag :\n\t", tag
>         #
>         def handle_data(self, data):
>                 if data != '\n\n':
>                         if data != '\n':
>                                 print "Data :\t\t", data
>                         # end if 1
>                 # end if 2

Please no! A kitten dies every time you write one of those comments ;)

> def removeHtmlFromFileContents():
>         TextOut = ''
> 
>         parser = MyHTMLParser()
>         parser.feed(urllib2.urlopen(
>         'http://nl.linkedin.com/in/bobaalsma').read())
> 
>         return TextOut
> #
> # ---------------------------------------------------------------------
> #
> if __name__ == '__main__':
>         TextOut = removeHtmlFromFileContents()


After removing 

> from GetHttpFileContents import getHttpFileContents

from your script I get the following output (using Python 2.7):

$ python parse_orig.py | grep meta -C2
        script
Start tag:
        meta
                attr: ('http-equiv', 'content-type')
                attr: ('content', 'text/html; charset=UTF-8')
Start tag:
        meta
                attr: ('http-equiv', 'X-UA-Compatible')
                attr: ('content', 'IE=8')
Start tag:
        meta
                attr: ('name', 'description')
                attr: ('content', 'Bekijk het (Nederland) professionele profiel van Bob Aalsma  op LinkedIn. LinkedIn is het grootste zakelijke netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners vinden.')
Start tag:
        meta
                attr: ('name', 'pageImpressionID')
                attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
        meta
                attr: ('name', 'pageKey')
                attr: ('content', 'nprofile-public-success')
Start tag:
        meta
                attr: ('name', 'analyticsURL')
                attr: ('content', '/analytics/noauthtracker')
$ 

So there definitely are some meta tags. 

Note that if you're logged in to a site, the HTML the browser is "seeing" 
may differ from the HTML you are retrieving via urllib2.urlopen(...).read(). 
Perhaps that is the reason why you don't get what you expect.
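
One way to check this is to compare the meta tags in the page source saved from the browser with those in the text the script actually fetched. The following is only a minimal sketch, assuming Python 3 (where the module is html.parser rather than HTMLParser); the names MetaCollector and meta_names and the two sample strings are made up for illustration and stand in for the real pages.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the name (or http-equiv) of every <meta> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.meta_names = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            # A <meta> tag is identified by either "name" or "http-equiv".
            self.meta_names.append(attrs.get("name") or attrs.get("http-equiv"))


def meta_names(html):
    """Return the identifiers of all <meta> tags found in an HTML string."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta_names


# Hypothetical stand-ins for "what the browser shows" and "what the script fetched".
browser_html = ('<head><meta name="ShortTitle" content="Bob Aalsma">'
                '<meta name="UniqueID" content="24198692"></head>')
fetched_html = '<head><meta name="pageKey" content="nprofile-public-success"></head>'

# Names present in the browser's copy but absent from the fetched copy.
missing = sorted(set(meta_names(browser_html)) - set(meta_names(fetched_html)))
print(missing)
```

If the list is not empty, the parser is probably fine and the two documents simply differ.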

#28530

From	BobAalsma <overhaalsgang_24_bob@me.com>
Date	2012-09-05 10:23 -0700
Message-ID	<0ac33349-4938-41a6-a129-05d676a5819f@googlegroups.com>
In reply to	#28486

Hmm, OK, Peter, thanks. I didn't consider the effect of logging in; that could certainly be the reason. So how could I have the script log in?

[Didn't understand the bit about the kittens, though. How about that?]

#28532

From	Peter Otten <__peter__@web.de>
Date	2012-09-05 20:04 +0200
Message-ID	<mailman.260.1346868255.27098.python-list@python.org>
In reply to	#28530

BobAalsma wrote:

> [Didn't understand the bit about the kittens, though. How about that?]
> 
> Oops, sorry, found that bit about logging in - asked too soon; still
> wonder about the kittens ;)

I just wanted to tell you not to mark the end of an if-suite with an "# end 
if" comment. As soon as you become familiar with the language that will look 
like noise that detracts from the actual code.
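
For illustration only, here is how the data handler might look without the end markers. This is a sketch assuming Python 3 (html.parser instead of HTMLParser), and it collects the data in a list called chunks (a name made up here) rather than printing it, so the result is easy to inspect.

```python
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []  # hypothetical attribute: text seen so far

    def handle_data(self, data):
        # One condition replaces the two nested ifs; the indentation
        # alone shows where the block ends.
        if data not in ("\n", "\n\n"):
            self.chunks.append(data)


parser = MyHTMLParser()
parser.feed("<h1>Bob Aalsma</h1>\n\n<p>hi</p>\n<p>there</p>")
parser.close()
print(parser.chunks)
```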

In an attempt to make this advice appear less patronizing I wrapped it in 
a lame joke by alluding to

http://en.wikipedia.org/wiki/Every_time_you_masturbate..._God_kills_a_kitten

Sorry for the confusion -- I hope you aren't offended.

#28558

From	BobAalsma <overhaalsgang_24_bob@me.com>
Date	2012-09-06 01:46 -0700
Message-ID	<90f4a3c6-ee1d-4d28-a0a1-fdfa6657e944@googlegroups.com>
In reply to	#28486

No offense and thanks for the reminder.
My background is software packages in 3GL, where different platforms mean different editors, which means it is sometimes difficult to recognize the end of blocks, especially when nested.
No need for that here, no.
I think it also means I'm still not really satisfied with my commenting in Python...

#28560

From	BobAalsma <overhaalsgang_24_bob@me.com>
Date	2012-09-06 02:01 -0700
Message-ID	<63adb7c0-f558-4f21-a49c-729bdb4e536a@googlegroups.com>
In reply to	#28486

I can see that my Tester is not logging in: the reply from the site reads "<title>Sign In | LinkedIn</title>" rather than "<title>Bob Aalsma | LinkedIn</title>".
How can I tell which part is not correct?
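
The symptom itself, a "Sign In" title where the profile title was expected, can at least be tested for before any further parsing. A minimal sketch, assuming Python 3 (html.parser) and assuming the sign-in page can be recognised by its title alone; TitleGrabber, page_title and looks_like_sign_in are names made up for this example. It only detects that the sign-in page came back; it does not show which part of the login is failing.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Remember the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html):
    """Return the stripped <title> text of an HTML string ('' if none)."""
    parser = TitleGrabber()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def looks_like_sign_in(html):
    # Assumption: the sign-in page's title starts with "Sign In".
    return page_title(html).startswith("Sign In")


reply = "<html><head><title>Sign In | LinkedIn</title></head></html>"
print(page_title(reply), looks_like_sign_in(reply))
```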
