Re: Trying to understand html.parser.HTMLParser

Date	2011-05-16 20:05 -0500
From	Andrew Berg <bahamutzero8825@gmail.com>
Subject	Re: Trying to understand html.parser.HTMLParser
References	<4DD03B69.6050301@gmail.com> <BANLkTikcc6wVX+aO7KATa8AK1BJJKN5kMw@mail.gmail.com> <4DD0D1A2.6060109@free.fr>
Newsgroups	comp.lang.python
Message-ID	<mailman.1653.1305594331.9059.python-list@python.org>


On 2011.05.16 02:26 AM, Karim wrote:
> Use regular expressions for bad HTML, or BeautifulSoup (google it). Below
> is an example to extract all HTML links:
>
> import re
>
> # Capture everything between "<a href=" and the closing ">" of the tag.
> linksList = re.findall(r'<a href=(.*?)>.*?</a>', htmlSource)
> for link in linksList:
>     print(link)
I was afraid I might have to use regexes (mostly because I could never
understand them).
Even the BeautifulSoup website itself admits it's awful with Python 3 -
only the admittedly broken 3.1.0 will work with Python 3 at all.
ElementTree doesn't seem to have been updated in a long time, so I'll
assume it won't work with Python 3.
lxml looks promising, but it doesn't say anywhere whether it'll work on
Python 3 or not, which is puzzling since the latest release was only a
couple months ago.
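
For what it's worth, the link extraction quoted above can probably also be
done with html.parser.HTMLParser itself, with no regex and no third-party
library. This is only a rough sketch for Python 3: the names LinkCollector
and extract_links are made up for illustration, and I am assuming the markup
is well-formed enough for the standard-library parser to find the tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lowercase;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_source):
    """Return the href of every <a> tag found in html_source."""
    parser = LinkCollector()
    parser.feed(html_source)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="http://example.com/a">A</a> <A HREF=b.html>B</A></p>'
    for link in extract_links(sample):
        print(link)
```

Unlike the regex, this hands back the attribute value without the
surrounding quotes, and it does not care about attribute order or the case
of the tag name.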

Actually, if I'm going to use regex, I might as well try to implement
Versions* in Python.

Thanks for the answers!

*http://en.totalcmd.pl/download/wfx/net/Versions (original, made for
Total Commander) and
https://addons.mozilla.org/en-US/firefox/addon/versions-wfx_versions/
(clone implemented as a Firefox add-on; it's so wonderful, I even wrote
the docs for it!)
