| Date | 2011-05-16 20:05 -0500 |
|---|---|
| From | Andrew Berg <bahamutzero8825@gmail.com> |
| Subject | Re: Trying to understand html.parser.HTMLParser |
| References | <4DD03B69.6050301@gmail.com> <BANLkTikcc6wVX+aO7KATa8AK1BJJKN5kMw@mail.gmail.com> <4DD0D1A2.6060109@free.fr> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1653.1305594331.9059.python-list@python.org> |
On 2011.05.16 02:26 AM, Karim wrote:
> Use regular expressions for bad HTML, or BeautifulSoup (google it). Below
> is an example that extracts all HTML links:
>
> import re
> linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
> for link in linksList:
>     print(link)
I was afraid I might have to use regexes (mostly because I could never
understand them).
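For comparison, here is a minimal sketch of the same link extraction done without regexes, using the standard library's html.parser.HTMLParser (the class this thread is about). The names LinkCollector and extract_links and the sample markup are made up for illustration, and this is only one possible approach, not a tested replacement for the regex above:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lowercase;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_source):
    """Return the href values found in html_source, in document order."""
    parser = LinkCollector()
    parser.feed(html_source)
    parser.close()
    return parser.links


# Illustrative input only; real pages may be messier than this.
sample = '<p><a href="http://example.com/">one</a> <A HREF="/two">two</A></p>'
for link in extract_links(sample):
    print(link)
```

Unlike the regex, this returns the attribute value without the surrounding quotes and does not care about the case of the tag or about other attributes appearing before href.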
Even the BeautifulSoup website itself admits it's awful with Python 3 -
only the admittedly broken 3.1.0 will work with Python 3 at all.
ElementTree doesn't seem to have been updated in a long time, so I'll
assume it won't work with Python 3.
lxml looks promising, but it doesn't say anywhere whether it'll work on
Python 3 or not, which is puzzling since the latest release was only a
couple months ago.
Actually, if I'm going to use regex, I might as well try to implement
Versions* in Python.
Thanks for the answers!
*http://en.totalcmd.pl/download/wfx/net/Versions (original, made for
Total Commander) and
https://addons.mozilla.org/en-US/firefox/addon/versions-wfx_versions/
(clone implemented as a Firefox add-on; it's so wonderful, I even wrote
the docs for it!)