Groups > comp.lang.python > #35800 > unrolled thread
| Started by | Peter Otten <__peter__@web.de> |
|---|---|
| First post | 2012-12-30 11:18 +0100 |
| Last post | 2012-12-30 11:18 +0100 |
| Articles | 1 — 1 participant |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Noob trying to parse bad HTML using xml.etree.ElementTree Peter Otten <__peter__@web.de> - 2012-12-30 11:18 +0100
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-12-30 11:18 +0100 |
| Subject | Re: Noob trying to parse bad HTML using xml.etree.ElementTree |
| Message-ID | <mailman.1462.1356862716.29569.python-list@python.org> |
Morten Guldager wrote:

> 'Aloha Friends!
>
> I'm trying to process some HTML using xml.etree.ElementTree
> Problem is that the HTML I'm trying to read have some not properly closed
> tags, as the <img> shown in line 8 below.
>
>  1 from xml.etree import ElementTree
>  2
>  3 tree = ElementTree
>  4 e = tree.fromstring(
>  5 """
>  6 <html>
>  7 <body>
>  8 <img src='mogul.jpg'>
>  9 </body>
> 10 </html>
> 11 """)
>
> Python whines: xml.etree.ElementTree.ParseError: mismatched tag: line 5,
> column 14
>
> I definitely do want to work DOM style, having the whole shebang loaded
> into a nice structure before I start the real work.
>
> Question is if it's possible to tweak xml.etree.ElementTree to accept, and
> understand sloppy html, or if you have suggestions for similar easy to use
> framework, preferably among the included batteries?

The <img> tag doesn't have a closing counterpart in HTML. That implies that
valid HTML isn't valid XML and that you cannot use xml.etree with HTML.

While it is not in the standard library a good alternative for XML that can
deal with HTML, too, is lxml. See <http://lxml.de/lxmlhtml.html>. It also
provides a way to cope with really broken html, modeled after BeautifulSoup.
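To illustrate the point about <img>, here is a minimal sketch, not a definitive recipe. It feeds the quoted snippet to xml.etree twice, once as posted and once with the tag self-closed, and then tries lxml.html if it happens to be installed. The names HTML_SNIPPET, parse_as_xml and image_sources_with_lxml are made up for this example, and the lxml part assumes the third-party lxml package is available.

```python
from xml.etree import ElementTree

# The markup from the quoted post, without the line numbers.
HTML_SNIPPET = """
<html>
<body>
<img src='mogul.jpg'>
</body>
</html>
"""


def parse_as_xml(text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ElementTree.fromstring(text)
    except ElementTree.ParseError:
        return None


# As posted: <img> is never closed, so the XML parser rejects the document.
html_root = parse_as_xml(HTML_SNIPPET)

# Self-closing the tag turns the same markup into well-formed XML.
xml_root = parse_as_xml(HTML_SNIPPET.replace("'mogul.jpg'>", "'mogul.jpg'/>"))


def image_sources_with_lxml(text):
    """Collect img src values via lxml.html; return None if lxml is missing."""
    try:
        import lxml.html  # third-party, not part of the standard library
    except ImportError:
        return None
    doc = lxml.html.fromstring(text)
    return [img.get("src") for img in doc.iter("img")]
```

Here html_root ends up as None while xml_root is a normal Element, which is the "valid HTML isn't valid XML" problem in miniature. With lxml installed, image_sources_with_lxml(HTML_SNIPPET) should cope with the unclosed tag and still give a tree that can be walked DOM style.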