Re: Noob trying to parse bad HTML using xml.etree.ElementTree

Started by: Peter Otten <__peter__@web.de>
First post: 2012-12-30 11:18 +0100
Last post: 2012-12-30 11:18 +0100
Articles: 1 (1 participant)


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


#35800 — Re: Noob trying to parse bad HTML using xml.etree.ElementTree

From: Peter Otten <__peter__@web.de>
Date: 2012-12-30 11:18 +0100
Subject: Re: Noob trying to parse bad HTML using xml.etree.ElementTree
Message-ID: <mailman.1462.1356862716.29569.python-list@python.org>

Morten Guldager wrote:

> Aloha Friends!
> 
> I'm trying to process some HTML using xml.etree.ElementTree.
> The problem is that the HTML I'm trying to read has some improperly closed
> tags, such as the <img> shown in line 8 below.
> 
>   1 from xml.etree import ElementTree
>   2
>   3 tree = ElementTree
>   4 e = tree.fromstring(
>   5     """
>   6         <html>
>   7             <body>
>   8                 <img src='mogul.jpg'>
>   9             </body>
>  10         </html>
>  11     """)
> 
> Python whines: xml.etree.ElementTree.ParseError: mismatched tag: line 5,
> column 14
> 
> I definitely do want to work DOM style, having the whole shebang loaded
> into a nice structure before I start the real work.
> 
> The question is whether it's possible to tweak xml.etree.ElementTree to
> accept and understand sloppy HTML, or whether you have suggestions for a
> similarly easy-to-use framework, preferably among the included batteries?

In HTML the <img> tag has no closing counterpart, whereas XML requires every
element to be closed. So valid HTML is not necessarily well-formed XML, and
you cannot rely on xml.etree, which is a strict XML parser, to read HTML.
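
To make the difference concrete, here is a minimal sketch using only the
standard library. The names HTML_STYLE, XML_STYLE and try_parse are made up
for illustration. It feeds the same document to xml.etree twice, once with the
HTML-style unclosed <img> and once with an XML-style self-closing <img/>.

```python
from xml.etree import ElementTree

# Same document twice: HTML-style unclosed <img>, then XML-style <img/>.
HTML_STYLE = "<html><body><img src='mogul.jpg'></body></html>"
XML_STYLE = "<html><body><img src='mogul.jpg'/></body></html>"


def try_parse(markup):
    """Return the root element, or the error message if parsing fails."""
    try:
        return ElementTree.fromstring(markup)
    except ElementTree.ParseError as exc:
        return str(exc)


# The unclosed <img> is rejected with a "mismatched tag" error message.
print(try_parse(HTML_STYLE))

# The self-closing form is well-formed XML and parses into a tree.
root = try_parse(XML_STYLE)
print(root.find("body/img").get("src"))
```

Only the second form parses, which is why the original snippet fails at the
</body> line: the parser is still waiting for a closing </img>.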

lxml is not in the standard library, but it is a good alternative to xml.etree
that can deal with HTML as well as XML. See <http://lxml.de/lxmlhtml.html>.
It also provides a way to cope with really broken HTML, modeled after
BeautifulSoup.
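
As a rough sketch of what that could look like for the snippet above, assuming
lxml is installed (for example via pip): lxml.html.fromstring parses the
markup leniently and returns an element with an ElementTree-like API. The
names SLOPPY, img_sources and lxml_html are made up for this example, and the
import guard is there only so the script fails with a clear message when lxml
is missing.

```python
# lxml is a third-party package, not part of the standard library.
try:
    from lxml import html as lxml_html
except ImportError:
    lxml_html = None

# The markup from the original question, with its unclosed <img> tag.
SLOPPY = """
    <html>
        <body>
            <img src='mogul.jpg'>
        </body>
    </html>
"""


def img_sources(markup):
    """Return the src attribute of every <img> in possibly sloppy HTML."""
    if lxml_html is None:
        raise RuntimeError("lxml is not installed")
    root = lxml_html.fromstring(markup)  # lenient HTML parser
    return [img.get("src") for img in root.iter("img")]


if lxml_html is not None:
    print(img_sources(SLOPPY))
```

The whole document is loaded into a tree first, so this keeps the DOM-style
workflow asked for in the question.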
