
Noob trying to parse bad HTML using xml.etree.ElementTree

Started by	Morten Guldager <morten.guldager@gmail.com>
First post	2012-12-30 10:52 +0100
Last post	2012-12-30 10:52 +0100
Articles	1 — 1 participant


#35798 — Noob trying to parse bad HTML using xml.etree.ElementTree

From	Morten Guldager <morten.guldager@gmail.com>
Date	2012-12-30 10:52 +0100
Subject	Noob trying to parse bad HTML using xml.etree.ElementTree
Message-ID	<mailman.1460.1356861162.29569.python-list@python.org>


Aloha Friends!

I'm trying to process some HTML using xml.etree.ElementTree.
The problem is that the HTML I'm trying to read has some tags that are not
properly closed, such as the <img> shown in line 8 below.

  1 from xml.etree import ElementTree
  2
  3 tree = ElementTree
  4 e = tree.fromstring(
  5     """
  6         <html>
  7             <body>
  8                 <img src='mogul.jpg'>
  9             </body>
 10         </html>
 11     """)

Python whines: xml.etree.ElementTree.ParseError: mismatched tag: line 5,
column 14

I definitely do want to work DOM style, having the whole shebang loaded
into a nice structure before I start the real work.

The question is whether it's possible to tweak xml.etree.ElementTree to accept
and understand sloppy HTML, or whether you have suggestions for a similarly
easy-to-use framework, preferably among the included batteries?


-- 
/Morten %-)
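
One possible batteries-only route, offered as a rough sketch rather than an
established recipe: html.parser.HTMLParser tolerates unclosed tags, and its
events can be fed into xml.etree.ElementTree.TreeBuilder to end up with an
ordinary Element tree. The names LenientTreeParser, parse_html and VOID_TAGS
are hypothetical, made up for this sketch. The handling of stray or missing
end tags is a simplification, not the HTML5 parsing algorithm, and input with
more than one top-level element is not handled.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree

# HTML "void" elements never take a closing tag (assumed list for this sketch).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LenientTreeParser(HTMLParser):
    """Hypothetical helper: turns HTMLParser events into an ElementTree tree."""

    def __init__(self):
        super().__init__()
        self._builder = ElementTree.TreeBuilder()
        self._open = []  # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value); value is None for bare attributes.
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)  # close <img>, <br>, ... right away
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag: ignore it
        # Close any still-open inner elements, then the requested one.
        while self._open:
            name = self._open.pop()
            self._builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        if self._open:  # drop text that sits outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open:  # close whatever the document left open
            self._builder.end(self._open.pop())
        return self._builder.close()


def parse_html(text):
    """Parse sloppy HTML and return the root Element."""
    parser = LenientTreeParser()
    parser.feed(text)
    return parser.close()


root = parse_html("""
    <html>
        <body>
            <img src='mogul.jpg'>
        </body>
    </html>
""")
print(root.find("body/img").get("src"))  # mogul.jpg
```

The result is a normal Element, so find(), iter() and friends work as usual.
For badly mangled markup a dedicated third-party HTML parser is probably the
more robust choice.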
