From: Morten Guldager
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Noob trying to parse bad HTML using xml.etree.ElementTree
Date: Sun, 30 Dec 2012 10:52:33 +0100

Aloha Friends!

I'm trying to process some HTML using xml.etree.ElementTree.
The problem is that the HTML I'm trying to read has some tags that are not properly closed, such as the <img> shown in line 8 below.

  1 from xml.etree import ElementTree
  2
  3 tree = ElementTree
  4 e = tree.fromstring(
  5     """
  6         <html>
  7             <body>
  8                 <img src='mogul.jpg'>
  9             </body>
 10         </html>
 11     """)

Python whines: xml.etree.ElementTree.ParseError: mismatched tag: line 5, column 14
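
To make the failure easy to reproduce, here is a minimal sketch that parses the same markup twice: once as posted, and once with the <img> written as a self-closing tag. The names `BAD`, `GOOD` and `try_parse` are made up for this illustration. It only shows that the unclosed <img> is what trips the XML parser; it does not solve the sloppy-HTML problem.

```python
from xml.etree import ElementTree

# The markup from the post, with the unclosed <img>.
BAD = """
        <html>
            <body>
                <img src='mogul.jpg'>
            </body>
        </html>
    """

# Same markup, but with <img> self-closed so it is well-formed XML.
GOOD = BAD.replace("<img src='mogul.jpg'>", "<img src='mogul.jpg'/>")


def try_parse(markup):
    """Return the root element, or the error message if parsing fails."""
    try:
        return ElementTree.fromstring(markup)
    except ElementTree.ParseError as exc:
        return str(exc)


bad_result = try_parse(BAD)    # a "mismatched tag" error message
good_result = try_parse(GOOD)  # an Element whose tag is "html"

print(bad_result)
print(good_result.tag)
```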

I definitely do want to work DOM style, having the whole shebang loaded into a nice structure before I start the real work.

The question is whether it's possible to tweak xml.etree.ElementTree to accept and understand sloppy HTML, or whether you have suggestions for a similar easy-to-use framework, preferably among the included batteries?
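
One direction that might fit the "included batteries" wish, sketched here under assumptions rather than offered as a known-good answer: let the standard library's html.parser.HTMLParser do the tolerant tokenizing and feed its events into an ElementTree.TreeBuilder. `SloppyHTMLBuilder`, `VOID_TAGS` and `parse_sloppy_html` are hypothetical names invented for this sketch, the list of void tags is only a subset, and real-world tag soup will need more care than this.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree

# Assumption: a subset of HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SloppyHTMLBuilder(HTMLParser):
    """Hypothetical helper: turn HTMLParser events into an ElementTree."""

    def __init__(self):
        super().__init__()
        self._builder = ElementTree.TreeBuilder()
        self._open_tags = []  # stack of tag names still waiting for a close

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; store them as "".
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)  # void elements are closed at once
        else:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open_tags:
            return  # stray closing tag: ignore it
        # Close everything that was left open inside this element.
        while self._open_tags:
            current = self._open_tags.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self._open_tags:  # drop text outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open_tags:  # close whatever is still open at EOF
            self._builder.end(self._open_tags.pop())
        return self._builder.close()


def parse_sloppy_html(markup):
    """Parse forgiving HTML and return the root Element."""
    parser = SloppyHTMLBuilder()
    parser.feed(markup)
    return parser.close()


root = parse_sloppy_html(
    "<html><body><img src='mogul.jpg'><p>hello</body></html>")
print(root.find("body/img").get("src"))
```

The result is an ordinary Element, so find(), iter() and friends work as usual once the whole document is loaded.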


--
/Morten %-)