| Path | csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <morten.guldager@gmail.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| MIME-Version | 1.0 |
| Date | Sun, 30 Dec 2012 10:52:33 +0100 |
| Subject | Noob trying to parse bad HTML using xml.etree.ElementTree |
| From | Morten Guldager <morten.guldager@gmail.com> |
| To | python-list@python.org |
| Content-Type | multipart/alternative; boundary=14dae9d7127815703304d20edb1a |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1460.1356861162.29569.python-list@python.org> (permalink) |
| Lines | 63 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1356861162 news.xs4all.nl 6962 [2001:888:2000:d::a6]:52828 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | csiph.com comp.lang.python:35798 |
Aloha Friends!

I'm trying to process some HTML using xml.etree.ElementTree.

The problem is that the HTML I'm trying to read has some tags that are not properly closed, such as the <img> shown in line 8 below.

     1  from xml.etree import ElementTree
     2
     3  tree = ElementTree
     4  e = tree.fromstring(
     5  """
     6  <html>
     7  <body>
     8  <img src='mogul.jpg'>
     9  </body>
    10  </html>
    11  """)

Python whines:

    xml.etree.ElementTree.ParseError: mismatched tag: line 5, column 14

I definitely do want to work DOM style, having the whole shebang loaded into a nice structure before I start the real work.

The question is whether it's possible to tweak xml.etree.ElementTree to accept and understand sloppy HTML, or whether you have suggestions for a similar easy-to-use framework, preferably among the included batteries.

-- 
/Morten %-)
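One possible batteries-only approach, sketched here under assumptions rather than taken from the thread: the standard library's `html.parser.HTMLParser` tolerates unclosed tags, and its events can be fed into `xml.etree.ElementTree.TreeBuilder` to get an ordinary element tree. The names `LenientTreeParser`, `finish` and `parse_sloppy_html` are hypothetical helpers invented for this illustration, and the code assumes Python 3. It treats the HTML void elements (such as `img`) as self-closing and closes any still-open tags when an ancestor ends, which is a rough heuristic and not the HTML5 tree-building algorithm.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree

# HTML "void" elements never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LenientTreeParser(HTMLParser):
    """Hypothetical helper: turns HTMLParser events into an ElementTree."""

    def __init__(self):
        super().__init__()
        self._builder = ElementTree.TreeBuilder()
        self._open = []  # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        # Bare attributes (e.g. <input disabled>) arrive with value None.
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        self._builder.start(tag, attrib)
        if tag in VOID_TAGS:
            self._builder.end(tag)  # close void elements immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray end tag, or end tag of a void element: ignore
        # Close everything opened since the matching start tag.
        while self._open:
            name = self._open.pop()
            self._builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        if self._open:  # drop text outside the root element
            self._builder.data(data)

    def finish(self):
        self.close()  # let HTMLParser flush any buffered input
        while self._open:  # close whatever the document left open
            self._builder.end(self._open.pop())
        return self._builder.close()


def parse_sloppy_html(text):
    """Parse lenient HTML text and return the root Element."""
    parser = LenientTreeParser()
    parser.feed(text)
    return parser.finish()
```

With this sketch, `parse_sloppy_html(...)` on the snippet from the post returns an `Element` whose `find("body/img")` carries the `src` attribute, so the usual `find`/`iter` calls work on it. Unclosed non-void tags such as a missing `</p>` end up nested rather than as siblings, so heavily broken markup may still need a dedicated HTML parser.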
Noob trying to parse bad HTML using xml.etree.ElementTree Morten Guldager <morten.guldager@gmail.com> - 2012-12-30 10:52 +0100