From: Neil Cerutti
Newsgroups: comp.lang.python
Subject: Re: ElementTree XML parsing problem
Date: 27 Apr 2011 19:24:52 GMT
Organization: Norwich University
Message-ID: <91r8s4Fk28U4@mid.individual.net>

On 2011-04-27, Mike wrote:
> I'm using ElementTree to parse an XML file, but it stops at the
> second record (id = 002), which contains a non-standard ascii
> character, ?. Here's the XML:
>
>
>
> The complaint offered up by the parser is
>
> Unexpected error opening simple_fail.xml: not well-formed
> (invalid token): line 5, column 40

It seems to be an invalid XML document, as another poster
indicated.

> and if I change the line to eliminate the ?, everything is
> wonderful. The parser is perfectly happy with this
> modification:
>
>
>
> I can't find anything in the ElementTree docs about allowing
> additional text characters or coercing strange ascii to
> Unicode.

If you're not the one generating that bogus file, then you can
specify the encoding yourself by passing in your own XMLParser.
The encoding you give it overrides whatever the parser would
otherwise assume for the file.

import xml.etree.ElementTree as etree

# Binary mode, so the parser (not open()) decodes the bytes.
with open('file.xml', 'rb') as xml_file:
    # ISO-8859-1 is a guess at what the file really contains.
    parser = etree.XMLParser(encoding='ISO-8859-1')
    root = etree.parse(xml_file, parser=parser).getroot()

-- 
Neil Cerutti
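
For anyone who wants to try the encoding override without the original file, here is a minimal sketch. The sample bytes, the element and attribute names, and the two helper functions are all made up for illustration; the only library calls are `XMLParser(encoding=...)`, `fromstring`, and `ParseError` from `xml.etree.ElementTree`. It assumes the stray character is a single Latin-1 byte, which may not match the real file.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample: one record whose attribute holds a raw Latin-1
# byte (0xE9), with no XML declaration naming an encoding.
SAMPLE = b'<records><record id="002" name="caf\xe9"/></records>'


def parse_with_encoding(data, encoding):
    """Parse raw bytes, telling the parser which encoding to assume."""
    parser = etree.XMLParser(encoding=encoding)
    return etree.fromstring(data, parser=parser)


def default_parse_fails(data):
    """Return True if the default parser rejects the bytes."""
    try:
        etree.fromstring(data)
    except etree.ParseError:
        return True
    return False


root = parse_with_encoding(SAMPLE, "ISO-8859-1")
name = root.find("record").get("name")
```

With no override the parser treats the bytes as UTF-8 and should reject the lone 0xE9 byte; with the override the attribute comes back as a normal Unicode string. If the file is really in some other encoding (Windows-1252, say), the same approach applies with a different encoding name.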