From: Mike
Newsgroups: comp.lang.python
Subject: Re: ElementTree XML parsing problem
Date: Wed, 27 Apr 2011 13:43:20 -0700

On 4/27/2011 12:24 PM, Neil Cerutti wrote:
> On 2011-04-27, Mike wrote:
>> I'm using ElementTree to parse an XML file, but it stops at the
>> second record (id = 002), which contains a non-standard ascii
>> character, ?. Here's the XML:
>>
>>
>> The complaint offered up by the parser is
>>
>> Unexpected error opening simple_fail.xml: not well-formed
>> (invalid token): line 5, column 40
>
> It seems to be an invalid XML document, as another poster
> indicated.
>
>> and if I change the line to eliminate the ?, everything is
>> wonderful. The parser is perfectly happy with this
>> modification:
>>
>>
>> I can't find anything in the ElementTree docs about allowing
>> additional text characters or coercing strange ascii to
>> Unicode.
>
> If you're not the one generating that bogus file, then you can
> specify the encoding yourself instead by declaring an XMLParser.
>
>     import xml.etree.ElementTree as etree
>     # Open in binary mode so the parser, not open(), decodes the bytes.
>     with open('file.xml', 'rb') as xml_file:
>         parser = etree.XMLParser(encoding='ISO-8859-1')
>         root = etree.parse(xml_file, parser=parser).getroot()
>

Thanks, Neil. I'm not generating the file, just trying to parse it.
Your solution is precisely what I was looking for, even if I didn't
quite ask correctly. I appreciate the help!

--
Mike
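
For anyone who finds this thread later, here is a minimal, self-contained
sketch of the approach Neil describes. The sample bytes, the element and
attribute names, and the helper names parse_latin1 and default_parse_fails
are all made up for illustration (the real file is not reproduced here),
and it assumes the stray byte really is ISO-8859-1.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: no encoding declaration, and the second
# record contains the single Latin-1 byte 0xE9, which is not valid UTF-8.
SAMPLE = (
    b'<records>\n'
    b'<record id="001" name="plain"/>\n'
    b'<record id="002" name="caf\xe9"/>\n'
    b'</records>'
)


def parse_latin1(source):
    """Parse a binary file-like object, forcing ISO-8859-1 decoding."""
    parser = ET.XMLParser(encoding="ISO-8859-1")
    return ET.parse(source, parser=parser).getroot()


def default_parse_fails(data):
    """Return True if the default parser rejects the given bytes."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return True
    return False


# With the encoding override, both records are read.
root = parse_latin1(io.BytesIO(SAMPLE))
names = [rec.get("name") for rec in root.iter("record")]
```

The override only helps when the guess about the encoding is right; if the
file is actually in some other 8-bit encoding, the parse will succeed but
the text will come out wrong.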