| Header | Value |
|---|---|
| From | Mike <Mike@invalid.invalid> |
| Newsgroups | comp.lang.python |
| Subject | ElementTree XML parsing problem |
| Date | Wed, 27 Apr 2011 11:26:05 -0700 |
| Message-ID | <ip9n72$ol6$1@dont-email.me> |
I'm using ElementTree to parse an XML file, but it stops at the second
record (id = 002), which contains a non-ASCII character, ä.
Here's the XML:
<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>
The complaint offered up by the parser is
Unexpected error opening simple_fail.xml: not well-formed (invalid
token): line 5, column 40
and if I change the line to eliminate the ä, everything is wonderful.
The parser is perfectly happy with this modification:
<record id="002" education="University Bremen" employment="3 yrs" />
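One plausible reading of this behaviour, offered as a sketch rather than a diagnosis: the post does not say how simple_fail.xml was saved, but if it is stored as ISO-8859-1 (an assumption), the ä is a single byte 0xE4. An XML file with no encoding declaration and no byte order mark is read as UTF-8, and 0xE4 followed by a plain "t" is not a valid UTF-8 sequence. The helper name `is_valid_utf8` below is made up for illustration.

```python
# Assumption: simple_fail.xml was saved as ISO-8859-1 (Latin-1).
# The original post does not state the file's encoding.
text = "Universität Bremen"

latin1_bytes = text.encode("iso-8859-1")  # 'ä' becomes the single byte 0xE4
utf8_bytes = text.encode("utf-8")         # 'ä' becomes the two bytes 0xC3 0xA4


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes)
print(utf8_bytes)
print("Latin-1 bytes valid as UTF-8:", is_valid_utf8(latin1_bytes))
print("UTF-8 bytes valid as UTF-8:", is_valid_utf8(utf8_bytes))
```

If the file really is Latin-1, this would also explain why replacing "Universität" with "University" makes the error go away: the remaining bytes are plain ASCII, which is valid UTF-8.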
I can't find anything in the ElementTree docs about allowing additional
text characters or coercing non-ASCII characters to Unicode.
Is there a way to coerce the text so it doesn't cause the parser to
raise an exception?
Here's my test script (simple_fail contains the offending line, and
simple_pass contains the line that passes).
import sys
import xml.etree.ElementTree as ET

def main():
    xml_files = ['simple_fail.xml', 'simple_pass.xml']
    for xml_file in xml_files:
        print
        print 'XML file: %s' % (xml_file)
        try:
            tree = ET.parse(xml_file)
        except Exception, inst:
            print "Unexpected error opening %s: %s" % (xml_file, inst)
            continue
        root = tree.getroot()
        records = root.find('records')
        for record in records:
            print record.attrib['id'], record.attrib['education']

if __name__ == "__main__":
    main()
Thanks,
-- Mike --
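For readers who land on this question, here is a minimal sketch of two ways the parse might be made to succeed, written for Python 3 rather than the Python 2 of the script above. It assumes the file on disk is ISO-8859-1, which the post does not confirm; the in-memory documents, the `BODY` constant, and the helper `education_values` are invented for illustration. The first option declares the encoding in the XML declaration; the second leaves the document untouched and tells the parser which encoding to use.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the document from the post, without the XML declaration.
BODY = (
    '<snapshot time="Mon Apr 25 08:47:23 PDT 2011">\n'
    '<records>\n'
    '<record id="002" education="Universität Bremen" employment="3 years" />\n'
    '</records>\n'
    '</snapshot>\n'
)


def education_values(xml_bytes, parser=None):
    """Parse raw bytes and return the education attribute of each record (hypothetical helper)."""
    root = ET.fromstring(xml_bytes, parser=parser)
    return [record.attrib["education"] for record in root.find("records")]


# Assumption: the bytes on disk are ISO-8859-1, so we encode the text that way.
undeclared = ('<?xml version="1.0"?>\n' + BODY).encode("iso-8859-1")
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n' + BODY).encode("iso-8859-1")

# With no declared encoding the parser assumes UTF-8 and rejects the 0xE4 byte.
try:
    education_values(undeclared)
    undeclared_failed = False
except ET.ParseError as exc:
    undeclared_failed = True
    print("undeclared:", exc)

# Option 1: declare the encoding in the XML declaration itself.
print("declared:", education_values(declared))

# Option 2: keep the document as-is and override the encoding on the parser.
override = ET.XMLParser(encoding="iso-8859-1")
print("override:", education_values(undeclared, parser=override))
```

Declaring the encoding in the file is usually the more robust choice, since the document then describes itself to any XML parser. The parser override is mainly useful when the file cannot be changed.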