ElementTree XML parsing problem

From Mike <Mike@invalid.invalid>
Newsgroups comp.lang.python
Subject ElementTree XML parsing problem
Date 2011-04-27 11:26 -0700
Organization A noiseless patient Spider
Message-ID <ip9n72$ol6$1@dont-email.me> (permalink)


I'm using ElementTree to parse an XML file, but it stops at the second
record (id = 002), which contains a non-ASCII character, ä.
Here's the XML:

<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>

The complaint offered up by the parser is

Unexpected error opening simple_fail.xml: not well-formed (invalid 
token): line 5, column 40

and if I change the line to eliminate the ä, everything is wonderful. 
The parser is perfectly happy with this modification:

<record id="002" education="University Bremen" employment="3 yrs" />
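Since the two versions of the line differ only in the ä, it can help to look at the raw bytes the parser receives. The following is a minimal sketch, not part of the original script: `find_non_ascii` is a hypothetical helper, and the sample bytes assume the file was saved as ISO-8859-1 (where ä is the single byte 0xE4), which the post does not confirm.

```python
def find_non_ascii(data):
    """Return (line, column, byte_value) for every byte above 0x7F.

    Lines are 1-based; columns are 0-based byte offsets within the line.
    """
    hits = []
    for line_no, line in enumerate(data.split(b'\n'), 1):
        # bytearray yields integers on both Python 2 and Python 3.
        for col, byte in enumerate(bytearray(line)):
            if byte > 0x7F:
                hits.append((line_no, col, byte))
    return hits

# Assumed sample: a shortened copy of the data, encoded as ISO-8859-1.
sample = (
    b'<?xml version="1.0"?>\n'
    b'<records>\n'
    b'<record id="002" education="Universit\xe4t Bremen" />\n'
    b'</records>\n'
)
hits = find_non_ascii(sample)
```

For a real file, the same helper can be fed the result of `open('simple_fail.xml', 'rb').read()`. A lone byte such as 0xE4 would suggest a single-byte encoding, while a two-byte sequence 0xC3 0xA4 would suggest UTF-8.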

I can't find anything in the ElementTree docs about allowing additional
text characters or coercing non-ASCII characters to Unicode.
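One possible explanation, offered as an assumption rather than a diagnosis: an XML file with no encoding declaration is read as UTF-8, so a file saved as ISO-8859-1 contains a byte sequence that is invalid in UTF-8. The sketch below parses the same content three ways to illustrate this. `BODY` and `try_parse` are illustrative names, and the record list is shortened to the one record with the ä.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the data; \u00e4 is the character ä.
BODY = (
    u'<snapshot time="Mon Apr 25 08:47:23 PDT 2011">\n'
    u'<records>\n'
    u'<record id="002" education="Universit\u00e4t Bremen" employment="3 years" />\n'
    u'</records>\n'
    u'</snapshot>\n'
)

# Same content, three byte-level variants.
latin1_no_decl = (u'<?xml version="1.0"?>\n' + BODY).encode('iso-8859-1')
latin1_with_decl = (
    u'<?xml version="1.0" encoding="ISO-8859-1"?>\n' + BODY
).encode('iso-8859-1')
utf8_no_decl = (u'<?xml version="1.0"?>\n' + BODY).encode('utf-8')

def try_parse(data):
    """Return the education attribute, or the parser's error message."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        return 'error: %s' % exc
    return root.find('records')[0].attrib['education']

results = {
    'latin1, no declaration': try_parse(latin1_no_decl),
    'latin1, declared': try_parse(latin1_with_decl),
    'utf-8, no declaration': try_parse(utf8_no_decl),
}
```

If this assumption matches the real file, either adding `encoding="ISO-8859-1"` to the `<?xml ...?>` line or re-saving the file as UTF-8 should let the unmodified script parse it.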

Is there a way to coerce the text so it doesn't cause the parser to 
raise an exception?
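If the file itself cannot be changed, one option is to tell the parser which encoding to assume. This is a hedged sketch: `parse_with_fallback` is a hypothetical helper, and ISO-8859-1 is only a guess at how the file was saved. It relies on the `encoding` argument of `ET.XMLParser`, which overrides whatever encoding the file declares.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def parse_with_fallback(path, fallback_encoding='ISO-8859-1'):
    """Parse normally first; on a parse error, retry with an assumed encoding."""
    try:
        return ET.parse(path)
    except ET.ParseError:
        # The encoding given here overrides the one declared in the file.
        parser = ET.XMLParser(encoding=fallback_encoding)
        return ET.parse(path, parser=parser)

# Demonstration with a temporary file written as ISO-8859-1 bytes (assumed).
data = (
    b'<?xml version="1.0"?>\n'
    b'<records>\n'
    b'<record id="002" education="Universit\xe4t Bremen" />\n'
    b'</records>\n'
)
fd, path = tempfile.mkstemp(suffix='.xml')
try:
    with os.fdopen(fd, 'wb') as handle:
        handle.write(data)
    tree = parse_with_fallback(path)
    education = tree.getroot()[0].attrib['education']
finally:
    os.remove(path)
```

The fallback is a guess: if the file is actually in some other single-byte encoding, the retry will succeed without an error but produce the wrong characters, so fixing the file's declaration is usually the safer route.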

Here's my test script (simple_fail contains the offending line, and 
simple_pass contains the line that passes).

import xml.etree.ElementTree as ET

def main():
    xml_files = ['simple_fail.xml', 'simple_pass.xml']
    for xml_file in xml_files:
        print('')
        print('XML file: %s' % xml_file)

        # Report a parse failure and move on to the next file.
        try:
            tree = ET.parse(xml_file)
        except Exception as inst:
            print("Unexpected error opening %s: %s" % (xml_file, inst))
            continue

        # Print the id and education attributes of each <record>.
        root = tree.getroot()
        records = root.find('records')
        for record in records:
            print('%s %s' % (record.attrib['id'], record.attrib['education']))

if __name__ == "__main__":
    main()


Thanks,

-- Mike --
