Newsgroup	comp.lang.python

XML parsing ExpatError with xml.dom.minidom at line 1, column 0

Started by	ming <hseuming@gmail.com>
First post	2014-02-13 11:27 -0800
Last post	2014-02-13 20:19 +0000
Articles	3 — 3 participants


  XML parsing ExpatError with xml.dom.minidom at line 1, column 0 ming <hseuming@gmail.com> - 2014-02-13 11:27 -0800
    Re: XML parsing ExpatError with xml.dom.minidom at line 1, column 0 Peter Otten <__peter__@web.de> - 2014-02-13 21:10 +0100
    Re: XML parsing ExpatError with xml.dom.minidom at line 1, column 0 MRAB <python@mrabarnett.plus.com> - 2014-02-13 20:19 +0000

#66215 — XML parsing ExpatError with xml.dom.minidom at line 1, column 0

From	ming <hseuming@gmail.com>
Date	2014-02-13 11:27 -0800
Subject	XML parsing ExpatError with xml.dom.minidom at line 1, column 0
Message-ID	<f4bc13be-60a7-42d7-9123-c134fed45042@googlegroups.com>

Hi,
I have a Python script that stopped working about a month ago. Until then, it had worked flawlessly for months (if not years). A tiny, self-contained 7-line script is provided below.

I ran into an XML parsing problem with xml.dom.minidom; the error message is included below. The weird thing is that if I point an online XML validator directly at this URL, it reports no problems. Moreover, if I save the page source in Firefox or Chrome and validate the saved XML file, that is fine too.

Since the error happens at the very beginning of the input (line 1, column 0), as indicated below, I was wondering whether this is an encoding mismatch. However, according to the page source saved from Firefox or Chrome, the document begins with:
   <?xml version="1.0" encoding="UTF-8"?>


<program>
=================================================
#!/usr/bin/env python

import urllib2
from xml.dom.minidom import parseString

fd = urllib2.urlopen('http://api.worldbank.org/countries')
data = fd.read()
fd.close()
dom = parseString(data)
=================================================


<error msg>
=================================================
Traceback (most recent call last):
  File "./bugReport.py", line 9, in <module>
    dom = parseString(data)
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1931, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0
=================================================


I'm running Python 2.7.5+ on Ubuntu 13.10.

Thanks.


#66230

From	Peter Otten <__peter__@web.de>
Date	2014-02-13 21:10 +0100
Message-ID	<mailman.6873.1392322243.18130.python-list@python.org>
In reply to	#66215

Looking into the data returned from the server:

>>> data = urllib2.urlopen("http://api.worldbank.org/countries").read()
>>> with open("tmp.dat", "wb") as f: f.write(data)
... 
>>> 
[1]+  Stopped                 python
$ file tmp.dat
tmp.dat: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)
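
The same check can be done without leaving Python: a gzip stream always begins with the two bytes 0x1f 0x8b, while an XML document normally begins with `<` (possibly after a byte-order mark or whitespace). A minimal sketch in Python 3 syntax; `looks_like_gzip` is a hypothetical helper name, not part of the standard library, and the sample bytes are made up so the snippet runs offline:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(data):
    """Return True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Stand-in for the bytes returned by the server (sample data, not the real response).
plain = b'<?xml version="1.0" encoding="UTF-8"?><countries/>'
compressed = gzip.compress(plain)

print(looks_like_gzip(plain))       # False
print(looks_like_gzip(compressed))  # True
```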

OK, let's expand:

$ fg
python


>>> import gzip, StringIO
>>> expanded_data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
>>> import xml.dom.minidom
>>> xml.dom.minidom.parseString(expanded_data)
<xml.dom.minidom.Document instance at 0x19a1320>
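
For readers on Python 3, where `urllib2` and `StringIO.StringIO` are no longer available under those names, roughly the same steps can be sketched with `gzip.decompress`. The sample document below is made up so the snippet runs offline; it only stands in for the server response:

```python
import gzip
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Made-up sample standing in for the XML the server would send.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><countries><country id="ABW"/></countries>'
data = gzip.compress(xml_bytes)  # what arrives when the body is gzip-compressed

# Feeding the compressed bytes to the parser fails, because they are not XML.
try:
    parseString(data)
    parse_error = None
except ExpatError as exc:
    parse_error = exc

# Decompressing first gives the parser real XML.
expanded_data = gzip.decompress(data)
dom = parseString(expanded_data)
print(dom.documentElement.tagName)  # countries
```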

There may be a way to uncompress the gzipped data transparently, but I'm too 
lazy to look it up...
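
One way to get close to transparent handling, sketched here in Python 3 syntax: look at the `Content-Encoding` response header and decompress only when it says `gzip`. The helper names `decode_body` and `fetch_xml` are assumptions for illustration, and the thread does not confirm whether this particular server sets the header, so the magic-number check is kept as a fallback:

```python
import gzip
import urllib.request
from xml.dom.minidom import parseString


def decode_body(raw, content_encoding=None):
    """Return the body bytes, gunzipped when the response was gzip-encoded."""
    declared = (content_encoding or "").strip().lower()
    if declared == "gzip" or raw[:2] == b"\x1f\x8b":
        return gzip.decompress(raw)
    return raw


def fetch_xml(url):
    """Download url and return a minidom Document (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding")
    return parseString(decode_body(raw, encoding))


# Offline check with made-up data instead of a real download.
sample = b"<countries/>"
print(decode_body(gzip.compress(sample), "gzip") == sample)  # True
print(decode_body(sample) == sample)                         # True
```

Calling `fetch_xml('http://api.worldbank.org/countries')` would then replace the `urlopen`/`read`/`parseString` sequence from the original script. Third-party HTTP clients such as `requests` decode gzip-encoded responses on their own, which may be simpler if adding a dependency is acceptable.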


#66231

From	MRAB <python@mrabarnett.plus.com>
Date	2014-02-13 20:19 +0000
Message-ID	<mailman.6874.1392322769.18130.python-list@python.org>
In reply to	#66215

On 2014-02-13 20:10, Peter Otten wrote:
> There may be a way to uncompress the gzipped data transparently, but I'm too
> lazy to look it up...
>
From a brief look at the docs, it looks like you can specify the
response format. For example, for JSON:

     fd = urllib2.urlopen('http://api.worldbank.org/countries?format=json')
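
A minimal sketch of that idea in Python 3 syntax: build the query string with `urllib.parse.urlencode` and decode the body with `json.loads`. The `format=json` parameter comes from the suggestion above; `build_url` is a hypothetical helper, and the sample payload is invented, since the thread does not show what the service actually returns. If the server compresses this response as well, it would still need to be decompressed before parsing.

```python
import json
from urllib.parse import urlencode


def build_url(base, **params):
    """Append query parameters to base (assumes base has no query string yet)."""
    return base + "?" + urlencode(params) if params else base


url = build_url("http://api.worldbank.org/countries", format="json")
print(url)  # http://api.worldbank.org/countries?format=json

# Invented payload for illustration only; the real response layout may differ.
body = b'[{"id": "ABW", "name": "Aruba"}]'
countries = json.loads(body.decode("utf-8"))
print(countries[0]["name"])  # Aruba
```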

