Groups > comp.lang.python > #32299 > unrolled thread

problems with xml parsing (python 3.3)

Started by	jannidis@gmail.com
First post	2012-10-27 19:27 -0700
Last post	2012-10-30 05:37 -0700
Articles	6 — 3 participants


  problems with xml parsing (python 3.3) jannidis@gmail.com - 2012-10-27 19:27 -0700
    Re: problems with xml parsing (python 3.3) jannidis@gmail.com - 2012-10-27 19:30 -0700
    Re: problems with xml parsing (python 3.3) MRAB <python@mrabarnett.plus.com> - 2012-10-28 03:08 +0000
    Re: problems with xml parsing (python 3.3) Dieter Maurer <dieter@handshake.de> - 2012-10-28 08:30 +0100
    Re: problems with xml parsing (python 3.3) jannidis@gmail.com - 2012-10-29 15:54 -0700
    Re: problems with xml parsing (python 3.3) jannidis@gmail.com - 2012-10-30 05:37 -0700

#32299 — problems with xml parsing (python 3.3)

From	jannidis@gmail.com
Date	2012-10-27 19:27 -0700
Subject	problems with xml parsing (python 3.3)
Message-ID	<97d8de0d-3daa-49be-a91f-c65fc8a9019f@googlegroups.com>

Hello all, 

I am new to Python and have a problem with the behaviour of the xml parser. Assume we have this xml document:

<?xml version="1.0" encoding="UTF-8"?>
<bibliography>
    <entry>
            Title of the first book.
        </entry>
        <entry>
            <coauthored/>
Title of the second book.
        </entry>
</bibliography>    


If I now check the text of all 'entry' nodes, the text for the node that contains the empty element isn't shown:



import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='test.xml')
root = tree.getroot()
resultSet = root.findall(".//entry")
for r in resultSet:
    print(r.text)


#32300

From	jannidis@gmail.com
Date	2012-10-27 19:30 -0700
Message-ID	<8937bca3-2443-4ee9-aea5-6f1924bdcf66@googlegroups.com>
In reply to	#32299

To my understanding the empty element is a child of entry as is the text node. 
Is there anything I am doing wrong here? Any help is appreciated,

Fotis


#32301

From	MRAB <python@mrabarnett.plus.com>
Date	2012-10-28 03:08 +0000
Message-ID	<mailman.2949.1351393739.27098.python-list@python.org>
In reply to	#32299

On 2012-10-28 02:27, jannidis@gmail.com wrote:
> Hello all,
>
> I am new to Python and have a problem with the behaviour of the xml parser. Assume we have this xml document:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <bibliography>
>      <entry>
>              Title of the first book.
>          </entry>
>          <entry>
>              <coauthored/>
> Title of the second book.
>          </entry>
> </bibliography>
>
>
> If I now check for the text of all 'entry' nodes, the text for the node with the empty element isn't shown
>
>
>
> import xml.etree.ElementTree as ET
> tree = ET.ElementTree(file='test.xml')
> root = tree.getroot()
> resultSet = root.findall(".//entry")
> for r in resultSet:
> 	print (r.text)
>
It _is_ shown, it's just that it's all whitespace:

>>> for r in resultSet:
...     print(ascii(r.text))
...
'\n            Title of the first book.\n        '
'\n            '
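
A self-contained way to reproduce this check, together with one possible way to collect the visible text wherever it is stored, might look like the sketch below. It assumes the document from the first post is inlined as a bytes literal instead of being read from test.xml; the names XML and visible_text are made up for illustration. Element.itertext() is part of the standard library.

```python
import xml.etree.ElementTree as ET

# The document from the first post, inlined so the sketch needs no test.xml.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<bibliography>
    <entry>
            Title of the first book.
        </entry>
        <entry>
            <coauthored/>
Title of the second book.
        </entry>
</bibliography>
"""


def visible_text(element):
    """Hypothetical helper: join every piece of text found inside element."""
    # itertext() walks the element's own text plus the text and tail
    # of all its descendants, in document order.
    return "".join(element.itertext()).strip()


root = ET.fromstring(XML)
entries = root.findall(".//entry")

raw = [entry.text for entry in entries]            # what print(r.text) sees
titles = [visible_text(entry) for entry in entries]

for r, t in zip(raw, titles):
    print(ascii(r), "->", ascii(t))
```

The raw value for the second entry is whitespace only, while the joined text contains the title.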


#32306

From	Dieter Maurer <dieter@handshake.de>
Date	2012-10-28 08:30 +0100
Message-ID	<mailman.2953.1351409448.27098.python-list@python.org>
In reply to	#32299

jannidis@gmail.com writes:

> I am new to Python and have a problem with the behaviour of the xml parser. Assume we have this xml document:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <bibliography>
>     <entry>
>             Title of the first book.
>         </entry>
>         <entry>
>             <coauthored/>
> Title of the second book.
>         </entry>
> </bibliography>    
>
>
> If I now check for the text of all 'entry' nodes, the text for the node with the empty element isn't shown
>
>
>
> import xml.etree.ElementTree as ET
> tree = ET.ElementTree(file='test.xml')
> root = tree.getroot()
> resultSet = root.findall(".//entry")
> for r in resultSet:
> 	print (r.text)

I do not know about "xml.etree", but the reportedly quite compatible
"lxml.etree" handles text nodes quite differently from "DOM": they are
*not* considered children of the parent element. Instead, each one is
attached as a "text" or "tail" attribute: as "text" on the container
element if the text is the first DOM node inside it, otherwise as "tail"
on the preceding element.

Your code snippet suggests that "xml.etree" behaves identically in
this respect. In this case, you would find "Title of the second book"
as the "tail" attribute of the element "coauthored".
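
The text/tail placement described above can be checked with the standard library's xml.etree.ElementTree. This is only a minimal sketch: it assumes the document from the first post is inlined as a bytes literal, and the variable names (XML, second, coauthored, title) are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The document from the first post, inlined so the sketch needs no test.xml.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<bibliography>
    <entry>
            Title of the first book.
        </entry>
        <entry>
            <coauthored/>
Title of the second book.
        </entry>
</bibliography>
"""

root = ET.fromstring(XML)
second = root.findall(".//entry")[1]
coauthored = second.find("coauthored")

# Text before the first child element is stored on the container as .text.
print("entry.text      :", ascii(second.text))

# Text after a child element is stored on that child as .tail.
print("coauthored.tail :", ascii(coauthored.tail))

title = coauthored.tail.strip()
print("title           :", title)
```

Here the entry's own text is whitespace only, and the title is found in the tail of the empty element.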


#32435

From	jannidis@gmail.com
Date	2012-10-29 15:54 -0700
Message-ID	<07d64ccf-3b8a-4ffa-baa2-5fb566a8a715@googlegroups.com>
In reply to	#32299

On Sunday, 28 October 2012 03:27:14 UTC+1, jann...@gmail.com wrote:
> Hello all,
>
> I am new to Python and have a problem with the behaviour of the xml parser. Assume we have this xml document:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <bibliography>
>     <entry>
>             Title of the first book.
>         </entry>
>         <entry>
>             <coauthored/>
> Title of the second book.
>         </entry>
> </bibliography>
>
> If I now check for the text of all 'entry' nodes, the text for the node with the empty element isn't shown
>
> import xml.etree.ElementTree as ET
> tree = ET.ElementTree(file='test.xml')
> root = tree.getroot()
> resultSet = root.findall(".//entry")
> for r in resultSet:
> 	print (r.text)

Thanks a lot for your answer. As I am looking for a tool for teaching the use of XML in programming, it is a pity that this module implements such an idiosyncratic view of XML data. But DOM and SAX are available too, so I will look at them.
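
For comparison, a DOM view of the same document can be sketched with the standard library's xml.dom.minidom, where text nodes appear as children of the entry element. This is an illustration only: the document is assumed to be inlined as a bytes literal, and the variable names are made up.

```python
from xml.dom import minidom

# The document from the first post, inlined so the sketch needs no test.xml.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<bibliography>
    <entry>
            Title of the first book.
        </entry>
        <entry>
            <coauthored/>
Title of the second book.
        </entry>
</bibliography>
"""

doc = minidom.parseString(XML)
second = doc.getElementsByTagName("entry")[1]

# In the DOM, both text nodes and element nodes are children of <entry>.
for node in second.childNodes:
    if node.nodeType == node.TEXT_NODE:
        print("text node   :", ascii(node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        print("element node:", node.tagName)

# Collect the character data of all text-node children of the second entry.
dom_title = "".join(
    node.data for node in second.childNodes
    if node.nodeType == node.TEXT_NODE
).strip()
print(dom_title)
```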


#32484

From	jannidis@gmail.com
Date	2012-10-30 05:37 -0700
Message-ID	<7bfc9cac-4c9a-4ee4-b654-609b4ee93fc2@googlegroups.com>
In reply to	#32299

If someone comes across this posting with the same problem, the best answer seems to be:
avoid Python's xml.etree.ElementTree and use this library instead:
http://lxml.de/
It works as expected and supports XPath much better.
