parsing nested unbounded XML fields with ElementTree

Started by	"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
First post	2013-11-25 14:22 -0800
Last post	2013-11-27 09:58 -0500
Articles	7 — 5 participants

  parsing nested unbounded XML fields with ElementTree "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2013-11-25 14:22 -0800
    Re: parsing nested unbounded XML fields with ElementTree Chris Angelico <rosuav@gmail.com> - 2013-11-26 09:30 +1100
    Re: parsing nested unbounded XML fields with ElementTree Stefan Behnel <stefan_ml@behnel.de> - 2013-11-26 08:38 +0100
    Re: parsing nested unbounded XML fields with ElementTree Larry Martell <larry.martell@gmail.com> - 2013-11-26 07:23 -0500
    Re: parsing nested unbounded XML fields with ElementTree Stefan Behnel <stefan_ml@behnel.de> - 2013-11-26 14:20 +0100
    Re: parsing nested unbounded XML fields with ElementTree Neil Cerutti <mr.cerutti@gmail.com> - 2013-11-26 10:27 -0500
    Re: parsing nested unbounded XML fields with ElementTree Larry Martell <larry.martell@gmail.com> - 2013-11-27 09:58 -0500

#60456 — parsing nested unbounded XML fields with ElementTree

From	"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date	2013-11-25 14:22 -0800
Subject	parsing nested unbounded XML fields with ElementTree
Message-ID	<d75d1f2c-05c6-4fa6-ae0a-28e13de3097a@googlegroups.com>

I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:

<Node Name="A">
   <Node Name="B">
      <Node Name="C">
        <Node Name="D">
          <Node Name="E">

When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:

nodes = []

def parseChild(c):
    if c.tag == 'Node':
        if 'Name' in c.attrib: 
            nodes.append(c.attrib['Name'])
        for c1 in c:
            parseChild(c1)
    else:
        for node in nodes:
            print node,
        print c.tag

for parent in tree.getiterator():
    for child in parent:
        for x in child:
            parseChild(x)

My problem is that I don't know when I'm done with a node and I should remove a level of nesting. I would think this is a fairly common situation, but I could not find any examples of parsing a file like this. Perhaps I'm going about it completely wrong.

#60457

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-26 09:30 +1100
Message-ID	<mailman.3197.1385418647.18130.python-list@python.org>
In reply to	#60456

On Tue, Nov 26, 2013 at 9:22 AM, Larry.Martell@gmail.com
<Larry.Martell@gmail.com> wrote:
> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>
> <Node Name="A">
>    <Node Name="B">
>       <Node Name="C">
>         <Node Name="D">
>           <Node Name="E">

First off, please clarify: Are there five corresponding </Node> tags
later on? If not, it's not XML, and nesting will have to be defined
some other way.

Secondly, please get off Google Groups. Your initial post is
malformed, and unless you specifically fight the software, your
replies will be even more malformed, to the point of being quite
annoying. There are many other ways to read a newsgroup, or you can
subscribe to the mailing list python-list@python.org, which carries
the same content.

ChrisA

#60488

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2013-11-26 08:38 +0100
Message-ID	<mailman.3218.1385451509.18130.python-list@python.org>
In reply to	#60456

Larry.Martell...@gmail.com, 25.11.2013 23:22:
> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
> 
> <Node Name="A">
>    <Node Name="B">
>       <Node Name="C">
>         <Node Name="D">
>           <Node Name="E">
> 
> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
> 
> nodes = []
> 
> def parseChild(c):
>     if c.tag == 'Node':
>         if 'Name' in c.attrib: 
>             nodes.append(c.attrib['Name'])
>         for c1 in c:
>             parseChild(c1)
>     else:
>         for node in nodes:
>             print node,
>         print c.tag
> 
> for parent in tree.getiterator():
>     for child in parent:
>         for x in child:
>             parseChild(x)

This seems hugely redundant. tree.getiterator() already returns a recursive
iterable, and then, for each node in your document, you are running
recursively over its entire subtree. Meaning that you'll visit each node as
many times as its depth in the tree.
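
The redundancy can be made visible with a small counting sketch. It assumes the five-level sample from the question with the closing tags added, replaces the printing in parseChild() with a visit counter (parse_child and visits are illustrative names), and uses iter() because getiterator() was removed in Python 3.9:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The sample from the question, with closing tags added (an assumption).
XML = """<Node Name="A"><Node Name="B"><Node Name="C">
<Node Name="D"><Node Name="E"/></Node></Node></Node></Node>"""

tree = ET.ElementTree(ET.fromstring(XML))
visits = Counter()

def parse_child(elem):
    # Same recursion shape as parseChild(), but it only counts visits.
    visits[elem.get("Name")] += 1
    for sub in elem:
        parse_child(sub)

# Same triple loop as in the question.
for parent in tree.iter():
    for child in parent:
        for x in child:
            parse_child(x)

for name in "ABCDE":
    print(name, visits[name])
# A 0
# B 0
# C 1
# D 2
# E 3
```

On this input the visit count grows with depth, and A and B are never passed to the function at all, because the loop only hands grandchildren to it.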


> My problem is that I don't know when I'm done with a node and I should
> remove a level of nesting. I would think this is a fairly common
> situation, but I could not find any examples of parsing a file like
> this. Perhaps I'm going about it completely wrong.

Your recursive traversal function tells you when you're done. If you drop
the getiterator() bit, reaching the end of parseChild() means that you're
done with the element and start backing up. So you can simply pass down a
list of element names that you append() at the beginning of the function
and pop() at the end, i.e. a stack. That list will then always give you the
current path from the root node.

Alternatively, if you want to use lxml.etree instead of ElementTree, you
can use its iterwalk() function, which gives you the same thing but
without recursion, as a plain iterator.

http://lxml.de/parsing.html#iterparse-and-iterwalk

Stefan
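
For readers who prefer to stay in the standard library, a similar non-recursive walk can be sketched with xml.etree.ElementTree.iterparse() and its "start"/"end" events. This is a substitution for lxml's iterwalk(), not the function named above: iterparse() works on a file or file-like source, not on an already-parsed tree. The sample document, the Item element and the iter_paths name are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: nested Node elements plus one unrelated element.
XML = b'<Node Name="A"><Node Name="B"><Item/><Node Name="C"/></Node><Node Name="D"/></Node>'

def iter_paths(source):
    """Yield 'A/B/...' for every named Node element, in document order."""
    path = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != "Node" or elem.get("Name") is None:
            continue  # ignore elements that do not contribute to the path
        if event == "start":
            # Attributes are already available when "start" is reported.
            path.append(elem.get("Name"))
            yield "/".join(path)
        else:
            # "end" means the element is finished: drop one nesting level.
            path.pop()

paths = list(iter_paths(io.BytesIO(XML)))
print(paths)  # ['A', 'A/B', 'A/B/C', 'A/D']
```

The "end" event plays the role of "reaching the end of the function" in the recursive version: it is the point where one level of nesting is removed.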

#60506

From	Larry Martell <larry.martell@gmail.com>
Date	2013-11-26 07:23 -0500
Message-ID	<mailman.3232.1385468604.18130.python-list@python.org>
In reply to	#60456

On Tue, Nov 26, 2013 at 2:38 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
> Larry.Martell...@gmail.com, 25.11.2013 23:22:
>> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>>
>> <Node Name="A">
>>    <Node Name="B">
>>       <Node Name="C">
>>         <Node Name="D">
>>           <Node Name="E">
>>
>> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
>>
>> nodes = []
>>
>> def parseChild(c):
>>     if c.tag == 'Node':
>>         if 'Name' in c.attrib:
>>             nodes.append(c.attrib['Name'])
>>         for c1 in c:
>>             parseChild(c1)
>>     else:
>>         for node in nodes:
>>             print node,
>>         print c.tag
>>
>> for parent in tree.getiterator():
>>     for child in parent:
>>         for x in child:
>>             parseChild(x)
>
> This seems hugely redundant. tree.getiterator() already returns a recursive
> iterable, and then, for each node in your document, you are running
> recursively over its entire subtree. Meaning that you'll visit each node as
> many times as its depth in the tree.
>
>
>> My problem is that I don't know when I'm done with a node and I should
>> remove a level of nesting. I would think this is a fairly common
>> situation, but I could not find any examples of parsing a file like
>> this. Perhaps I'm going about it completely wrong.
>
> Your recursive traversal function tells you when you're done. If you drop
> the getiterator() bit, reaching the end of parseChild() means that you're
> done with the element and start backing up. So you can simply pass down a
> list of element names that you append() at the beginning of the function
> and pop() at the end, i.e. a stack. That list will then always give you the
> current path from the root node.

Thanks for the reply. How can I remove getiterator()? Then I won't be
traversing the nodes of the tree. I can't iterate over tree. I am also
unclear on where to do the pop(). I tried putting it just after the
recursive call to parseChild() and I tried putting it as the very last
statement in parseChild() - neither one gave the desired result. Can
you show me in code what you mean?

Thanks!
-larry

>
> Alternatively, if you want to use lxml.etree instead of ElementTree, you
> can use its iterwalk() function, which gives you the same thing but
> without recursion, as a plain iterator.
>
> http://lxml.de/parsing.html#iterparse-and-iterwalk

#60512

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2013-11-26 14:20 +0100
Message-ID	<mailman.3238.1385472043.18130.python-list@python.org>
In reply to	#60456

Larry Martell, 26.11.2013 13:23:
> On Tue, Nov 26, 2013 at 2:38 AM, Stefan Behnel wrote:
>> Larry.Martell...@gmail.com, 25.11.2013 23:22:
>>> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>>>
>>> <Node Name="A">
>>>    <Node Name="B">
>>>       <Node Name="C">
>>>         <Node Name="D">
>>>           <Node Name="E">
>>>
>>> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
>>>
>>> nodes = []
>>>
>>> def parseChild(c):
>>>     if c.tag == 'Node':
>>>         if 'Name' in c.attrib:
>>>             nodes.append(c.attrib['Name'])
>>>         for c1 in c:
>>>             parseChild(c1)
>>>     else:
>>>         for node in nodes:
>>>             print node,
>>>         print c.tag
>>>
>>> for parent in tree.getiterator():
>>>     for child in parent:
>>>         for x in child:
>>>             parseChild(x)
>>
>> This seems hugely redundant. tree.getiterator() already returns a recursive
>> iterable, and then, for each node in your document, you are running
>> recursively over its entire subtree. Meaning that you'll visit each node as
>> many times as its depth in the tree.
>>
>>
>>> My problem is that I don't know when I'm done with a node and I should
>>> remove a level of nesting. I would think this is a fairly common
>>> situation, but I could not find any examples of parsing a file like
>>> this. Perhaps I'm going about it completely wrong.
>>
>> Your recursive traversal function tells you when you're done. If you drop
>> the getiterator() bit, reaching the end of parseChild() means that you're
>> done with the element and start backing up. So you can simply pass down a
>> list of element names that you append() at the beginning of the function
>> and pop() at the end, i.e. a stack. That list will then always give you the
>> current path from the root node.
> 
> Thanks for the reply. How can I remove getiterator()? Then I won't be
> traversing the nodes of the tree. I can't iterate over tree. I am also
> unclear on where to do the pop(). I tried putting it just after the
> recursive call to parseChild() and I tried putting it as the very last
> statement in parseChild() - neither one gave the desired result. Can
> you show me in code what you mean?

untested:

  nodes = []

  def process_subtree(c, path):
      name = c.get('Name') if c.tag == 'Node' else None
      if name:
          path.append(name)
          nodes.append('/'.join(path))

      for c1 in c:
          process_subtree(c1, path)

      if name:
          path.pop()

  process_subtree(tree.getroot(), [])


Stefan
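
A self-contained way to try the untested snippet above: the sketch below keeps the same append-at-entry, pop-at-exit logic but collects results in a list passed as a parameter instead of a module-level one. The sample document (a Root wrapper, a Value element and sibling Nodes) and the out parameter are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a non-Node wrapper and two sibling Nodes under A.
XML = """<Root>
  <Node Name="A">
    <Node Name="B"><Value/></Node>
    <Node Name="C"/>
  </Node>
</Root>"""

def process_subtree(elem, path, out):
    name = elem.get("Name") if elem.tag == "Node" else None
    if name:
        path.append(name)            # entering a named Node
        out.append("/".join(path))   # record the full path to this Node

    for child in elem:
        process_subtree(child, path, out)

    if name:
        path.pop()                   # leaving the Node: remove its level

results = []
process_subtree(ET.fromstring(XML), [], results)
print(results)  # ['A', 'A/B', 'A/C']
```

The last entry is the interesting one: C is reported as A/C and not A/B/C, which shows that B was popped once its subtree had been processed.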

#60521

From	Neil Cerutti <mr.cerutti@gmail.com>
Date	2013-11-26 10:27 -0500
Message-ID	<mailman.3246.1385479623.18130.python-list@python.org>
In reply to	#60456

On Mon, Nov 25, 2013 at 5:22 PM, Larry.Martell@gmail.com
<Larry.Martell@gmail.com> wrote:
> I have an XML file that has an element called "Node". These can
> be nested to any depth and the depth of the nesting is not
> known to me. I need to parse the file and preserve the nesting.
> For example, if the XML file had:
>
> <Node Name="A">
>    <Node Name="B">
>       <Node Name="C">
>         <Node Name="D">
>           <Node Name="E">
>
> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E.
> Problem is I don't know how deep this can be. This is the code
> I have so far:

I'm also an ElementTree user, but it's fairly heavy-duty for simple
jobs. I use sax for those. In fact, I'm kind of a saxophone.
This is basically the same idea as others have posted.

the_xml = """<?xml version="1.0" encoding="ISO-8859-1"?>
<Node Name="A">
   <Node Name="B">
      <Node Name="C">
        <Node Name="D">
          <Node Name="E">
          </Node></Node></Node></Node></Node>"""
import io
import sys
import xml.sax as sax


class NodeHandler(sax.handler.ContentHandler):
    def startDocument(self):
        self.title = ''
        self.names = []

    def startElement(self, name, attrs):
        self.process(attrs['Name'])
        self.names.append(attrs['Name'])

    def process(self, name):
        print("Node {} Nest {}".format(name, '/'.join(self.names)))
        # Do your stuff.

    def endElement(self, name):
        self.names.pop()


print(sys.version_info)
handler = NodeHandler()
sax.parse(io.StringIO(the_xml), handler)  # parse() returns None, so there is nothing to assign

Output:
sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
Node A Nest
Node B Nest A
Node C Nest A/B
Node D Nest A/B/C
Node E Nest A/B/C/D

-- 
Neil Cerutti
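
The handler above assumes every element carries a Name attribute; attrs['Name'] raises KeyError otherwise. A variant for mixed documents might look like the sketch below. The Root and Info elements, the NodePathHandler class and its pushed/paths attributes are assumptions made for illustration, and unlike the output above, each recorded path includes the node itself.

```python
import xml.sax

# Hypothetical sample mixing named Node elements with other elements.
XML = '<Root><Node Name="A"><Info/><Node Name="B"/></Node></Root>'

class NodePathHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []    # Names of the currently open Node elements
        self.pushed = []   # one flag per open element: did it push a name?
        self.paths = []    # collected 'A/B/...' paths

    def startElement(self, name, attrs):
        node_name = attrs.get("Name") if name == "Node" else None
        if node_name is not None:
            self.names.append(node_name)
            self.paths.append("/".join(self.names))
        self.pushed.append(node_name is not None)

    def endElement(self, name):
        # Pop a name only if the matching start tag pushed one.
        if self.pushed.pop():
            self.names.pop()

handler = NodePathHandler()
xml.sax.parseString(XML.encode("utf-8"), handler)
print(handler.paths)  # ['A', 'A/B']
```

Keeping a parallel stack of flags means endElement() never has to re-inspect the attributes, which SAX does not pass to it anyway.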

#60613

From	Larry Martell <larry.martell@gmail.com>
Date	2013-11-27 09:58 -0500
Message-ID	<mailman.3302.1385564319.18130.python-list@python.org>
In reply to	#60456

On Tue, Nov 26, 2013 at 8:20 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
> Larry Martell, 26.11.2013 13:23:
>> On Tue, Nov 26, 2013 at 2:38 AM, Stefan Behnel wrote:
>>> Larry.Martell...@gmail.com, 25.11.2013 23:22:
>>>> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>>>>
>>>> <Node Name="A">
>>>>    <Node Name="B">
>>>>       <Node Name="C">
>>>>         <Node Name="D">
>>>>           <Node Name="E">
>>>>
>>>> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
>>>>
>>>> nodes = []
>>>>
>>>> def parseChild(c):
>>>>     if c.tag == 'Node':
>>>>         if 'Name' in c.attrib:
>>>>             nodes.append(c.attrib['Name'])
>>>>         for c1 in c:
>>>>             parseChild(c1)
>>>>     else:
>>>>         for node in nodes:
>>>>             print node,
>>>>         print c.tag
>>>>
>>>> for parent in tree.getiterator():
>>>>     for child in parent:
>>>>         for x in child:
>>>>             parseChild(x)
>>>
>>> This seems hugely redundant. tree.getiterator() already returns a recursive
>>> iterable, and then, for each node in your document, you are running
>>> recursively over its entire subtree. Meaning that you'll visit each node as
>>> many times as its depth in the tree.
>>>
>>>
>>>> My problem is that I don't know when I'm done with a node and I should
>>>> remove a level of nesting. I would think this is a fairly common
>>>> situation, but I could not find any examples of parsing a file like
>>>> this. Perhaps I'm going about it completely wrong.
>>>
>>> Your recursive traversal function tells you when you're done. If you drop
>>> the getiterator() bit, reaching the end of parseChild() means that you're
>>> done with the element and start backing up. So you can simply pass down a
>>> list of element names that you append() at the beginning of the function
>>> and pop() at the end, i.e. a stack. That list will then always give you the
>>> current path from the root node.
>>
>> Thanks for the reply. How can I remove getiterator()? Then I won't be
>> traversing the nodes of the tree. I can't iterate over tree. I am also
>> unclear on where to do the pop(). I tried putting it just after the
>> recursive call to parseChild() and I tried putting it as the very last
>> statement in parseChild() - neither one gave the desired result. Can
>> you show me in code what you mean?
>
> untested:
>
>   nodes = []
>
>   def process_subtree(c, path):
>       name = c.get('Name') if c.tag == 'Node' else None
>       if name:
>           path.append(name)
>           nodes.append('/'.join(path))
>
>       for c1 in c:
>           process_subtree(c1, path)
>
>       if name:
>           path.pop()
>
>   process_subtree(tree.getroot(), [])

Thanks! This was extremely helpful and I've used these concepts to
write a script that successfully parses my file.
