Groups > comp.lang.python > #60456 > unrolled thread
| Started by | "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> |
|---|---|
| First post | 2013-11-25 14:22 -0800 |
| Last post | 2013-11-27 09:58 -0500 |
| Articles | 7 — 5 participants |
parsing nested unbounded XML fields with ElementTree "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2013-11-25 14:22 -0800
Re: parsing nested unbounded XML fields with ElementTree Chris Angelico <rosuav@gmail.com> - 2013-11-26 09:30 +1100
Re: parsing nested unbounded XML fields with ElementTree Stefan Behnel <stefan_ml@behnel.de> - 2013-11-26 08:38 +0100
Re: parsing nested unbounded XML fields with ElementTree Larry Martell <larry.martell@gmail.com> - 2013-11-26 07:23 -0500
Re: parsing nested unbounded XML fields with ElementTree Stefan Behnel <stefan_ml@behnel.de> - 2013-11-26 14:20 +0100
Re: parsing nested unbounded XML fields with ElementTree Neil Cerutti <mr.cerutti@gmail.com> - 2013-11-26 10:27 -0500
Re: parsing nested unbounded XML fields with ElementTree Larry Martell <larry.martell@gmail.com> - 2013-11-27 09:58 -0500
| From | "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> |
|---|---|
| Date | 2013-11-25 14:22 -0800 |
| Subject | parsing nested unbounded XML fields with ElementTree |
| Message-ID | <d75d1f2c-05c6-4fa6-ae0a-28e13de3097a@googlegroups.com> |
I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
<Node Name="A">
<Node Name="B">
<Node Name="C">
<Node Name="D">
<Node Name="E">
When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
nodes = []

def parseChild(c):
    if c.tag == 'Node':
        if 'Name' in c.attrib:
            nodes.append(c.attrib['Name'])
        for c1 in c:
            parseChild(c1)
    else:
        for node in nodes:
            print node,
        print c.tag

for parent in tree.getiterator():
    for child in parent:
        for x in child:
            parseChild(x)
My problem is that I don't know when I'm done with a node and I should remove a level of nesting. I would think this is a fairly common situation, but I could not find any examples of parsing a file like this. Perhaps I'm going about it completely wrong.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-26 09:30 +1100 |
| Message-ID | <mailman.3197.1385418647.18130.python-list@python.org> |
| In reply to | #60456 |
On Tue, Nov 26, 2013 at 9:22 AM, Larry.Martell@gmail.com <Larry.Martell@gmail.com> wrote:
> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>
> <Node Name="A">
> <Node Name="B">
> <Node Name="C">
> <Node Name="D">
> <Node Name="E">

First off, please clarify: Are there five corresponding </Node> tags later on? If not, it's not XML, and nesting will have to be defined some other way.

Secondly, please get off Google Groups. Your initial post is malformed, and unless you specifically fight the software, your replies will be even more malformed, to the point of being quite annoying. There are many other ways to read a newsgroup, or you can subscribe to the mailing list python-list@python.org, which carries the same content.

ChrisA
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2013-11-26 08:38 +0100 |
| Message-ID | <mailman.3218.1385451509.18130.python-list@python.org> |
| In reply to | #60456 |
Larry.Martell...@gmail.com, 25.11.2013 23:22:
> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>
> <Node Name="A">
> <Node Name="B">
> <Node Name="C">
> <Node Name="D">
> <Node Name="E">
>
> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
>
> nodes = []
>
> def parseChild(c):
>     if c.tag == 'Node':
>         if 'Name' in c.attrib:
>             nodes.append(c.attrib['Name'])
>         for c1 in c:
>             parseChild(c1)
>     else:
>         for node in nodes:
>             print node,
>         print c.tag
>
> for parent in tree.getiterator():
>     for child in parent:
>         for x in child:
>             parseChild(x)

This seems hugely redundant. tree.getiterator() already returns a recursive iterable, and then, for each node in your document, you are running recursively over its entire subtree. Meaning that you'll visit each node as many times as its depth in the tree.

> My problem is that I don't know when I'm done with a node and I should
> remove a level of nesting. I would think this is a fairly common
> situation, but I could not find any examples of parsing a file like
> this. Perhaps I'm going about it completely wrong.

Your recursive traversal function tells you when you're done. If you drop the getiterator() bit, reaching the end of parseChild() means that you're done with the element and start backing up. So you can simply pass down a list of element names that you append() at the beginning of the function and pop() at the end, i.e. a stack. That list will then always give you the current path from the root node.

Alternatively, if you want to use lxml.etree instead of ElementTree, you can use its iterwalk() function, which gives you the same thing but without recursion, as a plain iterator.

http://lxml.de/parsing.html#iterparse-and-iterwalk

Stefan
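
For readers who would rather stay with the standard library, a similar non-recursive walk can be sketched with xml.etree.ElementTree.iterparse() and its "start" and "end" events. Note that this swaps in iterparse() for the lxml iterwalk() function mentioned in the post; the sample document (including its closing tags) and the helper name node_paths are assumptions made purely for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the closing tags are assumed.
SAMPLE = b"""<Root>
  <Node Name="A">
    <Node Name="B">
      <Leaf/>
    </Node>
    <Node Name="C"/>
  </Node>
</Root>"""

def node_paths(source):
    """Yield the slash-joined path of every named Node, in document order."""
    path = []  # stack of Name attributes for the currently open Nodes
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != "Node" or elem.get("Name") is None:
            continue  # only named Node elements take part in the path
        if event == "start":
            path.append(elem.get("Name"))   # entering the Node
            yield "/".join(path)
        else:
            path.pop()                      # leaving the Node

paths = list(node_paths(io.BytesIO(SAMPLE)))
print(paths)  # ['A', 'A/B', 'A/C']
```

The "end" event plays the role that returning from the recursive function plays in the stack approach: it is the point where one level of nesting is removed.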
| From | Larry Martell <larry.martell@gmail.com> |
|---|---|
| Date | 2013-11-26 07:23 -0500 |
| Message-ID | <mailman.3232.1385468604.18130.python-list@python.org> |
| In reply to | #60456 |
On Tue, Nov 26, 2013 at 2:38 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
> Larry.Martell...@gmail.com, 25.11.2013 23:22:
>> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>>
>> <Node Name="A">
>> <Node Name="B">
>> <Node Name="C">
>> <Node Name="D">
>> <Node Name="E">
>>
>> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
>>
>> nodes = []
>>
>> def parseChild(c):
>>     if c.tag == 'Node':
>>         if 'Name' in c.attrib:
>>             nodes.append(c.attrib['Name'])
>>         for c1 in c:
>>             parseChild(c1)
>>     else:
>>         for node in nodes:
>>             print node,
>>         print c.tag
>>
>> for parent in tree.getiterator():
>>     for child in parent:
>>         for x in child:
>>             parseChild(x)
>
> This seems hugely redundant. tree.getiterator() already returns a recursive
> iterable, and then, for each node in your document, you are running
> recursively over its entire subtree. Meaning that you'll visit each node as
> many times as its depth in the tree.
>
>> My problem is that I don't know when I'm done with a node and I should
>> remove a level of nesting. I would think this is a fairly common
>> situation, but I could not find any examples of parsing a file like
>> this. Perhaps I'm going about it completely wrong.
>
> Your recursive traversal function tells you when you're done. If you drop
> the getiterator() bit, reaching the end of parseChild() means that you're
> done with the element and start backing up. So you can simply pass down a
> list of element names that you append() at the beginning of the function
> and pop() at the end, i.e. a stack. That list will then always give you the
> current path from the root node.

Thanks for the reply. How can I remove getiterator()? Then I won't be traversing the nodes of the tree. I can't iterate over tree. I am also unclear on where to do the pop(). I tried putting it just after the recursive call to parseChild() and I tried putting it as the very last statement in parseChild() - neither one gave the desired result. Can you show me in code what you mean?

Thanks!
-larry

> Alternatively, if you want to use lxml.etree instead of ElementTree, you
> can use its iterwalk() function, which gives you the same thing but
> without recursion, as a plain iterator.
>
> http://lxml.de/parsing.html#iterparse-and-iterwalk
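
On the question of traversing without getiterator(): an ElementTree object is not iterable itself, but its root element is, and iterating an element yields only its direct children. The following is a minimal sketch of that difference, assuming a small hypothetical document with closing tags; it uses Element.iter(), the method that replaced the deprecated getiterator().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the closing tags are assumed.
tree = ET.ElementTree(ET.fromstring(
    '<Node Name="A"><Node Name="B"><Node Name="C"/></Node></Node>'
))

root = tree.getroot()

# Iterating an Element yields only its direct children.
direct = [child.get("Name") for child in root]

# Element.iter() walks the element itself and all descendants
# in document order, which is what getiterator() used to do.
everything = [elem.get("Name") for elem in root.iter()]

print(direct)      # ['B']
print(everything)  # ['A', 'B', 'C']
```

Because plain iteration stops at the direct children, a recursive function that starts at the root and calls itself for each child visits every element exactly once.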
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2013-11-26 14:20 +0100 |
| Message-ID | <mailman.3238.1385472043.18130.python-list@python.org> |
| In reply to | #60456 |
Larry Martell, 26.11.2013 13:23:
> On Tue, Nov 26, 2013 at 2:38 AM, Stefan Behnel wrote:
>> Larry.Martell...@gmail.com, 25.11.2013 23:22:
>>> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>>>
>>> <Node Name="A">
>>> <Node Name="B">
>>> <Node Name="C">
>>> <Node Name="D">
>>> <Node Name="E">
>>>
>>> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
>>>
>>> nodes = []
>>>
>>> def parseChild(c):
>>>     if c.tag == 'Node':
>>>         if 'Name' in c.attrib:
>>>             nodes.append(c.attrib['Name'])
>>>         for c1 in c:
>>>             parseChild(c1)
>>>     else:
>>>         for node in nodes:
>>>             print node,
>>>         print c.tag
>>>
>>> for parent in tree.getiterator():
>>>     for child in parent:
>>>         for x in child:
>>>             parseChild(x)
>>
>> This seems hugely redundant. tree.getiterator() already returns a recursive
>> iterable, and then, for each node in your document, you are running
>> recursively over its entire subtree. Meaning that you'll visit each node as
>> many times as its depth in the tree.
>>
>>
>>> My problem is that I don't know when I'm done with a node and I should
>>> remove a level of nesting. I would think this is a fairly common
>>> situation, but I could not find any examples of parsing a file like
>>> this. Perhaps I'm going about it completely wrong.
>>
>> Your recursive traversal function tells you when you're done. If you drop
>> the getiterator() bit, reaching the end of parseChild() means that you're
>> done with the element and start backing up. So you can simply pass down a
>> list of element names that you append() at the beginning of the function
>> and pop() at the end, i.e. a stack. That list will then always give you the
>> current path from the root node.
>
> Thanks for the reply. How can I remove getiterator()? Then I won't be
> traversing the nodes of the tree. I can't iterate over tree. I am also
> unclear on where to do the pop(). I tried putting it just after the
> recursive call to parseChild() and I tried putting it as the very last
> statement in parseChild() - neither one gave the desired result. Can
> you show me in code what you mean?
untested:

nodes = []

def process_subtree(c, path):
    name = c.get('Name') if c.tag == 'Node' else None
    if name:
        path.append(name)
        nodes.append('/'.join(path))

    for c1 in c:
        process_subtree(c1, path)

    if name:
        path.pop()

process_subtree(tree.getroot(), [])

Stefan
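
The untested outline above can be turned into a small self-contained script. This is only a sketch for Python 3: the sample XML (including its closing tags and the extra sibling F) and the names collect_paths and out are assumptions for illustration, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; closing tags and the sibling "F" are assumed.
SAMPLE = """<Node Name="A">
  <Node Name="B">
    <Node Name="C"/>
  </Node>
  <Node Name="F"/>
</Node>"""

def collect_paths(elem, path, out):
    """Append the slash-joined path of every named Node under elem to out."""
    name = elem.get("Name") if elem.tag == "Node" else None
    if name:
        path.append(name)            # entering this Node: push its name
        out.append("/".join(path))
    for child in elem:               # direct children only; recursion does the rest
        collect_paths(child, path, out)
    if name:
        path.pop()                   # leaving this Node: remove one nesting level

paths = []
collect_paths(ET.fromstring(SAMPLE), [], paths)
print(paths)  # ['A', 'A/B', 'A/B/C', 'A/F']
```

The path A/F shows the pop() at work: by the time F is reached, B and C have already been removed from the stack.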
| From | Neil Cerutti <mr.cerutti@gmail.com> |
|---|---|
| Date | 2013-11-26 10:27 -0500 |
| Message-ID | <mailman.3246.1385479623.18130.python-list@python.org> |
| In reply to | #60456 |
On Mon, Nov 25, 2013 at 5:22 PM, Larry.Martell@gmail.com
<Larry.Martell@gmail.com> wrote:
> I have an XML file that has an element called "Node". These can
> be nested to any depth and the depth of the nesting is not
> known to me. I need to parse the file and preserve the nesting.
> For example, if the XML file had:
>
> <Node Name="A">
> <Node Name="B">
> <Node Name="C">
> <Node Name="D">
> <Node Name="E">
>
> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E.
> Problem is I don't know how deep this can be. This is the code
> I have so far:
I'm also an ElementTree user, but it's fairly heavy-duty for simple
jobs. I use sax for those. In fact, I'm kind of a saxophone.
This is basically the same idea as others have posted.
the_xml = """<?xml version="1.0" encoding="ISO-8859-1"?>
<Node Name="A">
<Node Name="B">
<Node Name="C">
<Node Name="D">
<Node Name="E">
</Node></Node></Node></Node></Node>"""
import io
import sys
import xml.sax as sax

class NodeHandler(sax.handler.ContentHandler):
    def startDocument(self):
        self.title = ''
        self.names = []

    def startElement(self, name, attrs):
        self.process(attrs['Name'])
        self.names.append(attrs['Name'])

    def process(self, name):
        print("Node {} Nest {}".format(name, '/'.join(self.names)))
        # Do your stuff.

    def endElement(self, name):
        self.names.pop()

print(sys.version_info)
handler = NodeHandler()
parser = sax.parse(io.StringIO(the_xml), handler)

Output:
sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
Node A Nest
Node B Nest A
Node C Nest A/B
Node D Nest A/B/C
Node E Nest A/B/C/D
--
Neil Cerutti
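
A variant of the same SAX idea, sketched under a couple of assumptions: it pushes the name before recording, so the path includes the current node (A/B for B rather than just A), and it tolerates elements that have no Name attribute. The sample document, the <Item> element, and the class name PathHandler are hypothetical.

```python
import xml.sax

# Hypothetical sample; the <Item> element without a Name is assumed.
SAMPLE = b"""<Node Name="A">
  <Node Name="B">
    <Item/>
  </Node>
</Node>"""

class PathHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []   # one entry per open element: its Name, or None
        self.paths = []   # full path recorded for each named Node

    def startElement(self, name, attrs):
        node_name = attrs.get("Name") if name == "Node" else None
        self.stack.append(node_name)  # push for every element so pops stay balanced
        if node_name is not None:
            self.paths.append("/".join(n for n in self.stack if n is not None))

    def endElement(self, name):
        self.stack.pop()

handler = PathHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.paths)  # ['A', 'A/B']
```

Pushing a placeholder for unnamed elements keeps every endElement() call paired with exactly one startElement() push, so the stack cannot drift when other tags are mixed in.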
| From | Larry Martell <larry.martell@gmail.com> |
|---|---|
| Date | 2013-11-27 09:58 -0500 |
| Message-ID | <mailman.3302.1385564319.18130.python-list@python.org> |
| In reply to | #60456 |
On Tue, Nov 26, 2013 at 8:20 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
> Larry Martell, 26.11.2013 13:23:
>> On Tue, Nov 26, 2013 at 2:38 AM, Stefan Behnel wrote:
>>> Larry.Martell...@gmail.com, 25.11.2013 23:22:
>>>> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
>>>>
>>>> <Node Name="A">
>>>> <Node Name="B">
>>>> <Node Name="C">
>>>> <Node Name="D">
>>>> <Node Name="E">
>>>>
>>>> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
>>>>
>>>> nodes = []
>>>>
>>>> def parseChild(c):
>>>>     if c.tag == 'Node':
>>>>         if 'Name' in c.attrib:
>>>>             nodes.append(c.attrib['Name'])
>>>>         for c1 in c:
>>>>             parseChild(c1)
>>>>     else:
>>>>         for node in nodes:
>>>>             print node,
>>>>         print c.tag
>>>>
>>>> for parent in tree.getiterator():
>>>>     for child in parent:
>>>>         for x in child:
>>>>             parseChild(x)
>>>
>>> This seems hugely redundant. tree.getiterator() already returns a recursive
>>> iterable, and then, for each node in your document, you are running
>>> recursively over its entire subtree. Meaning that you'll visit each node as
>>> many times as its depth in the tree.
>>>
>>>
>>>> My problem is that I don't know when I'm done with a node and I should
>>>> remove a level of nesting. I would think this is a fairly common
>>>> situation, but I could not find any examples of parsing a file like
>>>> this. Perhaps I'm going about it completely wrong.
>>>
>>> Your recursive traversal function tells you when you're done. If you drop
>>> the getiterator() bit, reaching the end of parseChild() means that you're
>>> done with the element and start backing up. So you can simply pass down a
>>> list of element names that you append() at the beginning of the function
>>> and pop() at the end, i.e. a stack. That list will then always give you the
>>> current path from the root node.
>>
>> Thanks for the reply. How can I remove getiterator()? Then I won't be
>> traversing the nodes of the tree. I can't iterate over tree. I am also
>> unclear on where to do the pop(). I tried putting it just after the
>> recursive call to parseChild() and I tried putting it as the very last
>> statement in parseChild() - neither one gave the desired result. Can
>> you show me in code what you mean?
>
> untested:
>
> nodes = []
>
> def process_subtree(c, path):
>     name = c.get('Name') if c.tag == 'Node' else None
>     if name:
>         path.append(name)
>         nodes.append('/'.join(path))
>
>     for c1 in c:
>         process_subtree(c1, path)
>
>     if name:
>         path.pop()
>
> process_subtree(tree.getroot(), [])

Thanks! This was extremely helpful and I've used these concepts to
write a script that successfully parses my file.