To: python-list@python.org
From: Stefan Behnel <stefan_ml@behnel.de>
Subject: Re: XML/XHTML/HTML differences, bugs... and howto
Date: Thu, 24 Jan 2013 07:42:22 +0100
References: <5100002F.7020809@r3dsolutions.com>
In-Reply-To: <5100002F.7020809@r3dsolutions.com>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.945.1359009757.2939.python-list@python.org>

Andrew Robinson, 23.01.2013 16:22:
> Good day :),
> 
> I've been exploring XML parsers in python; particularly:
> xml.etree.cElementTree; and I'm trying to figure out how to do it
> incrementally, for very large XML files -- although I don't think the
> problems are restricted to incremental parsing.
> 
> First problem:
> I've come across an issue where etree silently drops text without telling
> me; and separate.
> 
> I am under the impression that XHTML is a subset of XML (e.g. defined tags),
> and that once an HTML file is converted to XHTML, the body of the document
> can be handled entirely as XML.
> 
> If I convert a (partial/contrived) html file like:
> 
> <html>
>     <div>
>         <p> This is example <b>bold</b> text.
>     </div>
> </html>
> 
> to XHTML, I might do --right or wrong-- (1):
> 
> <html>
>     <div>
>         <p /> This is example <b>bold</b> text.
>     </div>
> </html>
> 
> or, alternate difference: (2): "<p> This is example <b>bold</b> text. </p>"
> 
> But, when I parse with etree, in example (1) both "This is example" and
> "text." are dropped;
> The missing text is part of the start, or end event tags, in the
> incrementally parsed method.
> 
> Likewise: In example (2), only "text" gets dropped.

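Nope, nothing is dropped. Text that follows an element's end tag is stored
in that element's .tail attribute, not in the parent's .text, so you have
to look there as well. The manual covers this. Here's a tutorial: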

http://lxml.de/tutorial.html#elements-contain-text

This is using lxml.etree, which is the Python XML library most people use
these days. It's ElementTree compatible, so the tutorial also works for ET
(unless stated otherwise).
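
As a minimal sketch of where the text from your two examples ends up, using
the standard library's xml.etree.ElementTree (the variable names are only
illustrative):

```python
import xml.etree.ElementTree as ET

# Example (1): <p /> is empty, so the text after it is the tail of <p>.
doc1 = "<html><div><p /> This is example <b>bold</b> text.</div></html>"
root1 = ET.fromstring(doc1)
p1 = root1.find("div/p")
b1 = root1.find("div/b")
print(repr(p1.text), repr(p1.tail))   # text inside <p>, text after <p />
print(repr(b1.text), repr(b1.tail))   # text inside <b>, text after </b>

# Example (2): the leading text is p.text, the trailing text is b.tail.
doc2 = "<html><div><p> This is example <b>bold</b> text. </p></div></html>"
root2 = ET.fromstring(doc2)
p2 = root2.find("div/p")
b2 = p2.find("b")
print(repr(p2.text), repr(b2.tail))

# itertext() walks .text and .tail in document order, so nothing is lost.
full_text = "".join(p2.itertext())
print(repr(full_text))
```

With incremental parsing, keep in mind that the tail of an element may not
have been read yet when its "end" event is delivered.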


> Isn't XML supposed to error out when invalid xml is parsed?

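It does. A parser rejects input that is not well-formed. Both of your
converted examples are well-formed, so there is nothing to reject.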
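
A small sketch of that behaviour with the standard library's
xml.etree.ElementTree: the original HTML snippet with its unclosed <p> is
rejected, while the converted version parses. The helper name
is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The HTML from the question: <p> is never closed, so this is not XML.
broken = "<html><div><p> This is example <b>bold</b> text.</div></html>"

# Conversion (2) from the question: every tag is closed.
fixed = "<html><div><p> This is example <b>bold</b> text. </p></div></html>"

def is_well_formed(text):
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed(broken))
print(is_well_formed(fixed))
```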


> I have an XML file which will grow larger than memory on a target machine,
> so here's what I want to do:
> 
> Given a source XML file, and a destination file:
> 1) iteratively scan part of the source tree.
> 2) Optionally modify some of the scanned tree.
> 3) Write partial scan/tree out to the destination file.
> 4) Free memory of no-longer needed (partial) source XML.
> 5) continue scanning a new section of the source file... eg: goto step 1
> until source file is exhausted.
> 
> But, I don't see a way to write portions of an XML tree, or iteratively
> write a tree to disk.
> How can this be done?

There are several ways to do it. Python has a couple of external libraries
available that are made specifically for generating markup incrementally.
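
Without any extra library, the five steps can be sketched roughly like this
with iterparse() from the standard library. It is only an illustration under
some assumptions: the records are direct children of the root element, the
root's attributes and any text between records are not copied, and the names
stream_copy, record_tag and modify are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

def stream_copy(source, dest, record_tag, modify=None):
    """Copy `source` to the text stream `dest`, one record at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)            # first event: start of the root element
    dest.write("<%s>" % root.tag)      # simplification: root attributes dropped
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            if modify is not None:
                modify(elem)           # step 2: optionally change the subtree
            elem.tail = None           # do not copy whitespace between records
            dest.write(ET.tostring(elem, encoding="unicode"))  # step 3
            root.clear()               # step 4: let go of finished records
    dest.write("</%s>" % root.tag)

def mark_seen(elem):
    elem.set("seen", "yes")

src = io.BytesIO(b"<log><entry id='1'>a</entry><entry id='2'>b</entry></log>")
out = io.StringIO()
stream_copy(src, out, "entry", modify=mark_seen)
print(out.getvalue())
```

For real files, pass open file objects instead of the in-memory buffers.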

lxml also gained that feature recently. It's not documented yet, but here
are usage examples:

https://github.com/lxml/lxml/blob/master/src/lxml/tests/test_incremental_xmlfile.py
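
Roughly, usage looks like the following sketch. It assumes an lxml version
that ships etree.xmlfile (3.1 or later); the import is guarded because lxml
is a third-party package. The function name write_items and the element
names are made up for the example.

```python
import io

try:
    from lxml import etree   # third-party; needs a version with etree.xmlfile
except ImportError:
    etree = None

def write_items(out, count):
    """Stream `count` <item> elements into `out` without building a full tree."""
    with etree.xmlfile(out, encoding="utf-8") as xf:
        with xf.element("root"):          # writes <root> now, </root> on exit
            for i in range(count):
                item = etree.Element("item")
                item.text = str(i)
                xf.write(item)            # serialise this subtree right away
                # 'item' is rebound on the next pass, so it can be freed

if etree is not None:
    buf = io.BytesIO()
    write_items(buf, 3)
    print(buf.getvalue().decode("utf-8"))
```

Each subtree is written out as soon as it is complete, so only one <item>
needs to be in memory at a time.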

Stefan