Re: XML/XHTML/HTML differences, bugs... and howto

From Stefan Behnel <stefan_ml@behnel.de>
Subject Re: XML/XHTML/HTML differences, bugs... and howto
Date 2013-01-24 07:42 +0100
References <5100002F.7020809@r3dsolutions.com>
Newsgroups comp.lang.python
Message-ID <mailman.945.1359009757.2939.python-list@python.org>


Andrew Robinson, 23.01.2013 16:22:
> Good day :),
> 
> I've been exploring XML parsers in python; particularly:
> xml.etree.cElementTree; and I'm trying to figure out how to do it
> incrementally, for very large XML files -- although I don't think the
> problems are restricted to incremental parsing.
> 
> First problem:
> I've come across an issue where etree silently drops text without telling
> me; and separate.
> 
> I am under the impression that XHTML is a subset of XML (eg:defined tags),
> and that once an HTML file is converted to XHTML, the body of the document
> can be handled entirely as XML.
> 
> If I convert a (partial/contrived) html file like:
> 
> <html>
>     <div>
>         <p> This is example <b>bold</b> text.
>     </div>
> </html>
> 
> to XHTML, I might do --right or wrong-- (1):
> 
> <html>
>     <div>
>         <p /> This is example <b>bold</b> text.
>     </div>
> </html>
> 
> or, alternate difference: (2): "<p> This is example <b>bold</b> text. </p>"
> 
> But, when I parse with etree,  in example (1) both "This is an example" and
> "text." are dropped;
> The missing text is part of the start, or end event tags, in the
> incrementally parsed method.
> 
> Likewise: In example (2), only "text" gets dropped.

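Nope, nothing gets dropped. ElementTree stores the text that follows an
element's end tag in that element's .tail attribute, and only the text
before the first child goes into .text. You should read the manual on
this. Here's a tutorial: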

http://lxml.de/tutorial.html#elements-contain-text

This is using lxml.etree, which is the Python XML library most people use
these days. It's ElementTree compatible, so the tutorial also works for ET
(unless stated otherwise).
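
As a minimal sketch of what that tutorial section describes, here are both
examples from the question run through the stdlib xml.etree.ElementTree.
The variable names are only illustrative; the point is where each piece of
text ends up.

```python
import xml.etree.ElementTree as ET

# Example (2) from the question: <p> wraps the whole sentence.
p = ET.fromstring("<p> This is example <b>bold</b> text. </p>")
b = p.find("b")
print(repr(p.text))  # text between <p> and <b>
print(repr(b.text))  # text inside <b>
print(repr(b.tail))  # text between </b> and </p>

# Example (1): <p /> is empty, so the sentence lands in tail attributes.
div = ET.fromstring("<div><p /> This is example <b>bold</b> text.</div>")
p1, b1 = div.find("p"), div.find("b")
print(repr(p1.tail))  # text between <p /> and <b>
print(repr(b1.tail))  # text between </b> and </div>

# itertext() yields every text and tail inside an element, in document order.
sentence = "".join(div.itertext())
print(repr(sentence))
```

In an incremental parse the same attributes apply, but the tail of an
element may not be filled in yet when its "end" event is delivered.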


> Isn't XML supposed to error out when invalid xml is parsed?

It does.
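
A small sketch to illustrate, assuming the stdlib parser: the original HTML
from the question, with its unclosed <p>, is rejected, while conversion (1)
is well-formed and therefore parses without complaint.

```python
import xml.etree.ElementTree as ET

# The HTML from the question: <p> is opened but never closed.
broken = "<html><div><p> This is example <b>bold</b> text.</div></html>"

try:
    ET.fromstring(broken)
    error = None
except ET.ParseError as exc:
    error = exc

print(error)  # the parser reports where the tags stop matching

# Conversion (1) is well-formed XML, so no error is raised.
ok = ET.fromstring(
    "<html><div><p /> This is example <b>bold</b> text.</div></html>"
)
print(ok.tag)
```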


> I have an XML file which will grow larger than memory on a target machine,
> so here's what I want to do:
> 
> Given a source XML file, and a destination file:
> 1) iteratively scan part of the source tree.
> 2) Optionally Modify some of scanned tree.
> 3) Write partial scan/tree out to the destination file.
> 4) Free memory of no-longer needed (partial) source XML.
> 5) continue scanning a new section of the source file... eg: goto step 1
> until source file is exhausted.
> 
> But, I don't see a way to write portions of an XML tree, or iteratively
> write a tree to disk.
> How can this be done?

There are several ways to do it. Python has a couple of external libraries
available that are made specifically for generating markup incrementally.
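
Without any external library, the five steps from the question can be
roughly sketched with the stdlib alone. This is only an illustration under
stated assumptions: stream_copy, record_tag and modify are hypothetical
names, the records are direct children of the root, no namespaces are
involved, and the root's own attributes and text are not copied.

```python
import io
import xml.etree.ElementTree as ET


def stream_copy(source, dest, record_tag, modify=None):
    """Copy source to dest one record_tag element at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)              # first event: start of the root
    dest.write("<%s>" % root.tag)        # sketch: root attributes are dropped
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            if modify is not None:
                modify(elem)             # step 2: optional modification
            elem.tail = None             # tail may be incomplete at "end"
            dest.write(ET.tostring(elem, encoding="unicode"))  # step 3
            root.clear()                 # step 4: free what was written
    dest.write("</%s>" % root.tag)


# Usage with in-memory files; real code would pass open file objects.
src = io.BytesIO(b"<log><entry id='1'>a</entry><entry id='2'>b</entry></log>")
out = io.StringIO()
stream_copy(src, out, "entry", modify=lambda e: e.set("seen", "yes"))
print(out.getvalue())
```

Memory use then depends on the size of one record rather than on the size
of the whole file.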

lxml also gained that feature recently. It's not documented yet, but here
are usage examples:

https://github.com/lxml/lxml/blob/master/src/lxml/tests/test_incremental_xmlfile.py

Stefan
