Re: parsing multiple root element XML into text

From Stefan Behnel <stefan_ml@behnel.de>
Subject Re: parsing multiple root element XML into text
Date 2014-05-09 15:55 +0200
References (1 earlier) <87oaz7uvo4.fsf@dpt-info.u-strasbg.fr> <87a9arfdha.fsf@elektro.pacujo.net> <87k39vupnc.fsf@dpt-info.u-strasbg.fr> <8738gjf813.fsf@elektro.pacujo.net> <87y4ybdt46.fsf@elektro.pacujo.net>
Newsgroups comp.lang.python
Message-ID <mailman.9821.1399643762.18130.python-list@python.org>

Marko Rauhamaa, 09.05.2014 14:38:
> Marko Rauhamaa:
>> Alain Ketterlin:
>>> Marko Rauhamaa writes:
>>>> Sometimes the XML elements come through a pipe as an endless
>>>> sequence. You can still use the wrapping technique and a SAX parser.
>>>> However, the other option is to write a tiny XML scanner that
>>>> identifies the end of each element. Then, you can cut out the
>>>> complete XML element and hand it over to a DOM parser.
>>>
>>> Well maybe, even though I see no point in doing so. If the whole
>>> transaction is a single document and you need to get sub-elements on
>>> the fly, just use the SAX parser: there is no need to use a "tiny XML
>>> scanner" (whatever that is), and building a DOM for a part of the
>>> document in your SAX handler is easy if needed (for the OP's case a
>>> simple state machine would be enough, probably).
>>
>> An example is <URL:
>> http://en.wikipedia.org/wiki/XMPP#XMPP_via_HTTP_and_WebSocket_transports>.
>>
>> The "document" is potentially infinitely long. The elements are
>> messages.
>>
>> The programmer would rather process the elements as DOM trees than
>> follow the meandering SAX parser.
> 
> In fact, the best thing would be if the DOM parser supported the use
> case out of the box: give the partial, whole or oversize document to the
> parser. If the document isn't complete, the parser should indicate the
> need for more input. If there are bytes after the document is
> successfully finished, the parser should leave the excess bytes in the
> pipeline.
> 
> IOW, if the DOM parser knows full well where the document ends, why
> must the application tell it so?

In fact, XMPP traffic has a root element. And I agree that a tree is much
easier to handle than SAX events. ElementTree has gained a nice API in
Py3.4 that supports this in a much saner way than SAX, using iterators.
Basically, you just dump in whatever data you have received and get back an
iterator over the elements (and their subtrees) that the parser built from it.
Intercept the right top-level elements and you get your next subtree as soon
as it's ready.

https://docs.python.org/3.4/library/xml.etree.elementtree.html#pull-api-for-non-blocking-parsing
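
To make that concrete, here is a minimal sketch using the stdlib
XMLPullParser. The helper name iter_stanzas, the <stream> root and the
sample chunks are made up for illustration and are not real XMPP traffic;
the depth counter is just one way to pick out the direct children of the
root.

```python
import xml.etree.ElementTree as ET


def iter_stanzas(chunks):
    """Yield every direct child of the root element once it is complete.

    `chunks` is any iterable of str/bytes pieces, split at arbitrary
    positions (hypothetical stand-in for data read from a socket or pipe).
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0  # number of currently open elements

    def completed():
        nonlocal depth
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:  # "end"
                depth -= 1
                if depth == 1:
                    # Only the root is still open, so `elem` is a
                    # finished top-level subtree.
                    yield elem

    for chunk in chunks:
        parser.feed(chunk)
        yield from completed()

    # Only reached when the input ends; an endless stream never gets here.
    parser.close()
    yield from completed()


# Made-up input, deliberately cut in the middle of tags.
chunks = [
    "<stream><message id='1'><bo",
    "dy>hi</body></message><mess",
    "age id='2'/>",
    "<presence/></stream>",
]
stanzas = [(elem.tag, elem.get("id")) for elem in iter_stanzas(chunks)]
print(stanzas)
```

For a long-running connection you would probably also want to detach
processed children from the root so that the tree does not keep growing.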

It's also supported by recent versions of lxml, which additionally has
easy-to-use support for the sending side with its xmlfile() tool.

http://lxml.de/parsing.html#incremental-event-parsing

http://lxml.de/api.html#incremental-xml-generation

Stefan
