Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder3.xlned.com!newsfeed.xs4all.nl!newsfeed1a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Fri, 09 May 2014 19:52:47 +0300
From: Burak Arslan <burak.arslan@arskom.com.tr>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0
MIME-Version: 1.0
To: Stefan Behnel <stefan_ml@behnel.de>, python-list@python.org
Subject: Re: parsing multiple root element XML into text
References: <0e5e9a24-3663-4293-a530-239486cf28fc@googlegroups.com> <87oaz7uvo4.fsf@dpt-info.u-strasbg.fr> <87a9arfdha.fsf@elektro.pacujo.net> <87k39vupnc.fsf@dpt-info.u-strasbg.fr> <8738gjf813.fsf@elektro.pacujo.net> <87y4ybdt46.fsf@elektro.pacujo.net> <lkimp0$428$1@ger.gmane.org>
In-Reply-To: <lkimp0$428$1@ger.gmane.org>
Content-Type: multipart/alternative; boundary="------------060609000001020106000602"
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.9825.1399654376.18130.python-list@python.org>
Lines: 99
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:71181

This is a multi-part message in MIME format.
--------------060609000001020106000602
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit


On 05/09/14 16:55, Stefan Behnel wrote:
> ElementTree has gained a nice API in
> Py3.4 that supports this in a much saner way than SAX, using iterators.
> Basically, you just dump in some data that you received and get back an
> iterator over the elements (and their subtrees) that it generated from it.
> Intercept on the right top elements and you get your next subtree as soon
> as it's ready.


Hi Stefan,

Here's a small script:

    from collections import deque
    from lxml import etree

    events = etree.iterparse(istr, events=("start", "end"))
    stack = deque()  # open elements; empty again once the first root closes
    for event, element in events:
        if event == "start":
            stack.append(element)
        elif event == "end":
            stack.pop()

        if len(stack) == 0:
            break

        print(istr.tell(), "%5s, %4s, %s" % (event, element.tag, element.text))

where istr is an input-stream. (Fully working example:
https://gist.github.com/plq/025005a71e8135c46800)

I expected istr.tell() to return the position where the first root
element ends, which would make it possible to continue parsing with
another call to etree.iterparse(). Instead, istr.tell() already returns
the EOF position after the first call to next() on the iterator that
iterparse() returns. Without the stack check, the loop eventually raises
an exception, and the offset value in that exception is None.
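
My guess at the cause, not something I have confirmed for lxml: iterparse()
pulls data from the stream in large fixed-size chunks, so a short input is
consumed completely before the first event is handed back. The sketch below
shows the effect with the standard library's xml.etree.ElementTree (chosen
only so it runs without extra packages); the sample document is made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input: one small, well-formed document.
data = b"<a><b>1</b></a>"
istr = io.BytesIO(data)

events = ET.iterparse(istr, events=("start", "end"))
event, element = next(events)

# Only the first start tag has been reported, yet the stream position
# already sits at the end of the input, because the parser read one
# whole chunk before producing any event.
position_after_first_event = istr.tell()
print(event, element.tag, position_after_first_event, len(data))
```

If that is what happens, istr.tell() cannot be used to find where one root
element ends and the next begins.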

So I'm lost here: how would it be possible to parse the OP's document with lxml?
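
One workaround that might do, as a rough sketch only: feed a synthetic
wrapper root to a pull parser before the real data, then treat every direct
child of that wrapper as one of the original roots. This uses
xml.etree.ElementTree.XMLPullParser from the Py3.4 standard library, not
lxml. I believe lxml.etree has a similarly named class, but I have not
checked that it behaves the same. The function name iter_top_level, the
"wrapper" tag and the chunk size are all made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_top_level(stream, chunk_size=4096):
    """Yield each top-level element of a multi-root stream once it is complete."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(b"<wrapper>")  # synthetic root; the tag name is arbitrary
    depth = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        for event, element in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 1:  # a direct child of the wrapper just closed
                    yield element
    # Close the wrapper so that truncated input is reported as a ParseError.
    parser.feed(b"</wrapper>")
    parser.close()


# Hypothetical input with two root elements.
sample = io.BytesIO(b"<a><b>1</b></a>\n<c>2</c>")
roots = list(iter_top_level(sample, chunk_size=5))
print([root.tag for root in roots])
```

This would not cope with an XML declaration in front of each root, since a
declaration is only allowed at the very start of the input, and the finished
elements stay attached to the wrapper unless they are cleared by hand.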

Best,
Burak

--------------060609000001020106000602--