From: Burak Arslan
Date: Fri, 09 May 2014 19:52:47 +0300
To: Stefan Behnel, python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: parsing multiple root element XML into text

On 05/09/14 16:55, Stefan Behnel wrote:
> ElementTree has gained a nice API in
> Py3.4 that supports this in a much saner way than SAX, using iterators.
> Basically, you just dump in some data that you received and get back an
> iterator over the elements (and their subtrees) that it generated from it.
> Intercept on the right top elements and you get your next subtree as soon
> as it's ready.
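
For illustration, the approach described in the quote can be sketched with the standard library's `xml.etree.ElementTree.XMLPullParser` (available since Python 3.4). This is only a sketch under assumptions: the helper name `iter_top_level` and the synthetic `<wrapper>` root are hypothetical, introduced so that the parser sees a single document even though the input has several root elements.

```python
from xml.etree.ElementTree import XMLPullParser


def iter_top_level(chunks):
    """Yield each top-level element of a multi-root XML stream once it closes."""
    parser = XMLPullParser(events=("start", "end"))
    depth = 0

    def completed():
        # Drain the events generated so far and track nesting depth.
        nonlocal depth
        for event, element in parser.read_events():
            depth += 1 if event == "start" else -1
            # Depth 1 after an "end" event means a direct child of the
            # synthetic wrapper has just been closed.
            if event == "end" and depth == 1:
                yield element

    parser.feed("<wrapper>")   # hypothetical synthetic root (an assumption)
    for chunk in chunks:
        parser.feed(chunk)
        yield from completed()
    parser.feed("</wrapper>")
    parser.close()             # flush anything the parser still buffers
    yield from completed()


# Chunk boundaries may fall anywhere, even in the middle of a tag.
chunks = ["<a><x>1</x></a><b", ">2</b>"]
tags = [element.tag for element in iter_top_level(chunks)]
print(tags)
```

The wrapper trick assumes the stream contains no XML declaration or DOCTYPE between the root elements; those would not be valid inside the synthetic root.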


Hi Stefan,

Here's a small script:
    events = etree.iterparse(istr, events=("start", "end"))
    stack = deque()
    for event, element in events:
        if event == "start":
            stack.append(element)
        elif event == "end":
            stack.pop()

        if len(stack) == 0:
            break

        print(istr.tell(), "%5s, %4s, %s" % (event, element.tag, element.text))
where istr is an input stream. (Fully working example: https://gist.github.com/plq/025005a71e8135c46800)

I was expecting to have istr.tell() return the position where the first root element ends, which would make it possible to continue parsing with another call to etree.iterparse(). But istr.tell() returns the position of EOF after the first call to next() on the iterator it returns. Without the stack check, the loop eventually throws an exception and the offset value in that exception is None.
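
The EOF position can be reproduced with a small, self-contained sketch. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than lxml, and it assumes the cause is the same in both: the iterator reads the stream in large chunks, so `tell()` reports how much has been read into the parser's buffer, not how far parsing has progressed. The sample document is made up for the demonstration.

```python
import io
from xml.etree.ElementTree import iterparse

# A short, single-root document; it is much smaller than one read chunk.
data = b"<a><x>1</x></a>"
istr = io.BytesIO(data)

events = iterparse(istr, events=("start", "end"))
first_event, first_element = next(events)

# The whole input was consumed to produce the very first event.
position_after_first_event = istr.tell()
print(first_event, first_element.tag, position_after_first_event, len(data))
```

If that assumption holds, the stream position cannot be used to find where the first root element ends, and the split between root elements has to be tracked some other way.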

So I'm lost here: how would it be possible to parse the OP's document with lxml?

Best,
Burak