From: Roy Smith
Subject: Re: iterating over a file with two pointers
Date: Wed, 18 Sep 2013 10:36:43 -0400
Newsgroups: comp.lang.python

> Dave Angel wrote (and I agreed with):
>> I'd suggest you open the file twice, and get two file objects.  Then you
>> can iterate over them independently.

On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
> There's no need to use OS resources by opening the file twice or to
> screw up the IO caching with seek().

There's no reason NOT to use OS resources.  That's what the OS is there for; to make life easier on application programmers.  Opening a file twice costs almost nothing.  File descriptors are almost as cheap as whitespace.

> Peter's version holds just as many lines as is necessary in an
> internal Python buffer and performs the minimum possible
> amount of IO.

I believe by "Peter's version", you're talking about:

> from itertools import islice, tee
>
> with open("tmp.txt") as f:
>     while True:
>         for outer in f:
>             print outer,
>             if "*" in outer:
>                 f, g = tee(f)
>                 for inner in islice(g, 3):
>                     print "   ", inner,
>                 break
>         else:
>             break

There's this note from http://docs.python.org/2.7/library/itertools.html#itertools.tee:

> This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

I have no idea how that interacts with the pattern above where you call tee() serially.  You're basically doing

with open("my_file") as f:
    while True:
        f, g = tee(f)

Are all of those g's just hanging around, eating up memory, while waiting to be garbage collected?  I have no idea.  But I do know that no such problems exist with the two file descriptor versions.

> I would expect this to be more
> efficient as well as less error-prone on Windows.
>
>
> Oscar
>

---
Roy Smith
roy@panix.com
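
For comparison with the tee() pattern quoted above, here is a minimal sketch of what a two-file-object version might look like. It is not the code anyone posted in this thread: the function name, the `marker` and `lookahead` parameters, and the choice to return a list instead of printing are all assumptions made for illustration. It is written for Python 3 and opens the file in binary mode so that tell() stays usable while iterating. Only the second handle is ever repositioned, so the main pass reads straight through.

```python
from itertools import islice


def lines_with_lookahead(path, marker=b"*", lookahead=3):
    """Hypothetical helper: return every line of `path`; after each line
    containing `marker`, also return the next `lookahead` lines, indented,
    without consuming them from the main pass."""
    result = []
    # Two independent file objects on the same file. Binary mode is used
    # because text-mode files disable tell() during iteration in Python 3.
    with open(path, "rb") as outer_f, open(path, "rb") as inner_f:
        for outer in outer_f:
            result.append(outer.decode().rstrip("\n"))
            if marker in outer:
                # Point the second handle just past the current line ...
                inner_f.seek(outer_f.tell())
                # ... and peek ahead; outer_f itself is never repositioned.
                for inner in islice(inner_f, lookahead):
                    result.append("    " + inner.decode().rstrip("\n"))
    return result
```

Calling `print("\n".join(lines_with_lookahead("tmp.txt")))` would give output in the same shape as the quoted loop. Nothing accumulates between markers here: each peek reads at most `lookahead` lines through the second handle and keeps no buffer beyond what the file object itself holds.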