Re: iterating over a file with two pointers

From Peter Otten <__peter__@web.de>
Subject Re: iterating over a file with two pointers
Date 2013-09-19 09:23 +0200
Organization None
References (1 earlier) <CAPTjJmoyrJqVR29MeDzcfA9K=gGgHSuqO3uCNXGLQs7APLJByA@mail.gmail.com> <mailman.115.1379504419.18130.python-list@python.org> <roy-B13238.08561818092013@news.panix.com> <CAHVvXxQa6rsrD669kL-EeqCQFn3jKH-k=eWY5iey4RwVBD2RiA@mail.gmail.com> <52B7F7EA-C7C4-4DB6-A93C-25F4C058EB58@panix.com>
Newsgroups comp.lang.python
Message-ID <mailman.146.1379575403.18130.python-list@python.org>


Roy Smith wrote:

>> Dave Angel <davea@davea.name> wrote (and I agreed with):
>>> I'd suggest you open the file twice, and get two file objects.  Then you
>>> can iterate over them independently.
> 
> 
> On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
>> There's no need to use OS resources by opening the file twice or to
>> screw up the IO caching with seek().
> 
> There's no reason NOT to use OS resources.  That's what the OS is there
> for; to make life easier on application programmers.  Opening a file twice
> costs almost nothing.  File descriptors are almost as cheap as whitespace.
> 
>> Peter's version holds just as many lines as is necessary in an
>> internal Python buffer and performs the minimum possible
>> amount of IO.
> 
> I believe by "Peter's version", you're talking about:
> 
>> from itertools import islice, tee
>> 
>> with open("tmp.txt") as f:
>>     while True:
>>         for outer in f:
>>             print outer,
>>             if "*" in outer:
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print "   ", inner,
                   del g # a good idea in the general case
>>                 break
>>         else:
>>             break
> 
> 
> There's this note from
> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
> 
>> This itertool may require significant auxiliary storage (depending on how
>> much temporary data needs to be stored). In general, if one iterator uses
>> most or all of the data before another iterator starts, it is faster to
>> use list() instead of tee().
> 
> 
> I have no idea how that interacts with the pattern above where you call
> tee() serially.  

As I understand it, the above says that

from itertools import islice, izip, tee

items = infinite()  # any endless iterator
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass

stores 1000 items and can go on forever, but

items = infinite()
a, b = tee(items)
for item in a:
    pass

will consume unbounded memory, and that if items is finite, using a list
instead of tee is more efficient. The documentation says nothing about

items = infinite()
a, b = tee(items)
del a
for item in b:
    pass

so you have to trust Mr Hettinger or come up with a test case...
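
One possible test case, as a rough sketch rather than a definitive benchmark. It assumes Python 3 on CPython (so tracemalloc is available), uses itertools.count() as a stand-in for infinite(), and caps the run at n items so that it terminates. The helper name peak_bytes is made up for this illustration.

```python
import tracemalloc
from itertools import count, islice, tee


def peak_bytes(drop_sibling, n=100_000):
    """Peak traced memory (bytes) while one tee branch consumes n items.

    Hypothetical helper: count() stands in for infinite(), and n is
    finite only so that the experiment terminates.
    """
    tracemalloc.start()
    a, b = tee(count())
    if drop_sibling:
        del a  # no other branch will ever ask for the items b consumes
        consumer = b
    else:
        consumer = a  # b stays alive at position 0, so items must be kept
    for _ in islice(consumer, n):
        pass
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


if __name__ == "__main__":
    print("sibling kept:   ", peak_bytes(drop_sibling=False))
    print("sibling deleted:", peak_bytes(drop_sibling=True))
```

If the deleted branch really releases the buffer, the second figure should be much smaller than the first and should not grow with n.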

> You're basically doing
> 
> with open("my_file") as f:
>     while True:
>         f, g = tee(f)
> 
> Are all of those g's just hanging around, eating up memory, while waiting
> to be garbage collected?  I have no idea.  

I'd say you've just devised a nice test to find out ;)
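
A rough way to run that test, again assuming Python 3 on CPython. itertools.count() replaces the open file so the loop can run for many rounds; the function name peak_repeated_tee and the lookahead of 3 are illustrative choices, not part of the original code.

```python
import tracemalloc
from itertools import count, islice, tee


def peak_repeated_tee(rounds, lookahead=3):
    """Peak traced memory (bytes) for `rounds` passes of f, g = tee(f)."""
    tracemalloc.start()
    f = count()  # stand-in for the open file
    for _ in range(rounds):
        f, g = tee(f)  # the previous g is released when the name is rebound
        for _item in islice(g, lookahead):
            pass  # peek ahead, as the inner loop does
        next(f)  # the outer loop then advances by one item
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


if __name__ == "__main__":
    for rounds in (2_000, 20_000):
        print(rounds, peak_repeated_tee(rounds))
```

If the g objects were piling up, the peak would grow roughly in proportion to rounds; comparing a small and a large value shows whether it does.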

> But I do know that no such
> problems exist with the two file descriptor versions.

The trade-offs are different. My version works with arbitrary iterators 
(think stdin), but will consume unbounded amounts of memory when the inner 
loop doesn't stop.
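
To illustrate the "arbitrary iterators" point, here is a minimal Python 3 sketch of the same tee-based lookahead written as a generator. The name with_lookahead and its marker and n parameters are hypothetical; the explicit next() loop replaces the while/for/break structure because the iterator is rebound after each tee() call.

```python
from itertools import islice, tee


def with_lookahead(lines, marker="*", n=3):
    """Yield (line, following) pairs from any iterable of strings.

    `following` holds up to n upcoming lines when `line` contains
    `marker`, and is empty otherwise. The upcoming lines are not lost:
    they are yielded again as ordinary lines afterwards.
    """
    it = iter(lines)
    while True:
        try:
            line = next(it)
        except StopIteration:
            return
        following = []
        if marker in line:
            # From here on only the tee branches may be used, never the
            # iterator that was passed to tee().
            it, ahead = tee(it)
            following = list(islice(ahead, n))
            del ahead  # lets tee drop items once `it` has passed them
        yield line, following


if __name__ == "__main__":
    import sys

    for line, following in with_lookahead(sys.stdin):
        print(line, end="")
        for extra in following:
            print("   ", extra, end="")
```

Because islice() bounds the lookahead here, the buffer never holds more than n lines; the unbounded case arises only when the inner loop has no such limit.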

