iterating over a file with two pointers

Started by	nikhil Pandey <nikhilpandey90@gmail.com>
First post	2013-09-18 04:12 -0700
Last post	2013-09-19 08:04 +0100
Articles	19 — 9 participants

  iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 04:12 -0700
    Re: iterating over a file with two pointers Chris Angelico <rosuav@gmail.com> - 2013-09-18 21:21 +1000
      Re: iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 05:07 -0700
        Re: iterating over a file with two pointers Travis Griggs <travisgriggs@gmail.com> - 2013-09-18 09:18 -0700
    Re: iterating over a file with two pointers Dave Angel <davea@davea.name> - 2013-09-18 11:39 +0000
      Re: iterating over a file with two pointers Roy Smith <roy@panix.com> - 2013-09-18 08:56 -0400
        Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-18 14:09 +0100
        Re: iterating over a file with two pointers Roy Smith <roy@panix.com> - 2013-09-18 10:36 -0400
        Re: iterating over a file with two pointers Dave Angel <davea@davea.name> - 2013-09-18 20:07 +0000
        Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-19 09:23 +0200
        Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-19 15:16 +0100
        Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-19 16:38 +0200
        Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-19 15:48 +0100
    Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-18 13:44 +0200
      Re: iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 05:14 -0700
        Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-18 14:54 +0200
        Re: iterating over a file with two pointers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-19 02:40 +0000
    Re: iterating over a file with two pointers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-19 02:56 +0000
      Re: iterating over a file with two pointers Joshua Landau <joshua@landau.ws> - 2013-09-19 08:04 +0100

#54369 — iterating over a file with two pointers

From	nikhil Pandey <nikhilpandey90@gmail.com>
Date	2013-09-18 04:12 -0700
Subject	iterating over a file with two pointers
Message-ID	<3018b3d4-f914-4c89-9f26-cd4b2af32e73@googlegroups.com>

Hi,
I want to iterate over the lines of a file, and when I find certain lines, I need another loop that starts at the line after that "CERTAIN" line and runs for a few (say 20) lines.
So basically I need two pointers to lines: one for the outer loop (each line in the file) and one for the inner loop. How can I do that in Python?
Please help. I am stuck on this.

#54370

From	Chris Angelico <rosuav@gmail.com>
Date	2013-09-18 21:21 +1000
Message-ID	<mailman.113.1379503314.18130.python-list@python.org>
In reply to	#54369

On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
> hi,
> I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
> so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
> please help. I am stuck up on this.

After the inner loop finishes, do you want to go back to where the
outer loop left off, or should the outer loop continue from the point
where the inner loop stopped? In other words, do you want to locate
overlapping sections, or not? Both are possible, but the solutions
will look somewhat different.

ChrisA

#54375

From	nikhil Pandey <nikhilpandey90@gmail.com>
Date	2013-09-18 05:07 -0700
Message-ID	<e30d9950-7b29-43ed-b85b-455a8d0e9fee@googlegroups.com>
In reply to	#54370

On Wednesday, September 18, 2013 4:51:51 PM UTC+5:30, Chris Angelico wrote:
> On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
> > hi,
> > I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
> > so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
> > please help. I am stuck up on this.
>
> After the inner loop finishes, do you want to go back to where the
> outer loop left off, or should the outer loop continue from the point
> where the inner loop stopped? In other words, do you want to locate
> overlapping sections, or not? Both are possible, but the solutions
> will look somewhat different.
>
> ChrisA

Hi Chris,
After the inner loop finishes, I want to go back to the line after the one where the outer loop left off, i.e. the lines covered by the inner loop will be traversed again by the outer loop.
1>> I iterate over the lines of the file.
2>> When I find a match in a certain line, I start another loop that runs over the subsequent lines until some condition is met.
3>> Then I come back to where I left off and repeat 1. (Ideally I want to delete the line in the inner loop where that condition is met, but if it is not deleted, that's OK.)

#54388

From	Travis Griggs <travisgriggs@gmail.com>
Date	2013-09-18 09:18 -0700
Message-ID	<mailman.126.1379521144.18130.python-list@python.org>
In reply to	#54375

On Sep 18, 2013, at 5:07 AM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:

> On Wednesday, September 18, 2013 4:51:51 PM UTC+5:30, Chris Angelico wrote:
>> On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
>>> hi,
>>> I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
>>> so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
>>> please help. I am stuck up on this.
>>
>> After the inner loop finishes, do you want to go back to where the
>> outer loop left off, or should the outer loop continue from the point
>> where the inner loop stopped? In other words, do you want to locate
>> overlapping sections, or not? Both are possible, but the solutions
>> will look somewhat different.
>>
>> ChrisA
>
> Hi Chris,
> After the inner loop finishes, I want to go back to the next line from where the outer loop was left i.e the lines of the inner loop will be traversed again in the outer loop.
> 1>>I iterate over lines of the file
> 2>> when i find a match in a certain line, i start another loop till some condition is met in the subsequent lines
> 3>> then i come back to where i left and repeat 1(ideally i want to delete that line in inner loop where that condition is met, but even if it is not deleted, its OK)


Just curious, do you really need two loops and two file handles? I don't have full details about what you're really doing, but from the extra detail you've provided, it seems to me that it might be easier to iterate over the lines of the file once and use a latch boolean to indicate when lines need additional processing. I modified Chris's example input to look like:

alpha
*beta
gamma+
delta
epsilon
zeta
*eta
kappa
tau
pi+
omicron

And then shot it with the following:

#!/usr/bin/env python3
with open("samplein.txt") as file:
    reversing = False
    for line in (raw.strip() for raw in file):
        if reversing:
            print('____', line[::-1], '____')
            reversing = not line.endswith('+')
        else:
            print(line)
            reversing = line.startswith('*')

Which begins reversing lines as it's working through them, until a different condition is met.
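
For anyone who wants to poke at the latch idea without a sample file, here is a minimal sketch of the same logic as a function that returns the output lines instead of printing them. The name latch_reverse and the shortened sample list are made up for this sketch; the '*' and '+' conventions are the ones from the script above.

```python
def latch_reverse(lines):
    """Return processed lines; a boolean latch decides how each line is handled."""
    out = []
    reversing = False
    for line in (raw.strip() for raw in lines):
        if reversing:
            out.append("____ " + line[::-1] + " ____")
            # Keep the latch set until a line ending in '+' has been handled.
            reversing = not line.endswith("+")
        else:
            out.append(line)
            # A line starting with '*' sets the latch for the lines after it.
            reversing = line.startswith("*")
    return out


latch_sample = ["alpha", "*beta", "gamma+", "delta", "*eta", "kappa", "pi+", "omicron"]
latch_result = latch_reverse(latch_sample)
```

Note that this makes a single pass, so lines handled while the latch is set are not visited again by an outer loop.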

Travis Griggs

#54372

From	Dave Angel <davea@davea.name>
Date	2013-09-18 11:39 +0000
Message-ID	<mailman.115.1379504419.18130.python-list@python.org>
In reply to	#54369

On 18/9/2013 07:21, Chris Angelico wrote:

> On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
>> hi,
>> I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
>> so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
>> please help. I am stuck up on this.
>
> After the inner loop finishes, do you want to go back to where the
> outer loop left off, or should the outer loop continue from the point
> where the inner loop stopped? In other words, do you want to locate
> overlapping sections, or not? Both are possible, but the solutions
> will look somewhat different.
>

In addition, is this really a text file?  For binary files, you could
use seek(), and manage things yourself.  But that's not strictly legal
in a text file, and may work on one system, not on another.

I'd suggest you open the file twice, and get two file objects.  Then you
can iterate over them independently.

Or if the file is under a few hundred meg, just do a readlines, and do
the two iterators over that.  That way, the inner loop could just
iterate over a simple slice.

infile = open(....  "rb")
lines = infile.readlines()
infile.close()

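for index, line in enumerate(lines):
    # the slice end has to move with index; a fixed 20 would give an
    # empty slice once index reaches 19
    for inner in lines[index+1:index+21]:
         ...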
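
A minimal Python 3 sketch of the readlines-and-slice idea, assuming the whole file fits in memory. The function name windows_after_matches, the is_match callback and the sample list are hypothetical, chosen only for illustration.

```python
def windows_after_matches(lines, is_match, window=20):
    """For each matching line, collect up to `window` lines that follow it."""
    results = []
    for index, line in enumerate(lines):
        if is_match(line):
            # The slice starts just after the match and is at most `window` long.
            results.append((line, lines[index + 1:index + 1 + window]))
    return results


slice_sample = ["alpha\n", "*beta\n", "gamma\n", "*delta\n", "epsilon\n"]
slice_found = windows_after_matches(
    slice_sample, lambda s: s.startswith("*"), window=2
)
```

Because the outer loop keeps walking the full list, lines seen in one window are visited again by the outer loop, which is the overlapping behaviour asked for earlier in the thread.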

-- 
DaveA

#54379

From	Roy Smith <roy@panix.com>
Date	2013-09-18 08:56 -0400
Message-ID	<roy-B13238.08561818092013@news.panix.com>
In reply to	#54372

> > On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> 
> > wrote:
> >> hi,
> >> I want to iterate over the lines of a file and when i find certain lines, 
> >> i need another loop starting from the next of that "CERTAIN" line till a 
> >> few (say 20) lines later.
> >> so, basically i need two pointers to lines (one for outer loop(for each 
> >> line in file)) and one for inner loop. How can i do that in python?
> >> please help. I am stuck up on this.
> [...]

In article <mailman.115.1379504419.18130.python-list@python.org>,
 Dave Angel <davea@davea.name> wrote:
[I hope I unwound the multi-layer quoting right]
> In addition, is this really a text file?  For binary files, you could
> use seek(), and manage things yourself.  But that's not strictly legal
> in a text file, and may work on one system, not on another.

Why is seek() not legal on a text file?  The only issue I'm aware of is 
the note at http://docs.python.org/2/library/stdtypes.html, which says:

"On Windows, tell() can return illegal values (after an fgets()) when 
reading files with Unix-style line-endings. Use binary mode ('rb') to 
circumvent this problem."

so, don't do that (i.e. read unix-line-terminated files on windows).  
But assuming you're not in that situation, it seems like something like 
this should work:

> I'd suggest you open the file twice, and get two file objects.  Then you
> can iterate over them independently.

and use tell() to keep them in sync.  Something along the lines of (not 
tested):

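f1 = open("my_file")
f2 = open("my_file")

while True:
   line = f1.readline()
   if not line:
      break
   # matches_pattern() stands in for whatever test you need
   if matches_pattern(line):
      # f1 is now just past the matching line, so f2 starts at the next one
      f2.seek(f1.tell())
      for i in range(20):
         line = f2.readline()
         if not line:
            break
         print line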

Except for the specific case noted above (i.e. reading a unix file on a 
windows box, so don't do that), it doesn't matter that seek() does funny 
things with windows line endings, because tell() does the same funny 
things.  Doing f2.seek(f1.tell()) will get the two file pointers into 
the same place in both files.
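
A Python 3 sketch of the two-handle idea, for comparison. It assumes binary mode is acceptable, so that tell() and seek() deal in plain byte offsets, and it uses readline() rather than for-loop iteration on the outer handle. The function name, the is_match callback and the temporary demo file are all made up for this sketch.

```python
import os
import tempfile


def scan_with_two_handles(path, is_match, window=20):
    """Outer handle walks the file; inner handle re-reads lines after each match."""
    results = []
    with open(path, "rb") as outer, open(path, "rb") as inner:
        while True:
            line = outer.readline()
            if not line:
                break
            if is_match(line):
                # outer is positioned just past the matching line.
                inner.seek(outer.tell())
                following = []
                for _ in range(window):
                    nxt = inner.readline()
                    if not nxt:
                        break  # hit end of file before the window filled up
                    following.append(nxt)
                results.append((line, following))
    return results


# Small demonstration on a throwaway file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha\n*beta\ngamma\ndelta\n*eta\n")
    demo_path = tmp.name
try:
    two_handle_demo = scan_with_two_handles(
        demo_path, lambda b: b.startswith(b"*"), window=2
    )
finally:
    os.remove(demo_path)
```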

#54380

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-09-18 14:09 +0100
Message-ID	<mailman.119.1379509785.18130.python-list@python.org>
In reply to	#54379

On 18 September 2013 13:56, Roy Smith <roy@panix.com> wrote:
>
>> > On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com>
>> > wrote:
>> >> hi,
>> >> I want to iterate over the lines of a file and when i find certain lines,
>> >> i need another loop starting from the next of that "CERTAIN" line till a
>> >> few (say 20) lines later.
>> >> so, basically i need two pointers to lines (one for outer loop(for each
>> >> line in file)) and one for inner loop. How can i do that in python?
>> >> please help. I am stuck up on this.
>> [...]
>
> In article <mailman.115.1379504419.18130.python-list@python.org>,
>  Dave Angel <davea@davea.name> wrote:
> [I hope I unwound the multi-layer quoting right]
>> In addition, is this really a text file?  For binary files, you could
>> use seek(), and manage things yourself.  But that's not strictly legal
>> in a text file, and may work on one system, not on another.
>
> Why is seek() not legal on a text file?  The only issue I'm aware of is
> the note at http://docs.python.org/2/library/stdtypes.html, which says:
>
> "On Windows, tell() can return illegal values (after an fgets()) when
> reading files with Unix-style line-endings. Use binary mode ('rb') to
> circumvent this problem."
>
> so, don't do that (i.e. read unix-line-terminated files on windows).
> But assuming you're not in that situation, it seems like something like
> this this should work:
>
>> I'd suggest you open the file twice, and get two file objects.  Then you
>> can iterate over them independently.

There's no need to use OS resources by opening the file twice or to
screw up the IO caching with seek(). Peter's version holds just as
many lines as is necessary in an internal Python buffer and performs
the minimum possible amount of IO. I would expect this to be more
efficient as well as less error-prone on Windows.
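
A rough illustration of the "hold only as many lines as necessary" idea, using a different technique from the tee()-based code discussed in this thread: a collections.deque as a bounded lookahead buffer over a single iterator. The name with_lookahead and the window parameter are made up for this sketch, and no performance claim is intended.

```python
from collections import deque
from itertools import islice


def with_lookahead(lines, window):
    """Yield (line, following), where following holds up to `window` later lines."""
    it = iter(lines)
    # Prime the buffer with the current line plus its lookahead window.
    buf = deque(islice(it, window + 1))
    while buf:
        current = buf.popleft()
        yield current, tuple(buf)
        # Pull at most one more line so the buffer never exceeds `window` items.
        for nxt in islice(it, 1):
            buf.append(nxt)


lookahead_pairs = list(with_lookahead(["a", "b", "c", "d"], 2))
```

Each source line is read exactly once, and the caller can decide per line whether to look at the lookahead tuple at all.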


Oscar

#54382

From	Roy Smith <roy@panix.com>
Date	2013-09-18 10:36 -0400
Message-ID	<mailman.120.1379515006.18130.python-list@python.org>
In reply to	#54379

> Dave Angel <davea@davea.name> wrote (and I agreed with):
>> I'd suggest you open the file twice, and get two file objects.  Then you
>> can iterate over them independently.

On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
> There's no need to use OS resources by opening the file twice or to
> screw up the IO caching with seek().

There's no reason NOT to use OS resources.  That's what the OS is there for; to make life easier on application programmers.  Opening a file twice costs almost nothing.  File descriptors are almost as cheap as whitespace.

> Peter's version holds just as many lines as is necessary in an
> internal Python buffer and performs the minimum possible
> amount of IO.

I believe by "Peter's version", you're talking about:

> from itertools import islice, tee 
> 
> with open("tmp.txt") as f: 
>     while True: 
>         for outer in f: 
>             print outer, 
>             if "*" in outer: 
>                 f, g = tee(f) 
>                 for inner in islice(g, 3): 
>                     print "   ", inner, 
>                 break 
>         else: 
>             break 

There's this note from http://docs.python.org/2.7/library/itertools.html#itertools.tee:

> This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

I have no idea how that interacts with the pattern above where you call tee() serially.  You're basically doing

with open("my_file") as f:
    while True:
        f, g = tee(f)

Are all of those g's just hanging around, eating up memory, while waiting to be garbage collected?  I have no idea.  But I do know that no such problems exist with the two file descriptor versions.
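
For experimenting with that question, here is the quoted pattern rewritten as a Python 3 function that works on any iterable of lines. The name tee_scan, the marker and window parameters, and the sample list are assumptions of this sketch; it only reproduces the control flow and says nothing about how much memory the chained tee objects hold.

```python
from itertools import islice, tee


def tee_scan(lines, marker="*", window=3):
    """Return outer lines, with indented copies of the lines after each match."""
    out = []
    it = iter(lines)
    while True:
        for outer in it:
            out.append(outer)
            if marker in outer:
                # Split the stream: `it` continues the outer loop later,
                # `lookahead` is consumed by the inner loop right now.
                it, lookahead = tee(it)
                for inner in islice(lookahead, window):
                    out.append("    " + inner)
                break  # restart the for loop on the new `it`
        else:
            break  # outer iterator exhausted without a further match
    return out


tee_result = tee_scan(["a", "*b", "c", "d", "e"], window=2)
```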

> I would expect this to be more
> efficient as well as less error-prone on Windows.
> 
> 
> Oscar
> 

---
Roy Smith
roy@panix.com

#54399

From	Dave Angel <davea@davea.name>
Date	2013-09-18 20:07 +0000
Message-ID	<mailman.134.1379534879.18130.python-list@python.org>
In reply to	#54379

On 18/9/2013 10:36, Roy Smith wrote:

>> Dave Angel <davea@davea.name> wrote (and I agreed with):
>>> I'd suggest you open the file twice, and get two file objects.  Then you
>>> can iterate over them independently.
>
>
> On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
>> There's no need to use OS resources by opening the file twice or to
>> screw up the IO caching with seek().
>
> There's no reason NOT to use OS resources.  That's what the OS is there for; to make life easier on application programmers.  Opening a file twice costs almost nothing.  File descriptors are almost as cheap as whitespace.
>
>> Peter's version holds just as many lines as is necessary in an
>> internal Python buffer and performs the minimum possible
>> amount of IO.
>
> I believe by "Peter's version", you're talking about:
>
>> from itertools import islice, tee 
>> 
>> with open("tmp.txt") as f: 
>>     while True: 
>>         for outer in f: 
>>             print outer, 
>>             if "*" in outer: 
>>                 f, g = tee(f) 
>>                 for inner in islice(g, 3): 
>>                     print "   ", inner, 
>>                 break 
>>         else: 
>>             break 
>
>
> There's this note from http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>
>> This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
>
>
> I have no idea how that interacts with the pattern above where you call tee() serially.  You're basically doing
>
> with open("my_file") as f:
>     while True:
>         f, g = tee(f)
>
> Are all of those g's just hanging around, eating up memory, while waiting to be garbage collected?  I have no idea.  But I do know that no such problems exist with the two file descriptor versions.
>
>> I would expect this to be more
>> efficient as well as less error-prone on Windows.
>> 
>> 
>> Oscar
>> 
>
>
> ---
> Roy Smith
> roy@panix.com
>

And if you're willing to ignore the possibility that the text file has
unix line endings, I'm willing to ignore the possibility that the text
file has a huge number of lines.  Everything is MUCH simpler if one
assumes readlines() will work.  Most of these other approaches are much
more complex than the OP probably needs, if he ever gets around to
actually describing his requirements.

BTW, please post in text, all that html is really annoying.


-- 
DaveA


#54418

From	Peter Otten <__peter__@web.de>
Date	2013-09-19 09:23 +0200
Message-ID	<mailman.146.1379575403.18130.python-list@python.org>
In reply to	#54379

Roy Smith wrote:

>> Dave Angel <davea@davea.name> wrote (and I agreed with):
>>> I'd suggest you open the file twice, and get two file objects.  Then you
>>> can iterate over them independently.
> 
> 
> On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
>> There's no need to use OS resources by opening the file twice or to
>> screw up the IO caching with seek().
> 
> There's no reason NOT to use OS resources.  That's what the OS is there
> for; to make life easier on application programmers.  Opening a file twice
> costs almost nothing.  File descriptors are almost as cheap as whitespace.
> 
>> Peter's version holds just as many lines as is necessary in an
>> internal Python buffer and performs the minimum possible
>> amount of IO.
> 
> I believe by "Peter's version", you're talking about:
> 
>> from itertools import islice, tee
>> 
>> with open("tmp.txt") as f:
>>     while True:
>>         for outer in f:
>>             print outer,
>>             if "*" in outer:
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print "   ", inner,
                   del g # a good idea in the general case
>>                 break
>>         else:
>>             break
> 
> 
> There's this note from
> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
> 
>> This itertool may require significant auxiliary storage (depending on how
>> much temporary data needs to be stored). In general, if one iterator uses
>> most or all of the data before another iterator starts, it is faster to
>> use list() instead of tee().
> 
> 
> I have no idea how that interacts with the pattern above where you call
> tee() serially.  

As I understand it the above says that

items = infinite()
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass

stores 1000 items and can go on forever, but

items = infinite()
a, b = tee(items)
for item in a:
    pass

will consume unbounded memory and that if items is finite using a list 
instead of tee is more efficient. The documentation says nothing about

items = infinite()
a, b = tee(items)
del a
for item in b:
    pass

so you have to trust Mr Hettinger or come up with a test case...
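
One possible test case, as a minimal sketch. It assumes Python 3 on CPython with the standard tracemalloc module, uses itertools.count() as a stand-in for infinite(), and the helper name peak_bytes is made up for illustration. It compares the peak allocation when the sibling iterator is kept alive with the peak when it is deleted first.

```python
import tracemalloc
from itertools import count, islice, tee


def peak_bytes(drop_sibling, n=100000):
    """Peak traced allocation while one tee branch consumes n items."""
    a, b = tee(count())      # count() stands in for infinite()
    if drop_sibling:
        del a                # b becomes the only consumer
    tracemalloc.start()
    for _ in islice(b, n):   # advance b; a (if kept) stays at the start
        pass
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


if __name__ == "__main__":
    print("sibling kept:   ", peak_bytes(drop_sibling=False), "bytes")
    print("sibling deleted:", peak_bytes(drop_sibling=True), "bytes")
```

On CPython one would expect the first figure to grow with n and the second to stay small; other implementations may behave differently.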

> You're basically doing
> 
> with open("my_file") as f:
> while True:
>     f, g = tee(f)
> 
> Are all of those g's just hanging around, eating up memory, while waiting
> to be garbage collected?  I have no idea.  

I'd say you've just devised a nice test to find out ;)

> But I do know that no such
> problems exist with the two file descriptor versions.

The trade-offs are different. My version works with arbitrary iterators 
(think stdin), but will consume unbounded amounts of memory when the inner 
loop doesn't stop.


#54422

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-09-19 15:16 +0100
Message-ID	<mailman.148.1379600226.18130.python-list@python.org>
In reply to	#54379

On 19 September 2013 08:23, Peter Otten <__peter__@web.de> wrote:
> Roy Smith wrote:
>>
>> I believe by "Peter's version", you're talking about:
>>
>>> from itertools import islice, tee
>>>
>>> with open("tmp.txt") as f:
>>>     while True:
>>>         for outer in f:
>>>             print outer,
>>>             if "*" in outer:
>>>                 f, g = tee(f)
>>>                 for inner in islice(g, 3):
>>>                     print "   ", inner,
>                    del g # a good idea in the general case
>>>                 break
>>>         else:
>>>             break
>>
>> There's this note from
>> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>>
>>> This itertool may require significant auxiliary storage (depending on how
>>> much temporary data needs to be stored). In general, if one iterator uses
>>> most or all of the data before another iterator starts, it is faster to
>>> use list() instead of tee().

This is referring to the case where your two iterators get out of sync
by a long way. If you only consume 3 extra items it will just store
those 3 items in a list.

>> I have no idea how that interacts with the pattern above where you call
>> tee() serially.

Fair point.

>
> As I understand it the above says that
>
> items = infinite()
> a, b = tee(items)
> for item in islice(a, 1000):
>    pass
> for pair in izip(a, b):
>     pass
>
> stores 1000 items and can go on forever, but
>
> items = infinite()
> a, b = tee(items)
> for item in a:
>     pass
>
> will consume unbounded memory and that if items is finite using a list
> instead of tee is more efficient. The documentation says nothing about
>
> items = infinite()
> a, b = tee(items)
> del a
> for item in b:
>    pass
>
> so you have to trust Mr Hettinger or come up with a test case...
>
>> You're basically doing
>>
>> with open("my_file") as f:
>> while True:
>>     f, g = tee(f)
>>
>> Are all of those g's just hanging around, eating up memory, while waiting
>> to be garbage collected?  I have no idea.
>
> I'd say you've just devised a nice test to find out ;)

$ cat tee.py
#!/usr/bin/env python

import sys
from itertools import tee

items = iter(range(int(sys.argv[1])))

while True:
    for x in items:
        items, discard = tee(items)
        break
    else:
        break

print(x)

$ time py -3.3 ./tee.py 100000000
99999999

real    1m47.711s
user    0m0.015s
sys     0m0.000s

While running the above python.exe was using 6MB of memory (according
to Task Manager). I believe this is because tee() works as follows
(which I made up but it's how I imagine it).

When you call tee(iterator) it creates two _tee objects and one
_teelist object. The _teelist object stores all of the items that have
been seen by only one of _tee1 and _tee2, a reference to iterator and
a flag indicating which _tee object has seen more items. When, say,
_tee2 is deallocated, the _teelist becomes singly owned and no longer
needs to accumulate items (so it doesn't). So the dereferenced
discard will not cause unbounded growth in memory usage.
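
The buffering behaviour described above can be modelled in pure Python. This is only a sketch in the same spirit: it is not CPython's implementation, it assumes Python 3, and linked_tee is a made-up name. Both branches walk a shared singly linked list of [value, next] nodes, and a node becomes garbage as soon as no branch still refers to it or to an earlier node.

```python
from itertools import islice


def linked_tee(iterable):
    """Return two independent iterators over iterable (illustrative model)."""
    source = iter(iterable)

    def branch(link):
        # link is a two-slot list [value, next_link];
        # the tail node has next_link set to None.
        while True:
            if link[1] is None:
                # This branch is at the tail: pull a new item and extend.
                try:
                    link[0] = next(source)
                except StopIteration:
                    return
                link[1] = [None, None]
            # Step forward; the old node is freed once nobody references it.
            value, link = link
            yield value

    tail = [None, None]
    return branch(tail), branch(tail)


a, b = linked_tee(range(5))
first = list(islice(a, 3))   # a runs ahead; 0, 1, 2 stay buffered for b
rest_b = list(b)             # b replays the buffer, then reads 3 and 4
rest_a = list(a)             # a picks up the nodes that b appended
print(first, rest_b, rest_a)  # prints [0, 1, 2] [0, 1, 2, 3, 4] [3, 4]
```

If one branch is deleted, nothing refers to the nodes behind the surviving branch any more, so in this model the buffer cannot grow.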

There is a separate problem which is that if you call tee() multiple
times then you end up with a chain of tees and each next call would go
through each one of them. This would cause a linear growth in the time
taken to call next() leading to quadratic time performance overall.
However, this does not occur with the script I showed above. In
principle it's possible for a _tee object to realise that there is a
chain of singly owned _tee and _teelist objects and bypass them
calling next() on the original iterator but I don't know if this is
what happens.

However, when I ran the above script on Python 2.7 it did consume
massive amounts of memory (1.6GB) and ran slower so maybe this depends
on optimisations that were introduced in 3.x.

Here's an alternate iterator recipe that doesn't depend on these optimisations:

from itertools import islice
from collections import deque

class Peekable(object):

    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.peeked = deque()

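    def __iter__(self):
        while True:
            while self.peeked:
                yield self.peeked.popleft()
            try:
                val = next(self.iterator)
            except StopIteration:
                return  # a StopIteration escaping a generator is an error on Python 3.7+
            yield val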

    def peek(self):
        for p in self.peeked:
            yield p
        for val in self.iterator:
            self.peeked.append(val)
            yield val

with open("tmp.txt") as f:
    f = Peekable(f)
    for outer in f:
        print outer,
        if "*" in outer:
            for inner in islice(f.peek(), 3):
                print "   ", inner,

Oscar


#54424

From	Peter Otten <__peter__@web.de>
Date	2013-09-19 16:38 +0200
Message-ID	<mailman.150.1379601478.18130.python-list@python.org>
In reply to	#54379

Oscar Benjamin wrote:

> $ cat tee.py
> #!/usr/bin/env python
> 
> import sys
> from itertools import tee
> 
> items = iter(range(int(sys.argv[1])))
> 
> while True:
>     for x in items:
>         items, discard = tee(items)
>         break
>     else:
>         break
> 
> print(x)
> 
> $ time py -3.3 ./tee.py 100000000
> 99999999
> 
> real    1m47.711s
> user    0m0.015s
> sys     0m0.000s
> 
> While running the above python.exe was using 6MB of memory (according
> to Task Manager). I believe this is because tee() works as follows
> (which I made up but it's how I imagine it).

[...]

> However, when I ran the above script on Python 2.7 it did consume
> massive amounts of memory (1.6GB) and ran slower so maybe this depends
> on optimisations that were introduced in 3.x.

Did you use xrange()?


#54426

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-09-19 15:48 +0100
Message-ID	<mailman.152.1379602164.18130.python-list@python.org>
In reply to	#54379

On 19 September 2013 15:38, Peter Otten <__peter__@web.de> wrote:
>> While running the above python.exe was using 6MB of memory (according
>> to Task Manager). I believe this is because tee() works as follows
>> (which I made up but it's how I imagine it).
>
> [...]
>
>> However, when I ran the above script on Python 2.7 it did consume
>> massive amounts of memory (1.6GB) and ran slower so maybe this depends
>> on optimisations that were introduced in 3.x.
>
> Did you use xrange()?

No I didn't. :)

Okay, so with xrange() it only uses 4.6MB of memory and it runs at the
same speed: there's no problem with chaining tee objects as long as you
discard them. If you don't discard them then a script like the one I
wrote would quickly blow all the system memory.

Oscar


#54374

From	Peter Otten <__peter__@web.de>
Date	2013-09-18 13:44 +0200
Message-ID	<mailman.116.1379504643.18130.python-list@python.org>
In reply to	#54369

nikhil Pandey wrote:

> hi,
> I want to iterate over the lines of a file and when i find certain lines,
> i need another loop starting from the next of that "CERTAIN" line till a
> few (say 20) lines later. so, basically i need two pointers to lines (one
> for outer loop(for each line in file)) and one for inner loop. How can i
> do that in python? please help. I am stuck up on this.

Here's an example that prints the three lines following a line containing a 
'*':

Example data:

$ cat tmp.txt
alpha
*beta
*gamma
delta
epsilon
zeta
*eta

The python script:

$ cat tmp.py
from itertools import islice, tee

with open("tmp.txt") as f:
    while True:
        for outer in f:
            print outer,
            if "*" in outer:
                f, g = tee(f)
                for inner in islice(g, 3):
                    print "   ", inner,
                break
        else:
            break

The script's output:

$ python tmp.py
alpha
*beta
    *gamma
    delta
    epsilon
*gamma
    delta
    epsilon
    zeta
delta
epsilon
zeta
*eta
$ 

As you can see the general logic is relatively complex; it is likely that we 
can come up with a simpler solution if you describe your actual requirement 
in more detail.
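
For readers on Python 3, the same control flow can be wrapped in a generator. This is a sketch rather than the script above: the function name with_lookahead and its parameters are made up here, and it yields strings instead of printing them.

```python
from itertools import islice, tee


def with_lookahead(lines, marker="*", n=3):
    """Yield every line; after a line containing marker, also yield
    the next n lines (indented) without consuming them."""
    it = iter(lines)
    while True:
        for outer in it:
            yield outer
            if marker in outer:
                # Split the iterator: `ahead` peeks, `it` keeps our place.
                it, ahead = tee(it)
                for inner in islice(ahead, n):
                    yield "    " + inner
                del ahead  # let tee drop its buffer bookkeeping
                break      # restart the for loop on the new `it`
        else:
            return         # input exhausted


data = ["alpha", "*beta", "*gamma", "delta", "epsilon", "zeta", "*eta"]
for line in with_lookahead(data):
    print(line)
```

With the example data this should reproduce the output shown above.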


#54376

From	nikhil Pandey <nikhilpandey90@gmail.com>
Date	2013-09-18 05:14 -0700
Message-ID	<0142cea2-e534-47d0-92fe-79a87068c497@googlegroups.com>
In reply to	#54374

On Wednesday, September 18, 2013 5:14:10 PM UTC+5:30, Peter Otten wrote:
> nikhil Pandey wrote:
>
> > hi,
> > I want to iterate over the lines of a file and when i find certain lines,
> > i need another loop starting from the next of that "CERTAIN" line till a
> > few (say 20) lines later. so, basically i need two pointers to lines (one
> > for outer loop(for each line in file)) and one for inner loop. How can i
> > do that in python? please help. I am stuck up on this.
>
> Here's an example that prints the three lines following a line containing a
> '*':
>
> Example data:
>
> $ cat tmp.txt
> alpha
> *beta
> *gamma
> delta
> epsilon
> zeta
> *eta
>
> The python script:
>
> $ cat tmp.py
> from itertools import islice, tee
>
> with open("tmp.txt") as f:
>     while True:
>         for outer in f:
>             print outer,
>             if "*" in outer:
>                 f, g = tee(f)
>                 for inner in islice(g, 3):
>                     print "   ", inner,
>                 break
>         else:
>             break
>
> The script's output:
>
> $ python tmp.py
> alpha
> *beta
>     *gamma
>     delta
>     epsilon
> *gamma
>     delta
>     epsilon
>     zeta
> delta
> epsilon
> zeta
> *eta
> $
>
> As you can see the general logic is relatively complex; it is likely that we
> can come up with a simpler solution if you describe your actual requirement
> in more detail.

hi,
I want to iterate in the inner loop by reading each line until some condition is met. How can I do that? Thanks for this code.


#54378

From	Peter Otten <__peter__@web.de>
Date	2013-09-18 14:54 +0200
Message-ID	<mailman.118.1379508875.18130.python-list@python.org>
In reply to	#54376

nikhil Pandey wrote:

> On Wednesday, September 18, 2013 5:14:10 PM UTC+5:30, Peter Otten wrote:

> I want to iterate in the inner loop by reading each line until some
> condition is met. How can I do that? Thanks for this code.

That's not what I had in mind when I asked you to

>> describe your actual requirement in more detail.

Anyway, change

[...]
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print "   ", inner,
>>                 break
[...]

to

                f, g = tee(f)
                for inner in g:
                    if some condition:
                        break
                    print "   ", inner,
                break

in my example.
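
Spelled out as a self-contained Python 3 sketch, with "some condition" replaced by a caller-supplied predicate. The names scan, is_start and is_end are hypothetical, chosen only for this illustration. Note that if is_end never matches, the inner loop runs to the end of the input and tee has to buffer everything it read.

```python
from itertools import tee


def scan(lines, is_start, is_end):
    """Yield every line; after a start line, also yield the following
    lines (indented) up to, but not including, the first end line."""
    it = iter(lines)
    while True:
        for outer in it:
            yield outer
            if is_start(outer):
                it, ahead = tee(it)
                for inner in ahead:
                    if is_end(inner):
                        break
                    yield "    " + inner
                del ahead
                break  # continue the outer loop from the saved position
        else:
            return


sample = ["a", "*b", "c", "d", "END", "e"]
for line in scan(sample, lambda s: s.startswith("*"), lambda s: s == "END"):
    print(line)
```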


#54412

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-09-19 02:40 +0000
Message-ID	<523a6417$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to	#54376

On Wed, 18 Sep 2013 05:14:23 -0700, nikhil Pandey wrote:

> I want to iterate in the inner loop by reading each line until some
> condition is met. How can I do that? Thanks for this code.

while not condition:
    read line


Re-write using Python syntax, and you are done.
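
One way the pseudocode might look in Python 3, as a sketch: read_until and its condition argument are made-up names, and io.StringIO stands in for a real file.

```python
import io


def read_until(f, condition):
    """Read lines from f until condition(line) is true or f is exhausted.

    The line that satisfies the condition is consumed but not returned.
    """
    lines = []
    while True:
        line = f.readline()
        if not line or condition(line):
            break
        lines.append(line)
    return lines


sample = io.StringIO("one\ntwo\nstop\nthree\n")
print(read_until(sample, lambda line: line.startswith("stop")))
```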



-- 
Steven


#54415

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-09-19 02:56 +0000
Message-ID	<523a67c3$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to	#54369

On Wed, 18 Sep 2013 04:12:05 -0700, nikhil Pandey wrote:

> hi,
> I want to iterate over the lines of a file and when i find certain
> lines, i need another loop starting from the next of that "CERTAIN" line
> till a few (say 20) lines later. so, basically i need two pointers to
> lines (one for outer loop(for each line in file)) and one for inner
> loop. How can i do that in python? please help. I am stuck up on this.

No, you don't "need" two pointers to lines. That is just one way to solve 
this problem. You can solve it many ways.

One way, for small files (say, under one million lines), is to read the 
whole file into a list, then have two pointers to a line:

lines = file.readlines()
p = q = 0

while p < len(lines):
    print(lines[p])
    p += 1

then advance the pointers p and q as needed. This is the most flexible 
way to do it: you can have as many pointers as needed, you can back-
track, jump forward, jump back, and it is all high-speed random-access 
memory accesses. Except for the initial readlines, none of it is slow I/O 
processing.
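
A sketch of how the two indexes might be used for the original task (Python 3; the function name lookahead_by_index and the marker/n parameters are assumptions borrowed from the tee example earlier in the thread, not part of this post):

```python
def lookahead_by_index(lines, marker="*", n=3):
    """Return every line, followed (indented) by up to n lines after
    each line that contains marker."""
    out = []
    p = 0
    while p < len(lines):
        out.append(lines[p])
        if marker in lines[p]:
            q = p + 1                      # second pointer starts just after p
            while q < len(lines) and q <= p + n:
                out.append("    " + lines[q])
                q += 1
        p += 1                             # p is unaffected by how far q went
    return out


data = ["alpha", "*beta", "*gamma", "delta", "epsilon", "zeta", "*eta"]
print("\n".join(lookahead_by_index(data)))
```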

Another solution is to use a state-machine:

for line in somefile:
    if state == SCANNING:
        do_something()
    elif state == PROCESSING:
        do_something_else()
    elif state == WOBBLING:
        wobble()
    state = adjust_state(line)

You can combine the two, of course, and have a state machine with 
multiple pointers to a list of lines.
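
A concrete two-state version, as a sketch in Python 3. The state names are reused from the outline above; classify, marker and n are made up for illustration. Unlike the two-pointer approaches, a single pass cannot revisit lines, so a marker line that falls inside a block is treated as part of that block rather than as a new start.

```python
SCANNING, PROCESSING = "scanning", "processing"


def classify(lines, marker="*", n=3):
    """Yield (state, line) pairs; the n lines after a marker line
    are reported in the PROCESSING state."""
    state = SCANNING
    remaining = 0
    for line in lines:
        yield state, line
        if state == SCANNING:
            if marker in line:
                state, remaining = PROCESSING, n
        else:
            remaining -= 1
            if remaining == 0:
                state = SCANNING


data = ["alpha", "*beta", "*gamma", "delta", "epsilon", "zeta", "*eta"]
for state, line in classify(data):
    print(state, line)
```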

Using itertools.tee, you can potentially combine these solutions with the 
straightforward for-loop over a list. The danger of itertools.tee is that 
it may use as much memory as reading the entire file into memory at once, 
but the benefit is that it may use much less. But personally, I find list-
based processing with random-access by index much easier to understand 
than itertools.tee solutions.

-- 
Steven


#54417

From	Joshua Landau <joshua@landau.ws>
Date	2013-09-19 08:04 +0100
Message-ID	<mailman.145.1379574349.18130.python-list@python.org>
In reply to	#54415

Although "tee" is most certainly preferable because IO is far slower
than the small amounts of memory "tee" will use, you do have this
option:

    def iterate_file_lines(file):
        """
        Iterate over lines in a file; unlike normal
        iteration, this allows seeking.
        """
        while True:
            line = file.readline()
            if not line:
                break

            yield line


    thefile = open("/tmp/thefile")
    thelines = iterate_file_lines(thefile)

    for line in thelines:
        print("Outer:", repr(line))

        if is_start(line):
            outer_position = thefile.tell()

            for line in thelines:
                print("Inner:", repr(line))

                if is_end(line):
                    break

            thefile.seek(outer_position)

It's simpler than having two files but probably not faster. "tee" will
almost certainly be a much better choice (unless the subsections can't
fit in memory), and my method also gives up the ability to change the
order in which these things are processed.

If you want to change up the order to another defined order, you can
think about storing the subsections, but if you want to support
independent iteration you'll need to seek before every "readline"
which is a bit silly.

Basically, read it all into memory like Steven D'Aprano suggested. If
you really don't want to, use "tee". If you can't handle non-constant
memory usage (really? You're reading lines, man) I'd suggest my
method. If you can't handle the inflexibility there, use multiple
files.
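
For completeness, the "multiple files" option might look like the sketch below (Python 3). It opens the same path twice in binary mode, so that tell() gives a plain byte offset that is valid for both handles; the function name scan_two_handles and the marker/n parameters are made up for illustration.

```python
def scan_two_handles(path, marker=b"*", n=3):
    """Return every line of the file, followed (indented) by up to n
    lines after each line that contains marker."""
    out = []
    with open(path, "rb") as outer, open(path, "rb") as inner:
        for line in outer:
            out.append(line)
            if marker in line:
                # Point the second handle just past the current line.
                inner.seek(outer.tell())
                for _ in range(n):
                    ahead = inner.readline()
                    if not ahead:      # end of file
                        break
                    out.append(b"    " + ahead)
    return out
```

The outer handle never moves backwards here, so memory use stays constant however long the inner loop runs.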

There, is that enough choices?
