Groups > comp.lang.python > #54369 > unrolled thread
| Started by | nikhil Pandey <nikhilpandey90@gmail.com> |
|---|---|
| First post | 2013-09-18 04:12 -0700 |
| Last post | 2013-09-19 08:04 +0100 |
| Articles | 19 — 9 participants |
iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 04:12 -0700
Re: iterating over a file with two pointers Chris Angelico <rosuav@gmail.com> - 2013-09-18 21:21 +1000
Re: iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 05:07 -0700
Re: iterating over a file with two pointers Travis Griggs <travisgriggs@gmail.com> - 2013-09-18 09:18 -0700
Re: iterating over a file with two pointers Dave Angel <davea@davea.name> - 2013-09-18 11:39 +0000
Re: iterating over a file with two pointers Roy Smith <roy@panix.com> - 2013-09-18 08:56 -0400
Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-18 14:09 +0100
Re: iterating over a file with two pointers Roy Smith <roy@panix.com> - 2013-09-18 10:36 -0400
Re: iterating over a file with two pointers Dave Angel <davea@davea.name> - 2013-09-18 20:07 +0000
Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-19 09:23 +0200
Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-19 15:16 +0100
Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-19 16:38 +0200
Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-19 15:48 +0100
Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-18 13:44 +0200
Re: iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 05:14 -0700
Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-18 14:54 +0200
Re: iterating over a file with two pointers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-19 02:40 +0000
Re: iterating over a file with two pointers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-19 02:56 +0000
Re: iterating over a file with two pointers Joshua Landau <joshua@landau.ws> - 2013-09-19 08:04 +0100
| From | nikhil Pandey <nikhilpandey90@gmail.com> |
|---|---|
| Date | 2013-09-18 04:12 -0700 |
| Subject | iterating over a file with two pointers |
| Message-ID | <3018b3d4-f914-4c89-9f26-cd4b2af32e73@googlegroups.com> |
Hi, I want to iterate over the lines of a file, and when I find certain lines, I need another loop starting from the line after that "CERTAIN" line and running until a few (say 20) lines later. So basically I need two pointers to lines: one for the outer loop (for each line in the file) and one for the inner loop. How can I do that in Python? Please help. I am stuck on this.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-09-18 21:21 +1000 |
| Message-ID | <mailman.113.1379503314.18130.python-list@python.org> |
| In reply to | #54369 |
On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
> hi,
> I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
> so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
> please help. I am stuck up on this.
After the inner loop finishes, do you want to go back to where the
outer loop left off, or should the outer loop continue from the point
where the inner loop stopped? In other words, do you want to locate
overlapping sections, or not? Both are possible, but the solutions
will look somewhat different.
ChrisA
| From | nikhil Pandey <nikhilpandey90@gmail.com> |
|---|---|
| Date | 2013-09-18 05:07 -0700 |
| Message-ID | <e30d9950-7b29-43ed-b85b-455a8d0e9fee@googlegroups.com> |
| In reply to | #54370 |
On Wednesday, September 18, 2013 4:51:51 PM UTC+5:30, Chris Angelico wrote:
> On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
>> hi,
>> I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
>> so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
>> please help. I am stuck up on this.
>
> After the inner loop finishes, do you want to go back to where the
> outer loop left off, or should the outer loop continue from the point
> where the inner loop stopped? In other words, do you want to locate
> overlapping sections, or not? Both are possible, but the solutions
> will look somewhat different.
>
> ChrisA
Hi Chris,
After the inner loop finishes, I want to go back to the line after the one where the outer loop left off, i.e. the lines of the inner loop will be traversed again in the outer loop.
1. I iterate over the lines of the file.
2. When I find a match in a certain line, I start another loop that runs until some condition is met in the subsequent lines.
3. Then I come back to where I left off and repeat step 1. (Ideally I want to delete the line in the inner loop where that condition is met, but if it is not deleted, that's OK.)
| From | Travis Griggs <travisgriggs@gmail.com> |
|---|---|
| Date | 2013-09-18 09:18 -0700 |
| Message-ID | <mailman.126.1379521144.18130.python-list@python.org> |
| In reply to | #54375 |
On Sep 18, 2013, at 5:07 AM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
> On Wednesday, September 18, 2013 4:51:51 PM UTC+5:30, Chris Angelico wrote:
>> On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
>>
>>> hi,
>>
>>> I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
>>
>>> so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
>>
>>> please help. I am stuck up on this.
>>
>>
>>
>> After the inner loop finishes, do you want to go back to where the
>>
>> outer loop left off, or should the outer loop continue from the point
>>
>> where the inner loop stopped? In other words, do you want to locate
>>
>> overlapping sections, or not? Both are possible, but the solutions
>>
>> will look somewhat different.
>>
>>
>>
>> ChrisA
>
> Hi Chris,
> After the inner loop finishes, I want to go back to the next line from where the outer loop was left i.e the lines of the inner loop will be traversed again in the outer loop.
> 1>>I iterate over lines of the file
> 2>> when i find a match in a certain line, i start another loop till some condition is met in the subsequent lines
> 3>> then i come back to where i left and repeat 1(ideally i want to delete that line in inner loop where that condition is met, but even if it is not deleted, its OK)
Just curious, do you really need two loops and file handles? It's hard to say without better details about what you're really doing, but going by the extra detail you've provided, it seems to me that just iterating over the lines of the file and using a latch boolean to indicate when to do additional processing on lines might be easier. I modified Chris's example input to look like:
alpha
*beta
gamma+
delta
epsilon
zeta
*eta
kappa
tau
pi+
omicron
And then shot it with the following:
#!/usr/bin/env python3
with open("samplein.txt") as file:
    reversing = False
    for line in (raw.strip() for raw in file):
        if reversing:
            print('____', line[::-1], '____')
            reversing = not line.endswith('+')
        else:
            print(line)
            reversing = line.startswith('*')
Which begins reversing lines as it's working through them, until a different condition is met.
Travis Griggs
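To try the latch idea without creating samplein.txt, here is a minimal sketch of the same logic wrapped in a function that accepts any iterable of lines. The name `latch_demo`, the `SAMPLE` constant, and the use of `io.StringIO` are illustrative assumptions, not part of the post above.

```python
import io

# The sample input from the post, held in memory instead of a file.
SAMPLE = """alpha
*beta
gamma+
delta
epsilon
zeta
*eta
kappa
tau
pi+
omicron
"""

def latch_demo(lines):
    """Return the output lines. Lines following a '*' line are reversed,
    up to and including the first line that ends with '+'."""
    out = []
    reversing = False
    for line in (raw.strip() for raw in lines):
        if reversing:
            out.append("____ " + line[::-1] + " ____")
            # Stay latched until a line ending in '+' has been handled.
            reversing = not line.endswith("+")
        else:
            out.append(line)
            # A line starting with '*' switches the latch on.
            reversing = line.startswith("*")
    return out

result = latch_demo(io.StringIO(SAMPLE))
print("\n".join(result))
```

Because there is only one pass and one flag, this suits the non-overlapping case; it does not revisit the lines handled while the latch is on.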
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-09-18 11:39 +0000 |
| Message-ID | <mailman.115.1379504419.18130.python-list@python.org> |
| In reply to | #54369 |
On 18/9/2013 07:21, Chris Angelico wrote:
> On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com> wrote:
>> hi,
>> I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
>> so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
>> please help. I am stuck up on this.
>
> After the inner loop finishes, do you want to go back to where the
> outer loop left off, or should the outer loop continue from the point
> where the inner loop stopped? In other words, do you want to locate
> overlapping sections, or not? Both are possible, but the solutions
> will look somewhat different.
>
In addition, is this really a text file? For binary files, you could
use seek(), and manage things yourself. But that's not strictly legal
in a text file, and may work on one system, not on another.
I'd suggest you open the file twice, and get two file objects. Then you
can iterate over them independently.
Or if the file is under a few hundred meg, just do a readlines, and do
the two iterators over that. That way, the inner loop could just
iterate over a simple slice.
infile = open(.... "rb")
lines = infile.readlines()
infile.close()
for index, line in enumerate(lines):
    # the (up to) 20 lines that follow the current one
    for inner in lines[index+1:index+21]:
        ...
--
DaveA
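A minimal sketch of the readlines-and-slice idea, assuming the whole file fits in memory. The helper name `following_windows` and the `is_match` predicate are hypothetical, introduced only for illustration; the sample list stands in for the result of `readlines()`.

```python
def following_windows(lines, is_match, count=20):
    """Yield (index, line, window) for every matching line, where window
    is a list of up to `count` lines that come right after it."""
    for index, line in enumerate(lines):
        if is_match(line):
            # Slicing past the end of a list is safe; the window is
            # simply shorter near the end of the file.
            yield index, line, lines[index + 1:index + 1 + count]

# Stand-in for infile.readlines()
lines = ["a\n", "*b\n", "c\n", "*d\n", "e\n", "f\n"]
hits = list(following_windows(lines, lambda s: s.startswith("*"), count=2))
for index, line, window in hits:
    print(index, repr(line), window)
```

Since the outer loop walks the list by index and the inner loop only reads a slice, overlapping windows come for free: "*d" appears inside the window of "*b" and is still visited by the outer loop.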
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-09-18 08:56 -0400 |
| Message-ID | <roy-B13238.08561818092013@news.panix.com> |
| In reply to | #54372 |
> > On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com>
> > wrote:
> >> hi,
> >> I want to iterate over the lines of a file and when i find certain lines,
> >> i need another loop starting from the next of that "CERTAIN" line till a
> >> few (say 20) lines later.
> >> so, basically i need two pointers to lines (one for outer loop(for each
> >> line in file)) and one for inner loop. How can i do that in python?
> >> please help. I am stuck up on this.
> [...]
In article <mailman.115.1379504419.18130.python-list@python.org>,
Dave Angel <davea@davea.name> wrote:
[I hope I unwound the multi-layer quoting right]
> In addition, is this really a text file? For binary files, you could
> use seek(), and manage things yourself. But that's not strictly legal
> in a text file, and may work on one system, not on another.
Why is seek() not legal on a text file? The only issue I'm aware of is
the note at http://docs.python.org/2/library/stdtypes.html, which says:
"On Windows, tell() can return illegal values (after an fgets()) when
reading files with Unix-style line-endings. Use binary mode ('rb') to
circumvent this problem."
so, don't do that (i.e. read unix-line-terminated files on windows).
But assuming you're not in that situation, it seems like something like
this should work:
> I'd suggest you open the file twice, and get two file objects. Then you
> can iterate over them independently.
and use tell() to keep them in sync. Something along the lines of (not
tested):
f1 = open("my_file")
f2 = open("my_file")
while True:
    where = f1.tell()
    line = f1.readline()
    if not line:
        break
    if matches_pattern(line):
        f2.seek(where)
        for i in range(20):
            line = f2.readline()
            print line
Except for the specific case noted above (i.e. reading a unix file on a
windows box, so don't do that), it doesn't matter that seek() does funny
things with windows line endings, because tell() does the same funny
things. Doing f2.seek(f1.tell()) will get the two file pointers into
the same place in both files.
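For reference, a minimal Python 3 sketch of the two-handle idea, under a few assumptions: the file is opened in binary mode so that tell() and seek() work on plain byte offsets, the second handle is positioned after the matching line (the original question asked for the lines following it) rather than at its start, and the names `scan_with_two_handles` and `is_match` are hypothetical. The temporary file exists only to make the example self-contained.

```python
import os
import tempfile

def scan_with_two_handles(path, is_match, count=20):
    """Return a list of (matching_line, following_lines) pairs, reading
    the following lines through a second handle on the same file."""
    results = []
    with open(path, "rb") as outer, open(path, "rb") as inner:
        while True:
            line = outer.readline()
            if not line:
                break
            if is_match(line):
                # outer.tell() is now the offset just past the matching
                # line, so the inner handle starts on the next line.
                inner.seek(outer.tell())
                window = []
                for _ in range(count):
                    following = inner.readline()
                    if not following:
                        break
                    window.append(following)
                results.append((line, window))
    return results

# Throwaway sample file for the demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a\n*b\nc\nd\n*e\nf\n")
try:
    found = scan_with_two_handles(tmp.name, lambda s: s.startswith(b"*"), count=2)
finally:
    os.remove(tmp.name)
print(found)
```

The outer handle is never repositioned, so it continues from the line after the match and revisits the lines the inner handle has already read.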
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-09-18 14:09 +0100 |
| Message-ID | <mailman.119.1379509785.18130.python-list@python.org> |
| In reply to | #54379 |
On 18 September 2013 13:56, Roy Smith <roy@panix.com> wrote:
>
>> > On Wed, Sep 18, 2013 at 9:12 PM, nikhil Pandey <nikhilpandey90@gmail.com>
>> > wrote:
>> >> hi,
>> >> I want to iterate over the lines of a file and when i find certain lines,
>> >> i need another loop starting from the next of that "CERTAIN" line till a
>> >> few (say 20) lines later.
>> >> so, basically i need two pointers to lines (one for outer loop(for each
>> >> line in file)) and one for inner loop. How can i do that in python?
>> >> please help. I am stuck up on this.
>> [...]
>
> In article <mailman.115.1379504419.18130.python-list@python.org>,
> Dave Angel <davea@davea.name> wrote:
> [I hope I unwound the multi-layer quoting right]
>> In addition, is this really a text file? For binary files, you could
>> use seek(), and manage things yourself. But that's not strictly legal
>> in a text file, and may work on one system, not on another.
>
> Why is seek() not legal on a text file? The only issue I'm aware of is
> the note at http://docs.python.org/2/library/stdtypes.html, which says:
>
> "On Windows, tell() can return illegal values (after an fgets()) when
> reading files with Unix-style line-endings. Use binary mode ('rb') to
> circumvent this problem."
>
> so, don't do that (i.e. read unix-line-terminated files on windows).
> But assuming you're not in that situation, it seems like something like
> this should work:
>
>> I'd suggest you open the file twice, and get two file objects. Then you
>> can iterate over them independently.
There's no need to use OS resources by opening the file twice or to
screw up the IO caching with seek(). Peter's version holds just as
many lines as is necessary in an internal Python buffer and performs
the minimum possible amount of IO. I would expect this to be more
efficient as well as less error-prone on Windows.
Oscar
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-09-18 10:36 -0400 |
| Message-ID | <mailman.120.1379515006.18130.python-list@python.org> |
| In reply to | #54379 |
> Dave Angel <davea@davea.name> wrote (and I agreed with):
>> I'd suggest you open the file twice, and get two file objects. Then you
>> can iterate over them independently.
On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
> There's no need to use OS resources by opening the file twice or to
> screw up the IO caching with seek().
There's no reason NOT to use OS resources. That's what the OS is there for; to make life easier on application programmers. Opening a file twice costs almost nothing. File descriptors are almost as cheap as whitespace.
> Peter's version holds just as many lines as is necessary in an
> internal Python buffer and performs the minimum possible
> amount of IO.
I believe by "Peter's version", you're talking about:
> from itertools import islice, tee
>
> with open("tmp.txt") as f:
>     while True:
>         for outer in f:
>             print outer,
>             if "*" in outer:
>                 f, g = tee(f)
>                 for inner in islice(g, 3):
>                     print " ", inner,
>                 break
>         else:
>             break
There's this note from http://docs.python.org/2.7/library/itertools.html#itertools.tee:
> This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
I have no idea how that interacts with the pattern above where you call tee() serially. You're basically doing
with open("my_file") as f:
    while True:
        f, g = tee(f)
Are all of those g's just hanging around, eating up memory, while waiting to be garbage collected? I have no idea. But I do know that no such problems exist with the two file descriptor versions.
> I would expect this to be more
> efficient as well as less error-prone on Windows.
>
>
> Oscar
>
---
Roy Smith
roy@panix.com
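For readers who want to experiment with the tee() pattern under discussion, here is a minimal Python 3 sketch of it as a function over any iterable of lines. The names `scan_with_tee` and `is_match` are hypothetical. Dropping the reference to the look-ahead iterator after the inner loop is an assumption about good hygiene, intended to let tee() release buffered items once the remaining iterator has passed them; it is not a measured answer to the memory question raised above.

```python
from itertools import islice, tee

def scan_with_tee(lines, is_match, count=3):
    """Return the visited lines in order; the look-ahead lines after each
    match are indented and are visited again by the outer loop."""
    it = iter(lines)
    out = []
    while True:
        for outer in it:
            out.append(outer)
            if is_match(outer):
                # Split the stream: `it` continues the outer loop,
                # `lookahead` peeks at the next `count` lines.
                it, lookahead = tee(it)
                for inner in islice(lookahead, count):
                    out.append("    " + inner)
                del lookahead  # drop the second iterator once done
                break          # restart the for loop on the new `it`
        else:
            break              # the outer iterator is exhausted
    return out

visited = scan_with_tee(["a", "*b", "c", "d", "e"],
                        lambda s: s.startswith("*"), count=2)
print(visited)
```

The `break` after the inner loop is needed because `it` has been rebound; the enclosing `while True` starts a fresh `for` over the new iterator.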
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-09-18 20:07 +0000 |
| Message-ID | <mailman.134.1379534879.18130.python-list@python.org> |
| In reply to | #54379 |
On 18/9/2013 10:36, Roy Smith wrote:
>> Dave Angel <davea@davea.name> wrote (and I agreed with):
>>> I'd suggest you open the file twice, and get two file objects. Then you
>>> can iterate over them independently.
>
>
> On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
>> There's no need to use OS resources by opening the file twice or to
>> screw up the IO caching with seek().
>
> There's no reason NOT to use OS resources. That's what the OS is there for; to make life easier on application programmers. Opening a file twice costs almost nothing. File descriptors are almost as cheap as whitespace.
>
>> Peter's version holds just as many lines as is necessary in an
>> internal Python buffer and performs the minimum possible
>> amount of IO.
>
> I believe by "Peter's version", you're talking about:
>
>> from itertools import islice, tee
>>
>> with open("tmp.txt") as f:
>>     while True:
>>         for outer in f:
>>             print outer,
>>             if "*" in outer:
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print " ", inner,
>>                 break
>>         else:
>>             break
>
>
> There's this note from http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>
>> This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
>
>
> I have no idea how that interacts with the pattern above where you call tee() serially. You're basically doing
>
> with open("my_file") as f:
>     while True:
>         f, g = tee(f)
>
> Are all of those g's just hanging around, eating up memory, while waiting to be garbage collected? I have no idea. But I do know that no such problems exist with the two file descriptor versions.
>
>
>
>
>
>
>> I would expect this to be more
>> efficient as well as less error-prone on Windows.
>>
>>
>> Oscar
>>
>
>
> ---
> Roy Smith
> roy@panix.com
>
>
>
>
>
>
And if you're willing to ignore the possibility that the text file has
unix line endings, I'm willing to ignore the possibility that the text
file has a huge number of lines. Everything is MUCH simpler if one
assumes readlines() will work. Most of these other approaches are much
more complex than the OP probably needs, if he ever gets around to
actually describing his requirements.
BTW, please post in plain text; all that HTML is really annoying.
--
DaveA
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-09-19 09:23 +0200 |
| Message-ID | <mailman.146.1379575403.18130.python-list@python.org> |
| In reply to | #54379 |
Roy Smith wrote:
>> Dave Angel <davea@davea.name> wrote (and I agreed with):
>>> I'd suggest you open the file twice, and get two file objects. Then you
>>> can iterate over them independently.
>
>
> On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
>> There's no need to use OS resources by opening the file twice or to
>> screw up the IO caching with seek().
>
> There's no reason NOT to use OS resources. That's what the OS is there
> for; to make life easier on application programmers. Opening a file twice
> costs almost nothing. File descriptors are almost as cheap as whitespace.
>
>> Peter's version holds just as many lines as is necessary in an
>> internal Python buffer and performs the minimum possible
>> amount of IO.
>
> I believe by "Peter's version", you're talking about:
>
>> from itertools import islice, tee
>>
>> with open("tmp.txt") as f:
>>     while True:
>>         for outer in f:
>>             print outer,
>>             if "*" in outer:
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print " ", inner,
                   del g # a good idea in the general case
>>                 break
>>         else:
>>             break
>
>
> There's this note from
> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>
>> This itertool may require significant auxiliary storage (depending on how
>> much temporary data needs to be stored). In general, if one iterator uses
>> most or all of the data before another iterator starts, it is faster to
>> use list() instead of tee().
>
>
> I have no idea how that interacts with the pattern above where you call
> tee() serially.
As I understand it the above says that

items = infinite()
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass

stores 1000 items and can go on forever, but

items = infinite()
a, b = tee(items)
for item in a:
    pass

will consume unbounded memory and that if items is finite using a list
instead of tee is more efficient. The documentation says nothing about

items = infinite()
a, b = tee(items)
del a
for item in b:
    pass

so you have to trust Mr Hettinger or come up with a test case...
> You're basically doing
>
> with open("my_file") as f:
>     while True:
>         f, g = tee(f)
>
> Are all of those g's just hanging around, eating up memory, while waiting
> to be garbage collected? I have no idea.
I'd say you've just devised a nice test to find out ;)
> But I do know that no such
> problems exist with the two file descriptor versions.
The trade-offs are different. My version works with arbitrary iterators
(think stdin), but will consume unbounded amounts of memory when the inner
loop doesn't stop.
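
The `del a` case raised above can be checked with a small experiment. What follows is only a minimal sketch for Python 3 on CPython: `itertools.count()` stands in for the hypothetical `infinite()`, and the helper name `peak_bytes` and the item count are assumptions made for illustration, not something from the thread. It compares the peak traced allocation while draining one tee branch, with the other branch kept alive versus deleted.

```python
import tracemalloc
from itertools import count, islice, tee


def peak_bytes(keep_other_branch, n=100_000):
    """Return peak traced memory while draining n items from one tee branch."""
    tracemalloc.start()
    a, b = tee(count())  # count() plays the role of an endless source
    if not keep_other_branch:
        del a  # drop the only reference to the branch that would lag behind
    for _ in islice(b, n):
        pass
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


kept = peak_bytes(keep_other_branch=True)
dropped = peak_bytes(keep_other_branch=False)
print("peak bytes with both branches alive:", kept)
print("peak bytes after 'del a':           ", dropped)
```

If the buffer is released once a branch is dropped, the second figure should be far smaller than the first. Exact numbers depend on the interpreter version and platform.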
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-09-19 15:16 +0100 |
| Message-ID | <mailman.148.1379600226.18130.python-list@python.org> |
| In reply to | #54379 |
On 19 September 2013 08:23, Peter Otten <__peter__@web.de> wrote:
> Roy Smith wrote:
>>
>> I believe by "Peter's version", you're talking about:
>>
>>> from itertools import islice, tee
>>>
>>> with open("tmp.txt") as f:
>>>     while True:
>>>         for outer in f:
>>>             print outer,
>>>             if "*" in outer:
>>>                 f, g = tee(f)
>>>                 for inner in islice(g, 3):
>>>                     print " ", inner,
>                   del g # a good idea in the general case
>>>                 break
>>>         else:
>>>             break
>>
>> There's this note from
>> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>>
>>> This itertool may require significant auxiliary storage (depending on how
>>> much temporary data needs to be stored). In general, if one iterator uses
>>> most or all of the data before another iterator starts, it is faster to
>>> use list() instead of tee().
This is referring to the case where your two iterators get out of sync
by a long way. If you only consume 3 extra items it will just store
those 3 items in a list.
>> I have no idea how that interacts with the pattern above where you call
>> tee() serially.
Fair point.
>
> As I understand it the above says that
>
> items = infinite()
> a, b = tee(items)
> for item in islice(a, 1000):
>     pass
> for pair in izip(a, b):
>     pass
>
> stores 1000 items and can go on forever, but
>
> items = infinite()
> a, b = tee(items)
> for item in a:
>     pass
>
> will consume unbounded memory and that if items is finite using a list
> instead of tee is more efficient. The documentation says nothing about
>
> items = infinite()
> a, b = tee(items)
> del a
> for item in b:
>     pass
>
> so you have to trust Mr Hettinger or come up with a test case...
>
>> You're basically doing
>>
>> with open("my_file") as f:
>>     while True:
>>         f, g = tee(f)
>>
>> Are all of those g's just hanging around, eating up memory, while waiting
>> to be garbage collected? I have no idea.
>
> I'd say you've just devised a nice test to find out ;)
$ cat tee.py
#!/usr/bin/env python

import sys
from itertools import tee

items = iter(range(int(sys.argv[1])))

while True:
    for x in items:
        items, discard = tee(items)
        break
    else:
        break

print(x)

$ time py -3.3 ./tee.py 100000000
99999999

real 1m47.711s
user 0m0.015s
sys 0m0.000s

While running the above python.exe was using 6MB of memory (according
to Task Manager). I believe this is because tee() works as follows
(which I made up but it's how I imagine it).
When you call tee(iterator) it creates two _tee objects and one
_teelist object. The _teelist object stores all of the items that have
been seen by only one of _tee1 and _tee2, a reference to iterator and
a flag indicating which _tee object has seen more items. When say
_tee2 is deallocated the _teelist becomes singly owned and no longer
needs to ever accumulate items (so it doesn't). So the dereferenced
discard will not cause an arbitrary growth in memory usage.
There is a separate problem which is that if you call tee() multiple
times then you end up with a chain of tees and each next call would go
through each one of them. This would cause a linear growth in the time
taken to call next() leading to quadratic time performance overall.
However, this does not occur with the script I showed above. In
principle it's possible for a _tee object to realise that there is a
chain of singly owned _tee and _teelist objects and bypass them
calling next() on the original iterator but I don't know if this is
what happens.
However, when I ran the above script on Python 2.7 it did consume
massive amounts of memory (1.6GB) and ran slower so maybe this depends
on optimisations that were introduced in 3.x.
Here's an alternate iterator recipe that doesn't depend on these optimisations:

from itertools import islice
from collections import deque

class Peekable(object):

    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.peeked = deque()

    def __iter__(self):
        while True:
            while self.peeked:
                yield self.peeked.popleft()
            yield next(self.iterator)

    def peek(self):
        for p in self.peeked:
            yield p
        for val in self.iterator:
            self.peeked.append(val)
            yield val

with open("tmp.txt") as f:
    f = Peekable(f)
    for outer in f:
        print outer,
        if "*" in outer:
            for inner in islice(f.peek(), 3):
                print " ", inner,
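
For readers on Python 3, here is a hedged adaptation of the recipe above, not a drop-in copy. Since Python 3.7 a StopIteration that escapes a generator body is turned into RuntimeError, so this sketch implements `__next__` directly instead of writing `__iter__` as a generator. The helper `scan` and the in-memory sample lines are assumptions added so the example can run without a file.

```python
from collections import deque
from itertools import islice


class Peekable:
    """Iterator wrapper that can look ahead without consuming items."""

    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.peeked = deque()  # items already read ahead but not yet consumed

    def __iter__(self):
        return self

    def __next__(self):
        if self.peeked:
            return self.peeked.popleft()
        return next(self.iterator)  # StopIteration here ends iteration normally

    def peek(self):
        # Replay what has been read ahead, then read further and remember it.
        for item in list(self.peeked):
            yield item
        for item in self.iterator:
            self.peeked.append(item)
            yield item


def scan(lines, marker="*", count=3):
    """Collect each line, followed by up to `count` look-ahead lines after a marker."""
    out = []
    source = Peekable(lines)
    for outer in source:
        out.append(outer)
        if marker in outer:
            out.extend("  " + inner for inner in islice(source.peek(), count))
    return out


sample = ["alpha", "*beta", "*gamma", "delta", "epsilon", "zeta", "*eta"]
peeked_result = scan(sample)
print("\n".join(peeked_result))
```

Memory use is bounded by how far ahead `peek()` is allowed to run, since only the read-ahead items sit in the deque.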
Oscar
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-09-19 16:38 +0200 |
| Message-ID | <mailman.150.1379601478.18130.python-list@python.org> |
| In reply to | #54379 |
Oscar Benjamin wrote:
> $ cat tee.py
> #!/usr/bin/env python
>
> import sys
> from itertools import tee
>
> items = iter(range(int(sys.argv[1])))
>
> while True:
>     for x in items:
>         items, discard = tee(items)
>         break
>     else:
>         break
>
> print(x)
>
> $ time py -3.3 ./tee.py 100000000
> 99999999
>
> real 1m47.711s
> user 0m0.015s
> sys 0m0.000s
>
> While running the above python.exe was using 6MB of memory (according
> to Task Manager). I believe this is because tee() works as follows
> (which I made up but it's how I imagine it).
[...]
> However, when I ran the above script on Python 2.7 it did consume
> massive amounts of memory (1.6GB) and ran slower so maybe this depends
> on optimisations that were introduced in 3.x.
Did you use xrange()?
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-09-19 15:48 +0100 |
| Message-ID | <mailman.152.1379602164.18130.python-list@python.org> |
| In reply to | #54379 |
On 19 September 2013 15:38, Peter Otten <__peter__@web.de> wrote:
>> While running the above python.exe was using 6MB of memory (according
>> to Task Manager). I believe this is because tee() works as follows
>> (which I made up but it's how I imagine it).
>
> [...]
>
>> However, when I ran the above script on Python 2.7 it did consume
>> massive amounts of memory (1.6GB) and ran slower so maybe this depends
>> on optimisations that were introduced in 3.x.
>
> Did you use xrange()?
No I didn't. :)
Okay so it only uses 4.6MB of memory and it runs at the same speed:
there's no problem with chaining tee objects as long as you discard
them. If you don't discard them then a script like the one I wrote
would quickly blow all the system memory.
Oscar
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-09-18 13:44 +0200 |
| Message-ID | <mailman.116.1379504643.18130.python-list@python.org> |
| In reply to | #54369 |
nikhil Pandey wrote:
> hi,
> I want to iterate over the lines of a file and when i find certain lines,
> i need another loop starting from the next of that "CERTAIN" line till a
> few (say 20) lines later. so, basically i need two pointers to lines (one
> for outer loop(for each line in file)) and one for inner loop. How can i
> do that in python? please help. I am stuck up on this.
Here's an example that prints the three lines following a line containing a
'*':

Example data:

$ cat tmp.txt
alpha
*beta
*gamma
delta
epsilon
zeta
*eta

The python script:

$ cat tmp.py
from itertools import islice, tee

with open("tmp.txt") as f:
    while True:
        for outer in f:
            print outer,
            if "*" in outer:
                f, g = tee(f)
                for inner in islice(g, 3):
                    print " ", inner,
                break
        else:
            break

The script's output:

$ python tmp.py
alpha
*beta
  *gamma
  delta
  epsilon
*gamma
  delta
  epsilon
  zeta
delta
epsilon
zeta
*eta
$

As you can see the general logic is relatively complex; it is likely that we
can come up with a simpler solution if you describe your actual requirement
in more detail.
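
For anyone trying this on Python 3, the same tee-based logic might be sketched as below. This is an adaptation under assumptions rather than the script as posted: the function name `show_following`, its `marker` and `count` parameters, and the in-memory sample list are all introduced for illustration, and the lines are collected in a list instead of printed so the result is easy to inspect.

```python
from itertools import islice, tee


def show_following(lines, marker="*", count=3):
    """Return each line, plus up to `count` following lines after a marker line."""
    lines = iter(lines)
    out = []
    while True:
        for outer in lines:
            out.append(outer)
            if marker in outer:
                # Split the stream: `lines` continues the outer loop,
                # `lookahead` runs ahead without consuming from `lines`.
                lines, lookahead = tee(lines)
                out.extend("  " + inner for inner in islice(lookahead, count))
                del lookahead  # let tee release its buffer bookkeeping
                break  # restart the for loop on the rebound `lines`
        else:
            break  # outer loop ran to exhaustion
    return out


sample = ["alpha", "*beta", "*gamma", "delta", "epsilon", "zeta", "*eta"]
shown = show_following(sample)
print("\n".join(shown))
```

The `break` after rebinding `lines` matters: the `for` statement keeps iterating over the old object, so the loop has to be restarted to pick up the new branch returned by `tee()`.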
| From | nikhil Pandey <nikhilpandey90@gmail.com> |
|---|---|
| Date | 2013-09-18 05:14 -0700 |
| Message-ID | <0142cea2-e534-47d0-92fe-79a87068c497@googlegroups.com> |
| In reply to | #54374 |
On Wednesday, September 18, 2013 5:14:10 PM UTC+5:30, Peter Otten wrote:
> nikhil Pandey wrote:
>
> > hi,
> > I want to iterate over the lines of a file and when i find certain lines,
> > i need another loop starting from the next of that "CERTAIN" line till a
> > few (say 20) lines later. so, basically i need two pointers to lines (one
> > for outer loop(for each line in file)) and one for inner loop. How can i
> > do that in python? please help. I am stuck up on this.
>
> Here's an example that prints the three lines following a line containing a
> '*':
>
> Example data:
>
> $ cat tmp.txt
> alpha
> *beta
> *gamma
> delta
> epsilon
> zeta
> *eta
>
> The python script:
>
> $ cat tmp.py
> from itertools import islice, tee
>
> with open("tmp.txt") as f:
>     while True:
>         for outer in f:
>             print outer,
>             if "*" in outer:
>                 f, g = tee(f)
>                 for inner in islice(g, 3):
>                     print " ", inner,
>                 break
>         else:
>             break
>
> The script's output:
>
> $ python tmp.py
> alpha
> *beta
>   *gamma
>   delta
>   epsilon
> *gamma
>   delta
>   epsilon
>   zeta
> delta
> epsilon
> zeta
> *eta
> $
>
> As you can see the general logic is relatively complex; it is likely that we
> can come up with a simpler solution if you describe your actual requirement
> in more detail.
Hi,
I want to iterate in the inner loop by reading each line till some condition is met. How can I do that? Thanks for this code.
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-09-18 14:54 +0200 |
| Message-ID | <mailman.118.1379508875.18130.python-list@python.org> |
| In reply to | #54376 |
nikhil Pandey wrote:
> On Wednesday, September 18, 2013 5:14:10 PM UTC+5:30, Peter Otten wrote:
> I want to iterate in the inner loop by reading each line till some
> condition is met. How can I do that? Thanks for this code.
That's not what I had in mind when I asked you to
>> describe your actual requirement in more detail.
Anyway, change

[...]
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print " ", inner,
>>                 break
[...]

to

                f, g = tee(f)
                for inner in g:
                    if some condition:
                        break
                    print " ", inner,
                break

in my example.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-09-19 02:40 +0000 |
| Message-ID | <523a6417$0$29988$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #54376 |
On Wed, 18 Sep 2013 05:14:23 -0700, nikhil Pandey wrote:
> I want to iterate in the inner loop by reading each line till some
> condition is met. How can I do that? Thanks for this code.
while not condition:
    read line
Re-write using Python syntax, and you are done.
--
Steven
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-09-19 02:56 +0000 |
| Message-ID | <523a67c3$0$29988$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #54369 |
On Wed, 18 Sep 2013 04:12:05 -0700, nikhil Pandey wrote:
> hi,
> I want to iterate over the lines of a file and when i find certain
> lines, i need another loop starting from the next of that "CERTAIN" line
> till a few (say 20) lines later. so, basically i need two pointers to
> lines (one for outer loop(for each line in file)) and one for inner
> loop. How can i do that in python? please help. I am stuck up on this.
No, you don't "need" two pointers to lines. That is just one way to solve
this problem. You can solve it many ways.
One way, for small files (say, under one million lines), is to read the
whole file into a list, then have two pointers to a line:
lines = file.readlines()
p = q = 0
while p < len(lines):
    print(lines[p])
    p += 1
then advance the pointers p and q as needed. This is the most flexible
way to do it: you can have as many pointers as needed, you can back-
track, jump forward, jump back, and it is all high-speed random-access
memory accesses. Except for the initial readlines, none of it is slow I/O
processing.
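
As a rough illustration of the list-with-two-pointers idea, here is a minimal Python 3 sketch applied to the sample data posted earlier in the thread. The function name `with_following` and its `marker` and `count` parameters are assumptions made for the example; a real program would take its lines from `readlines()` rather than a literal list.

```python
def with_following(lines, marker="*", count=3):
    """Return each line, plus up to `count` following lines after a marker line."""
    out = []
    p = 0  # outer pointer: walks every line once
    while p < len(lines):
        out.append(lines[p])
        if marker in lines[p]:
            q = p + 1  # inner pointer: starts just after the marker line
            while q < len(lines) and q <= p + count:
                out.append("  " + lines[q])
                q += 1
        p += 1
    return out


sample = ["alpha", "*beta", "*gamma", "delta", "epsilon", "zeta", "*eta"]
listed = with_following(sample)
print("\n".join(listed))
```

Because `q` is just an integer, the inner loop never disturbs the outer one, and the outer pointer could equally be moved backwards or skipped forwards.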
Another solution is to use a state-machine:
for line in somefile:
    if state == SCANNING:
        do_something()
    elif state == PROCESSING:
        do_something_else()
    elif state == WOBBLING:
        wobble()
    state = adjust_state(line)
You can combine the two, of course, and have a state machine with
multiple pointers to a list of lines.
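
To make the state-machine outline concrete, here is one possible Python 3 sketch with just two states. Everything specific in it is an assumption for illustration: the `BEGIN` and `END` marker lines, the function name `collect_sections`, and the choice to gather the lines of each section into a list.

```python
SCANNING, PROCESSING = "scanning", "processing"


def collect_sections(lines, is_start, is_end):
    """Group the lines found between start and end markers into sections."""
    state = SCANNING
    sections = []
    for line in lines:
        if state == SCANNING:
            if is_start(line):
                sections.append([])  # open a new, empty section
                state = PROCESSING
        elif state == PROCESSING:
            if is_end(line):
                state = SCANNING  # close the section, resume scanning
            else:
                sections[-1].append(line)
    return sections


sample = ["a", "BEGIN", "b", "c", "END", "d", "BEGIN", "e"]
sections = collect_sections(
    sample,
    is_start=lambda line: line == "BEGIN",
    is_end=lambda line: line == "END",
)
print(sections)
```

This reads the input exactly once and works on any iterable of lines, but it cannot express overlapping sections, since a single pass has no way to revisit a line.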
Using itertools.tee, you can potentially combine these solutions with the
straightforward for-loop over a list. The danger of itertools.tee is that
it may use as much memory as reading the entire file into memory at once,
but the benefit is that it may use much less. But personally, I find
list-based processing with random access by index much easier to understand
than itertools.tee solutions.
--
Steven
| From | Joshua Landau <joshua@landau.ws> |
|---|---|
| Date | 2013-09-19 08:04 +0100 |
| Message-ID | <mailman.145.1379574349.18130.python-list@python.org> |
| In reply to | #54415 |
Although "tee" is most certainly preferable because IO is far slower
than the small amounts of memory "tee" will use, you do have this
option:
def iterate_file_lines(file):
    """
    Iterate over lines in a file, unlike normal
    iteration this allows seeking.
    """
    while True:
        line = file.readline()
        if not line:
            break
        yield line

thefile = open("/tmp/thefile")
thelines = iterate_file_lines(thefile)

# is_start() and is_end() are placeholders for your own tests
for line in thelines:
    print("Outer:", repr(line))
    if is_start(line):
        outer_position = thefile.tell()
        for line in thelines:
            print("Inner:", repr(line))
            if is_end(line):
                break
        thefile.seek(outer_position)
It's simpler than having two files but probably not faster, and it
forfeits being able to change up the order of these things. "tee" will
almost definitely be a much better choice (unless the subsections can't
fit in memory).
If you want to change up the order to another defined order, you can
think about storing the subsections, but if you want to support
independent iteration you'll need to seek before every "readline"
which is a bit silly.
Basically, read it all into memory like Steven D'Aprano suggested. If
you really don't want to, use "tee". If you can't handle non-constant
memory usage (really? You're reading lines, man) I'd suggest my
method. If you can't handle the inflexibility there, use multiple
files.
There, is that enough choices?