Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #54422
| Path | csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder5.xlned.com!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <oscar.j.benjamin@gmail.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| X-Spam-Status | OK 0.000 |
| X-Spam-Evidence | '*H*': 1.00; '*S*': 0.00; 'else:': 0.03; '(so': 0.07; 'amounts': 0.07; 'discard': 0.07; 'hettinger': 0.07; 'indicating': 0.07; 'referring': 0.07; 'subject:file': 0.07; 'subject:two': 0.07; 'sys': 0.07; 'happens.': 0.09; 'linear': 0.09; 'val': 0.09; 'cc:addr:python-list': 0.11; 'python': 0.11; 'def': 0.12; '2.7': 0.14; 'creates': 0.14; '"*"': 0.16; '1000):': 0.16; '100000000': 0.16; '3):': 0.16; 'b):': 0.16; 'cc:name:python list': 0.16; 'collections': 0.16; 'deallocated': 0.16; 'deque': 0.16; 'finite': 0.16; 'garbage': 0.16; 'iterator': 0.16; 'iterators': 0.16; 'itertools': 0.16; 'recipe': 0.16; 'roy': 0.16; 'sync': 0.16; 'to:addr:web.de': 0.16; 'true:': 0.16; 'wrote:': 0.18; 'basically': 0.19; 'items.': 0.19; '>>>': 0.22; 'memory': 0.22; 'import': 0.22; 'separate': 0.22; 'cc:addr:python.org': 0.22; 'print': 0.22; 'alternate': 0.24; 'cc:2**0': 0.24; 'script': 0.25; 'task': 0.26; 'pass': 0.26; 'header:In-Reply-To:1': 0.27; 'idea': 0.28; "doesn't": 0.30; 'message-id:@mail.gmail.com': 0.30; '(which': 0.31; 'follows': 0.31; 'object.': 0.31; 'class': 0.32; 'another': 0.32; 'says': 0.33; 'url:python': 0.33; 'running': 0.33; 'becomes': 0.33; 'skip:# 10': 0.33; 'skip:_ 10': 0.34; 'maybe': 0.34; "i'd": 0.34; 'problem': 0.35; 'subject:with': 0.35; 'objects': 0.35; 'point.': 0.35; 'test': 0.35; 'but': 0.35; 'received:google.com': 0.35; 'there': 0.35; 'yield': 0.36; 'doing': 0.36; 'next': 0.36; 'possible': 0.36; 'url:org': 0.36; 'two': 0.37; 'list': 0.37; 'list.': 0.37; 'performance': 0.37; 'depends': 0.38; 'massive': 0.38; 'url:library': 0.38; 'does': 0.39; 'how': 0.40; 'chain': 0.60; 'most': 0.60; 'break': 0.61; 'introduced': 0.61; "you're": 0.61; 'times': 0.62; "you've": 0.63; 'real': 0.63; 'skip:n 10': 0.64; 'more': 0.64; 'occur': 0.65; 'talking': 0.65; 'temporary': 0.65; 'believe': 0.68; 'smith': 0.68; '(according': 0.84; 'around,': 0.84; 'hanging': 0.84; 'manager).': 0.84; 'oscar': 0.84; 'otten': 0.84; 'subject:over': 0.84; 'usage.': 0.84; 'items,': 0.91; 'imagine': 0.93; '2013': 0.98 |
| DKIM-Signature | v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=SZTuzVJ3lfmz3EhQROe6rHNABFjB0z4ntxTbOdDDPVQ=; b=NJ14O2ndEZk0TfPza39qQwkVyCt8zbfxDliGTSisBgbD9XgZpqPDMkG5i2G/j6W0xt o3hLURTH38x3DXZbsROlWsSEQVGAwloxzejMQ9sREasHVvCBTJ9RKauYYWRSrqJdhId1 1QBMQgi6J/VcfdhqFDiHc9ysSjIAVZXnWz9hhZWU70blAA85NqUd5ARiYH6FM3Rxf4OR lFB/jymm9UIlBfBWNtZDwYeF1qb9iEODGuRoMx7auuBbeTMJqgCevTJS/sIxVLCuJ63+ tn9i0bKNY6SDxVBNr6kufCkTzP/U2G0eSdqt8Ehjk2PkXcyXzDNQfFc6/GShpQ1H5+eW xNJQ== |
| X-Received | by 10.58.133.66 with SMTP id pa2mr1410337veb.18.1379600223253; Thu, 19 Sep 2013 07:17:03 -0700 (PDT) |
| MIME-Version | 1.0 |
| In-Reply-To | <l1e8ol$99q$1@ger.gmane.org> |
| References | <3018b3d4-f914-4c89-9f26-cd4b2af32e73@googlegroups.com> <CAPTjJmoyrJqVR29MeDzcfA9K=gGgHSuqO3uCNXGLQs7APLJByA@mail.gmail.com> <mailman.115.1379504419.18130.python-list@python.org> <roy-B13238.08561818092013@news.panix.com> <CAHVvXxQa6rsrD669kL-EeqCQFn3jKH-k=eWY5iey4RwVBD2RiA@mail.gmail.com> <52B7F7EA-C7C4-4DB6-A93C-25F4C058EB58@panix.com> <l1e8ol$99q$1@ger.gmane.org> |
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
| Date | Thu, 19 Sep 2013 15:16:43 +0100 |
| Subject | Re: iterating over a file with two pointers |
| To | Peter Otten <__peter__@web.de> |
| Content-Type | text/plain; charset=ISO-8859-1 |
| Cc | Python List <python-list@python.org> |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.15 |
| Precedence | list |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list/> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.148.1379600226.18130.python-list@python.org> (permalink) |
| Lines | 160 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1379600226 news.xs4all.nl 15945 [2001:888:2000:d::a6]:47610 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | csiph.com comp.lang.python:54422 |
Show key headers only | View raw
On 19 September 2013 08:23, Peter Otten <__peter__@web.de> wrote:
> Roy Smith wrote:
>>
>> I believe by "Peter's version", you're talking about:
>>
>>> from itertools import islice, tee
>>>
>>> with open("tmp.txt") as f:
>>> while True:
>>> for outer in f:
>>> print outer,
>>> if "*" in outer:
>>> f, g = tee(f)
>>> for inner in islice(g, 3):
>>> print " ", inner,
> del g # a good idea in the general case
>>> break
>>> else:
>>> break
>>
>> There's this note from
>> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>>
>>> This itertool may require significant auxiliary storage (depending on how
>>> much temporary data needs to be stored). In general, if one iterator uses
>>> most or all of the data before another iterator starts, it is faster to
>>> use list() instead of tee().
This is referring to the case where your two iterators get out of sync
by a long way. If you only consume 3 extra items it will just store
those 3 items in a list.
>> I have no idea how that interacts with the pattern above where you call
>> tee() serially.
Fair point.
>
> As I understand it the above says that
>
> items = infinite()
> a, b = tee(items)
> for item in islice(a, 1000):
> pass
> for pair in izip(a, b):
> pass
>
> stores 1000 items and can go on forever, but
>
> items = infinite()
> a, b = tee(items)
> for item in a:
> pass
>
> will consume unbounded memory and that if items is finite using a list
> instead of tee is more efficient. The documentation says nothing about
>
> items = infinite()
> a, b = tee(items)
> del a
> for item in b:
> pass
>
> so you have to trust Mr Hettinger or come up with a test case...
>
>> You're basically doing
>>
>> with open("my_file") as f:
>> while True:
>> f, g = tee(f)
>>
>> Are all of those g's just hanging around, eating up memory, while waiting
>> to be garbage collected? I have no idea.
>
> I'd say you've just devised a nice test to find out ;)
$ cat tee.py
#!/usr/bin/env python
import sys
from itertools import tee
items = iter(range(int(sys.argv[1])))
while True:
for x in items:
items, discard = tee(items)
break
else:
break
print(x)
$ time py -3.3 ./tee.py 100000000
99999999
real 1m47.711s
user 0m0.015s
sys 0m0.000s
While running the above python.exe was using 6MB of memory (according
to Task Manager). I believe this is because tee() works as follows
(which I made up but it's how I imagine it).
When you call tee(iterator) it creates two _tee objects and one
_teelist object. The _teelist object stores all of the items that have
been seen by only one of _tee1 and _tee2, a reference to iterator and
a flag indicating which _tee object has seen more items. When say
_tee2 is deallocated the _teelist becomes singly owned and no longer
needs to ever accumulate items (so it doesn't). So the dereferenced
discard will not cause an arbitrary growth in memory usage.
There is a separate problem which is that if you call tee() multiple
times then you end up with a chain of tees and each next call would go
through each one of them. This would cause a linear growth in the time
taken to call next() leading to quadratic time performance overall.
However, this does not occur with the script I showed above. In
principle it's possible for a _tee object to realise that there is a
chain of singly owned _tee and _teelist objects and bypass them
calling next() on the original iterator but I don't know if this is
what happens.
However, when I ran the above script on Python 2.7 it did consume
massive amounts of memory (1.6GB) and ran slower so maybe this depends
on optimisations that were introduced in 3.x.
Here's an alternate iterator recipe that doesn't depend on these optimisations:
from itertools import islice
from collections import deque
class Peekable(object):
def __init__(self, iterable):
self.iterator = iter(iterable)
self.peeked = deque()
def __iter__(self):
while True:
while self.peeked:
yield self.peeked.popleft()
yield next(self.iterator)
def peek(self):
for p in self.peeked:
yield p
for val in self.iterator:
self.peeked.append(val)
yield val
with open("tmp.txt") as f:
f = Peekable(f)
for outer in f:
print outer,
if "*" in outer:
for inner in islice(f.peek(), 3):
print " ", inner,
Oscar
Back to comp.lang.python | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 04:12 -0700
Re: iterating over a file with two pointers Chris Angelico <rosuav@gmail.com> - 2013-09-18 21:21 +1000
Re: iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 05:07 -0700
Re: iterating over a file with two pointers Travis Griggs <travisgriggs@gmail.com> - 2013-09-18 09:18 -0700
Re: iterating over a file with two pointers Dave Angel <davea@davea.name> - 2013-09-18 11:39 +0000
Re: iterating over a file with two pointers Roy Smith <roy@panix.com> - 2013-09-18 08:56 -0400
Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-18 14:09 +0100
Re: iterating over a file with two pointers Roy Smith <roy@panix.com> - 2013-09-18 10:36 -0400
Re: iterating over a file with two pointers Dave Angel <davea@davea.name> - 2013-09-18 20:07 +0000
Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-19 09:23 +0200
Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-19 15:16 +0100
Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-19 16:38 +0200
Re: iterating over a file with two pointers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-09-19 15:48 +0100
Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-18 13:44 +0200
Re: iterating over a file with two pointers nikhil Pandey <nikhilpandey90@gmail.com> - 2013-09-18 05:14 -0700
Re: iterating over a file with two pointers Peter Otten <__peter__@web.de> - 2013-09-18 14:54 +0200
Re: iterating over a file with two pointers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-19 02:40 +0000
Re: iterating over a file with two pointers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-19 02:56 +0000
Re: iterating over a file with two pointers Joshua Landau <joshua@landau.ws> - 2013-09-19 08:04 +0100
csiph-web