Groups > comp.lang.python > #33107 > unrolled thread

A gnarly little python loop

Started by	Roy Smith <roy@panix.com>
First post	2012-11-10 17:58 -0500
Last post	2012-11-12 20:14 -0800
Articles	10 — 8 participants


  A gnarly little python loop Roy Smith <roy@panix.com> - 2012-11-10 17:58 -0500
    Re: A gnarly little python loop Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-10 16:17 -0700
    Re: A gnarly little python loop Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-11-11 00:23 +0000
    Re: A gnarly little python loop Steve Howell <showell@domaintools.com> - 2012-11-10 19:03 -0800
      Re: A gnarly little python loop Stefan Behnel <stefan_ml@behnel.de> - 2012-11-11 08:56 +0100
    Re: A gnarly little python loop rusi <rustompmody@gmail.com> - 2012-11-11 23:09 -0800
      Re: A gnarly little python loop rusi <rustompmody@gmail.com> - 2012-11-12 07:21 -0800
        Re: A gnarly little python loop Peter Otten <__peter__@web.de> - 2012-11-12 16:49 +0100
        Re: A gnarly little python loop Steve Howell <showell30@yahoo.com> - 2012-11-12 08:09 -0800
          Re: A gnarly little python loop rusi <rustompmody@gmail.com> - 2012-11-12 20:14 -0800

#33107 — A gnarly little python loop

From	Roy Smith <roy@panix.com>
Date	2012-11-10 17:58 -0500
Subject	A gnarly little python loop
Message-ID	<roy-9EBEAD.17581410112012@news.panix.com>

I'm trying to pull down tweets with one of the many twitter APIs.  The 
particular one I'm using (python-twitter), has a call:

data = api.GetSearch(term="foo", page=page)

The way it works, you start with page=1.  It returns a list of tweets.  
If the list is empty, there are no more tweets.  If the list is not 
empty, you can try to get more tweets by asking for page=2, page=3, etc.  
I've got:

    page = 1
    while 1:
        r = api.GetSearch(term="foo", page=page)
        if not r:
            break
        for tweet in r:
            process(tweet)
        page += 1

It works, but it seems excessively fidgety.  Is there some cleaner way 
to refactor this?


#33108

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-11-10 16:17 -0700
Message-ID	<mailman.3546.1352589460.27098.python-list@python.org>
In reply to	#33107

On Sat, Nov 10, 2012 at 3:58 PM, Roy Smith <roy@panix.com> wrote:
> I'm trying to pull down tweets with one of the many twitter APIs.  The
> particular one I'm using (python-twitter), has a call:
>
> data = api.GetSearch(term="foo", page=page)
>
> The way it works, you start with page=1.  It returns a list of tweets.
> If the list is empty, there are no more tweets.  If the list is not
> empty, you can try to get more tweets by asking for page=2, page=3, etc.
> I've got:
>
>     page = 1
>     while 1:
>         r = api.GetSearch(term="foo", page=page)
>         if not r:
>             break
>         for tweet in r:
>             process(tweet)
>         page += 1
>
> It works, but it seems excessively fidgety.  Is there some cleaner way
> to refactor this?

I'd do something like this:

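import itertools

def get_tweets(term):
    # Fetch pages lazily, starting at page 1, until an empty page comes back.
    for page in itertools.count(1):
        r = api.GetSearch(term=term, page=page)
        if not r:
            break
        for tweet in r:
            yield tweet

for tweet in get_tweets("foo"):
    process(tweet)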
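
To try the generator approach without Twitter credentials, here is a minimal Python 3 sketch. `FakeApi` and its canned pages are made up for illustration; only the `GetSearch(term=..., page=...)` call shape is taken from the original post.

```python
import itertools


class FakeApi:
    """Hypothetical stand-in for the python-twitter client (not the real API)."""

    def __init__(self, pages):
        self._pages = pages

    def GetSearch(self, term, page):
        # Pages are 1-based; anything past the last page is an empty list.
        if 1 <= page <= len(self._pages):
            return self._pages[page - 1]
        return []


def get_tweets(api, term):
    """Yield tweets one at a time, fetching pages lazily until one is empty."""
    for page in itertools.count(1):
        results = api.GetSearch(term=term, page=page)
        if not results:
            return
        yield from results


api = FakeApi([["t1", "t2"], ["t3"]])
tweets = list(get_tweets(api, "foo"))
print(tweets)
```

Passing `api` in as a parameter, rather than relying on a global, is a choice made here only so the sketch can be run against the fake.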


#33109

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-11-11 00:23 +0000
Message-ID	<509eefeb$0$29980$c3e8da3$5496439d@news.astraweb.com>
In reply to	#33107

On Sat, 10 Nov 2012 17:58:14 -0500, Roy Smith wrote:

> The way it works, you start with page=1.  It returns a list of tweets.
> If the list is empty, there are no more tweets.  If the list is not
> empty, you can try to get more tweets by asking for page=2, page=3, etc.
> I've got:
> 
>     page = 1
>     while 1:
>         r = api.GetSearch(term="foo", page=page) 
>         if not r:
>             break
>         for tweet in r:
>             process(tweet)
>         page += 1
> 
> It works, but it seems excessively fidgety.  Is there some cleaner way
> to refactor this?


Seems clean enough to me. It does exactly what you need: loop until there 
are no more tweets, process each tweet.

If you're allergic to nested loops, move the inner for-loop into a 
function. Also you could get rid of the "if not r: break".

page = 1
r = ["placeholder"]
while r:
    r = api.GetSearch(term="foo", page=page)
    process_all(r)  # does nothing if r is empty
    page += 1


Another way would be to use a for loop for the outer loop.

for page in xrange(1, sys.maxint):
    r = api.GetSearch(term="foo", page=page)
    if not r: break
    process_all(r)
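
A possible Python 3 rendering of the for-loop variant, as a sketch only: `xrange` and `sys.maxint` no longer exist there, so `itertools.count(1)` is swapped in as the unbounded page counter. `process`, `process_all` and `fetch_page` are hypothetical helpers; `fetch_page` stands in for `api.GetSearch(term="foo", page=page)`.

```python
import itertools

processed = []


def process(tweet):
    # Placeholder for the real per-tweet work.
    processed.append(tweet.upper())


def process_all(tweets):
    """Process every tweet on one page; a no-op for an empty page."""
    for tweet in tweets:
        process(tweet)


def fetch_page(page):
    # Hypothetical stand-in for api.GetSearch(term="foo", page=page).
    canned = {1: ["a", "b"], 2: ["c"]}
    return canned.get(page, [])


for page in itertools.count(1):
    r = fetch_page(page)
    if not r:
        break
    process_all(r)

print(processed)
```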



-- 
Steven


#33116

From	Steve Howell <showell@domaintools.com>
Date	2012-11-10 19:03 -0800
Message-ID	<5a260a79-818d-47a8-9404-37b014587730@px4g2000pbc.googlegroups.com>
In reply to	#33107

On Nov 10, 2:58 pm, Roy Smith <r...@panix.com> wrote:
> I'm trying to pull down tweets with one of the many twitter APIs.  The
> particular one I'm using (python-twitter), has a call:
>
> data = api.GetSearch(term="foo", page=page)
>
> The way it works, you start with page=1.  It returns a list of tweets.
> If the list is empty, there are no more tweets.  If the list is not
> empty, you can try to get more tweets by asking for page=2, page=3, etc.
> I've got:
>
>     page = 1
>     while 1:
>         r = api.GetSearch(term="foo", page=page)
>         if not r:
>             break
>         for tweet in r:
>             process(tweet)
>         page += 1
>
> It works, but it seems excessively fidgety.  Is there some cleaner way
> to refactor this?

I think your code is perfectly readable and clean, but you can flatten
it like so:

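    import itertools

    def get_tweets(term, get_page):
        # get_page(term, page) returns one page of tweets, e.g. api.GetSearch
        page_nums = itertools.count(1)
        pages = itertools.imap(lambda page: get_page(term, page), page_nums)
        # stop at the first empty page
        valid_pages = itertools.takewhile(bool, pages)
        tweets = itertools.chain.from_iterable(valid_pages)
        return tweets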
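
For readers on Python 3, where `itertools.imap` is gone and the built-in `map` is already lazy, the same pipeline might look like the sketch below. `fake_get_page` is a hypothetical page source that records its calls, so it is visible that pages are fetched only on demand.

```python
import itertools


def get_tweets(term, get_page):
    """Lazily flatten 1-based pages from get_page(term, page) into one stream."""
    page_nums = itertools.count(1)
    pages = map(lambda page: get_page(term, page), page_nums)
    valid_pages = itertools.takewhile(bool, pages)  # stop at first empty page
    return itertools.chain.from_iterable(valid_pages)


calls = []


def fake_get_page(term, page):
    # Hypothetical page source; records each requested page number.
    calls.append(page)
    canned = {1: ["x", "y"], 2: ["z"]}
    return canned.get(page, [])


stream = get_tweets("foo", fake_get_page)
first = next(stream)             # triggers the fetch of page 1 only
calls_after_first = list(calls)
rest = list(stream)              # drains page 1, then fetches pages 2 and 3
```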


#33121

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2012-11-11 08:56 +0100
Message-ID	<mailman.3554.1352620798.27098.python-list@python.org>
In reply to	#33116

Steve Howell, 11.11.2012 04:03:
> On Nov 10, 2:58 pm, Roy Smith <r...@panix.com> wrote:
>> I'm trying to pull down tweets with one of the many twitter APIs.  The
>> particular one I'm using (python-twitter), has a call:
>>
>> data = api.GetSearch(term="foo", page=page)
>>
>> The way it works, you start with page=1.  It returns a list of tweets.
>> If the list is empty, there are no more tweets.  If the list is not
>> empty, you can try to get more tweets by asking for page=2, page=3, etc.
>> I've got:
>>
>>     page = 1
>>     while 1:
>>         r = api.GetSearch(term="foo", page=page)
>>         if not r:
>>             break
>>         for tweet in r:
>>             process(tweet)
>>         page += 1
>>
>> It works, but it seems excessively fidgety.  Is there some cleaner way
>> to refactor this?
> 
> I think your code is perfectly readable and clean, but you can flatten
> it like so:
> 
>     def get_tweets(term, get_page):
>         page_nums = itertools.count(1)
>         pages = itertools.imap(lambda page: get_page(term, page), page_nums)
>         valid_pages = itertools.takewhile(bool, pages)
>         tweets = itertools.chain.from_iterable(valid_pages)
>         return tweets

I'd prefer the original code ten times over this inaccessible beast.

Stefan


#33167

From	rusi <rustompmody@gmail.com>
Date	2012-11-11 23:09 -0800
Message-ID	<a61c52b7-eb49-45a5-a4b4-8e4c6b4acaf1@v9g2000pbi.googlegroups.com>
In reply to	#33107

On Nov 11, 3:58 am, Roy Smith <r...@panix.com> wrote:
> I'm trying to pull down tweets with one of the many twitter APIs.  The
> particular one I'm using (python-twitter), has a call:
>
> data = api.GetSearch(term="foo", page=page)
>
> The way it works, you start with page=1.  It returns a list of tweets.
> If the list is empty, there are no more tweets.  If the list is not
> empty, you can try to get more tweets by asking for page=2, page=3, etc.
> I've got:
>
>     page = 1
>     while 1:
>         r = api.GetSearch(term="foo", page=page)
>         if not r:
>             break
>         for tweet in r:
>             process(tweet)
>         page += 1
>
> It works, but it seems excessively fidgety.  Is there some cleaner way
> to refactor this?

This is a classic problem -- structure clash of parallel loops -- and
Steve Howell has given the classic solution using the fact that
generators in python simulate/implement lazy lists.
As David Beazley explains (http://www.dabeaz.com/coroutines/),
coroutines are more general than generators and you can use those if
you prefer.

The classic problem used to be stated like this:
The input is on cards of 80 columns.
It needs to be copied onto a printer with 132 columns.

The structure clash arises because after reading 80 chars a new card
has to be read; after printing 132 chars a linefeed has to be given.

To pythonize the problem, let's replace the 80,132 by 3,4, i.e. take the
char-square
abc
def
ghi

and produce
abcd
efgh
i

The important difference (explained nicely by Beazley) is that with
generators the for-loop pulls values out of the generator, whereas with
coroutines the 'generator' pushes values into the consuming coroutine.

---------------
from __future__ import print_function
s= ["abc", "def", "ghi"]

# Coroutine-infrastructure from pep 342
def consumer(func):
    def wrapper(*args,**kw):
        gen = func(*args, **kw)
        gen.next()
        return gen
    return wrapper

@consumer
def endStage():
    while True:
        for i in range(0,4):
            print((yield), sep='', end='')
        print("\n", sep='', end='')

def genStage(s, target):
    for line in s:
        for i in range(0,3):
            target.send(line[i])

if __name__ == '__main__':
    genStage(s, endStage())
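
A Python 3 port of the same push-style pipeline, offered as a sketch rather than a definitive version. The names `end_stage` and `gen_stage`, the `width` parameter, and collecting rows into a list instead of printing them are choices made here so the result can be inspected. The final partial row is flushed when the coroutine is closed.

```python
import functools


def consumer(func):
    """Prime a generator-based coroutine so it is ready to receive send()."""
    @functools.wraps(func)
    def wrapper(*args, **kw):
        gen = func(*args, **kw)
        next(gen)  # Python 3 spelling of gen.next()
        return gen
    return wrapper


@consumer
def end_stage(width, lines):
    """Collect pushed characters into rows of `width` characters."""
    row = []
    try:
        while True:
            row.append((yield))
            if len(row) == width:
                lines.append("".join(row))
                row = []
    except GeneratorExit:
        # close() was called: flush whatever is left as a short last row.
        if row:
            lines.append("".join(row))


def gen_stage(cards, target):
    """Push every character of every card into the consuming coroutine."""
    for card in cards:
        for ch in card:
            target.send(ch)
    target.close()


lines = []
gen_stage(["abc", "def", "ghi"], end_stage(4, lines))
print("\n".join(lines))
```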


#33185

From	rusi <rustompmody@gmail.com>
Date	2012-11-12 07:21 -0800
Message-ID	<bf637b4d-c257-48af-bb3f-b5438f341c67@nl3g2000pbc.googlegroups.com>
In reply to	#33167

On Nov 12, 12:09 pm, rusi <rustompm...@gmail.com> wrote:
> This is a classic problem -- structure clash of parallel loops
<rest snipped>

Sorry wrong solution :D

The fidgetiness is entirely due to python not allowing C-style loops
like these:
>> while ((c = getchar()) != EOF) { ... }


Putting it into coroutine form, it becomes something like the
following [Untested since I don't have the API]. Clearly the
fidgetiness is there as before and now with extra coroutine plumbing.

def genStage(term, target):
    page = 1
    while 1:
        r = api.GetSearch(term=term, page=page)
        if not r:
            break
        for tweet in r:
            target.send(tweet)
        page += 1


@consumer
def endStage():
    while True:     process((yield))

if __name__ == '__main__':
    genStage("foo", endStage())


#33187

From	Peter Otten <__peter__@web.de>
Date	2012-11-12 16:49 +0100
Message-ID	<mailman.3588.1352735375.27098.python-list@python.org>
In reply to	#33185

rusi wrote:

> The fidgetiness is entirely due to python not allowing C-style loops
> like these:
> >>> while ((c = getchar()) != EOF) { ... }

for c in iter(getchar, EOF):
    ...

> Clearly the fidgetiness is there as before and now with extra coroutine
> plumbing

Hmm, very funny...
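
The two-argument form `iter(callable, sentinel)` used above calls the callable repeatedly and stops as soon as it returns a value equal to the sentinel. A minimal Python 3 sketch, where `getchar` and `EOF` are stand-ins built on `io.StringIO` rather than anything from C or from the thread:

```python
import io

EOF = ""  # io.StringIO.read(1) returns an empty string at end of input

stream = io.StringIO("abc")


def getchar():
    # Hypothetical stand-in for C's getchar(): read one character.
    return stream.read(1)


chars = []
for c in iter(getchar, EOF):
    chars.append(c)

print(chars)
```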


#33188

From	Steve Howell <showell30@yahoo.com>
Date	2012-11-12 08:09 -0800
Message-ID	<c51bc296-f300-4f83-ac12-3f31217ba8fb@n2g2000pbp.googlegroups.com>
In reply to	#33185

On Nov 12, 7:21 am, rusi <rustompm...@gmail.com> wrote:
> On Nov 12, 12:09 pm, rusi <rustompm...@gmail.com> wrote:
> > This is a classic problem -- structure clash of parallel loops
>
> <rest snipped>
>
> Sorry wrong solution :D
>
> The fidgetiness is entirely due to python not allowing C-style loops
> like these:
>
> >> while ((c = getchar()) != EOF) { ... }
> [...]

There are actually three fidgety things going on:

 1. The API is 1-based instead of 0-based.
 2. You don't know the number of pages in advance.
 3. You want to process tweets, not pages of tweets.

Here's yet another take on the problem:

    from itertools import chain, count

    # wrap fidgety 1-based api
    def search(i):
        return api.GetSearch("foo", i+1)

    paged_tweets = (search(i) for i in count())

    # handle sentinel
    paged_tweets = iter(paged_tweets.next, [])

    # flatten pages
    tweets = chain.from_iterable(paged_tweets)
    for tweet in tweets:
        process(tweet)
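
A Python 3 sketch of the same three steps, under stated assumptions: generators there expose `__next__` rather than `.next`, `fake_search` is a hypothetical stand-in for `api.GetSearch`, and `count(1)` is used here to absorb the 1-based numbering instead of a wrapper function.

```python
from itertools import chain, count


def fake_search(term, page):
    # Hypothetical stand-in for api.GetSearch(term, page); pages are 1-based.
    canned = {1: ["t1"], 2: ["t2", "t3"]}
    return canned.get(page, [])


# Step 1: count(1) handles the 1-based page numbering.
paged_tweets = (fake_search("foo", page) for page in count(1))

# Step 2: two-argument iter() stops when a page equals the sentinel [].
paged_tweets = iter(paged_tweets.__next__, [])

# Step 3: flatten pages into a single stream of tweets.
tweets = list(chain.from_iterable(paged_tweets))
print(tweets)
```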


#33218

From	rusi <rustompmody@gmail.com>
Date	2012-11-12 20:14 -0800
Message-ID	<bd59d069-1041-40b7-a8eb-30c2d9595467@v9g2000pbi.googlegroups.com>
In reply to	#33188

On Nov 12, 9:09 pm, Steve Howell <showel...@yahoo.com> wrote:
> On Nov 12, 7:21 am, rusi <rustompm...@gmail.com> wrote:
>
> > On Nov 12, 12:09 pm, rusi <rustompm...@gmail.com> wrote:
> > > This is a classic problem -- structure clash of parallel loops
>
> > <rest snipped>
>
> > Sorry wrong solution :D
>
> > The fidgetiness is entirely due to python not allowing C-style loops
> > like these:
>
> > >> while ((c = getchar()) != EOF) { ... }
> > [...]
>
> There are actually three fidgety things going on:
>
>  1. The API is 1-based instead of 0-based.
>  2. You don't know the number of pages in advance.
>  3. You want to process tweets, not pages of tweets.
>
> Here's yet another take on the problem:
>
>     # wrap fidgety 1-based api
>     def search(i):
>         return api.GetSearch("foo", i+1)
>
>     paged_tweets = (search(i) for i in count())
>
>     # handle sentinel
>     paged_tweets = iter(paged_tweets.next, [])
>
>     # flatten pages
>     tweets = chain.from_iterable(paged_tweets)
>     for tweet in tweets:
>         process(tweet)

[Steve Howell]
Nice on the whole -- thanks
Could not the 1-based-ness be dealt with by using count(1)?
ie use
paged_tweets = (api.GetSearch("foo", i) for i in count(1))

[Peter]
> >>> while ((c = getchar()) != EOF) { ... }

for c in iter(getchar, EOF):
    ...

Thanks. Learnt something
