Groups > comp.lang.python > #44585 > unrolled thread

Re: Why chunks is not part of the python standard lib?

Started by	Oscar Benjamin <oscar.j.benjamin@gmail.com>
First post	2013-05-01 10:00 +0100
Last post	2013-05-02 14:02 +0000
Articles	8 — 5 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Why chunks is not part of the python standard lib? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-05-01 10:00 +0100
    Re: Why chunks is not part of the python standard lib? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-05-02 05:15 +0000
      Re: Why chunks is not part of the python standard lib? Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2013-05-02 08:53 +0000
      Re: Why chunks is not part of the python standard lib? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-05-02 10:23 +0100
      Re: Why chunks is not part of the python standard lib? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-05-02 13:52 +0100
      Re: Why chunks is not part of the python standard lib? Chris Angelico <rosuav@gmail.com> - 2013-05-02 22:55 +1000
      Re: Why chunks is not part of the python standard lib? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-05-02 14:23 +0100
      Re: Why chunks is not part of the python standard lib? Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2013-05-02 14:02 +0000

#44585 — Re: Why chunks is not part of the python standard lib?

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-05-01 10:00 +0100
Subject	Re: Why chunks is not part of the python standard lib?
Message-ID	<mailman.1213.1367398827.3114.python-list@python.org>

On 1 May 2013 08:10, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> On 01/05/2013 07:26, Ricardo Azpeitia Pimentel wrote:
>>
>> After reading How do you split a list into evenly sized chunks in
>> Python?
>>
>> <http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python>
>>
>> and seeing this kind of mistakes happening
>> https://code.djangoproject.com/ticket/18972 all the time.
>>
>> Why is not a |chunks| function in itertools?
>>
>> |grouper| from
>> http://docs.python.org/2/library/itertools.html#recipes doesn't have the
>> same behavior as |chunks |
>>
>> Example:
>> |
>>
>> |chunks([1,  2,  3,  4,  5],  3)
>> # Should return [[1, 2, 3], [4, 5]] or the iterator equivalent.|
>>
>> |Original Post on StackOverflow:
>>
>> http://stackoverflow.com/questions/16313008/why-chunks-is-not-part-of-the-python-standard-lib
>>
>
> Asked and answered a trillion times.  There's no concensus on how chucks
> should behave.

I'm not sure that's a valid argument against it since a chunks
function could just do a different thing depending on the arguments
given.

The issue is around how to deal with the last chunk if it isn't the
same length as the others and I can only think of 4 reasonable
responses:

1) Yield a shorter chunk
2) Extend the chunk with fill values
3) Raise an error
4) Ignore the last chunk

Cases 2 and 4 can be achieved with current itertools primitives e.g.:
2) izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n)
4) zip(*[iter(iterable)] * n)

However I have only ever had use cases for 1 and 3 and these are not
currently possible without something additional (e.g. a generator
function).

In any case a chunks function can simply take arguments to give all 4
behaviours:

def chunks(iterable, chunksize, uneven='return_short', fillvalue=None):
   # loop through yielding all even chunks
   # and then
   if uneven == 'return_short:
      yield chunk
   elif uneven == 'raise':
      raise ValueError('No items left')
   elif uneven == 'fill':
      yield chunk + [fillvalue] * (chunksize - len(chunk))
   elif uneven == 'ignore':
      pass


Oscar

[toc] | [next] | [standalone]

#44612

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-05-02 05:15 +0000
Message-ID	<5181f679$0$29882$c3e8da3$5496439d@news.astraweb.com>
In reply to	#44585

On Wed, 01 May 2013 10:00:04 +0100, Oscar Benjamin wrote:

> On 1 May 2013 08:10, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
>> On 01/05/2013 07:26, Ricardo Azpeitia Pimentel wrote:
>>>
>>> After reading How do you split a list into evenly sized chunks in
>>> Python?
>>>
>>> <http://stackoverflow.com/questions/312443/how-do-you-split-a-list-
into-evenly-sized-chunks-in-python>
>>>
>>> and seeing this kind of mistakes happening
>>> https://code.djangoproject.com/ticket/18972 all the time.

That bug is irrelevant to the question about chunking a sequence or 
iterator.

>>> Why is not a |chunks| function in itertools?
[...]
>> Asked and answered a trillion times.  There's no concensus on how
>> chucks should behave.
> 
> I'm not sure that's a valid argument against it since a chunks function
> could just do a different thing depending on the arguments given.

Yes, but that's a rubbish API. I'd rather have five separate functions. 
Or maybe five methods on a single Chunker object.

I think the real reasons it's not in the standard library are:

- there's no consensus on what chunking should do;

- hence whatever gets added will disappoint some people;

- unless you add "all of them", in which case you've now got a 
significantly harder API ("there are five chunk functions in itertools, 
which should I use?"); 

- and none of them are actually very hard to write.

So the prospect of adding chunks somewhere is unattractive: lots of angst 
for very little benefit. Yes, it could be done, but none of the Python 
developers, and especially not the itertools "owner", Raymond Hettinger, 
think that the added complication is worth the benefit.

> The issue is around how to deal with the last chunk if it isn't the same
> length as the others and I can only think of 4 reasonable responses:

That's not the only issue. What does chunking (grouping) even mean? Given:

chunk("abcdef", 3)

should I get this?  [abc, def]

or this?  [abc, bcd, cde, def]

There are good use-cases for both.

If given a string, should chunking re-join the individual characters into 
strings, or leave them as lists of chars? Tuples of chars? E.g.

chunk("abcdef", 3) => "abc" ...

or ["a", "b", "c"] ...

How about bytes?

I have opinions on these questions, but I'm not going to give them to 
you. The point is that chunking means different things to different 
people. If you write your own, you get to pick whatever behaviour you 
like, instead of trying to satisfy everyone.

> 1) Yield a shorter chunk
> 2) Extend the chunk with fill values
> 3) Raise an error
> 4) Ignore the last chunk
> 
> Cases 2 and 4 can be achieved with current itertools primitives e.g.: 2)
> izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n) 4)
> zip(*[iter(iterable)] * n)
> 
> However I have only ever had use cases for 1 and 3 and these are not
> currently possible without something additional (e.g. a generator
> function).

All of these are trivial. Start with the grouper recipe from the itertools 
documentation, which is your case 2) above, renaming if desired:

http://docs.python.org/2/library/itertools.html#recipes

def chunk_pad(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

Now define:

def chunk_short(n, iterable):  # Case 1) above
    sentinel = object()
    for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
        if sentinel not in chunk:
            yield chunk
        else:
            i = chunk.index(sentinel)
            yield chunk[:i]

def chunk_strict(n, iterable):  # Case 3) above
    sentinel = object()
    for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
        if sentinel in chunk:
            raise ValueError
        yield chunk

def chunk(n, iterable):  # Case 4) above
    args = [iter(iterable)]*n
    return izip(*args)

def chunk_other(n, iterable):  # I suck at thinking up names...
    it = iter(iterable)
    values = [next(it) for _ in range(n)]  # What if this is short?
    while True:
        yield tuple(values)
        values.pop(0)
        try:
            values.append(next(it))
        except StopIteration:
            break

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#44613

From	Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date	2013-05-02 08:53 +0000
Message-ID	<mailman.1227.1367484852.3114.python-list@python.org>
In reply to	#44612

Steven D'Aprano <steve+comp.lang.python <at> pearwood.info> writes:

>
> > 1) Yield a shorter chunk
> > 2) Extend the chunk with fill values
> > 3) Raise an error
> > 4) Ignore the last chunk
> > 
> > Cases 2 and 4 can be achieved with current itertools primitives e.g.: 2)
> > izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n) 4)
> > zip(*[iter(iterable)] * n)
> > 
> > However I have only ever had use cases for 1 and 3 and these are not
> > currently possible without something additional (e.g. a generator
> > function).
> 
> All of these are trivial. Start with the grouper recipe from the itertools 
> documentation, which is your case 2) above, renaming if desired:
> 
> http://docs.python.org/2/library/itertools.html#recipes
> 
> def chunk_pad(n, iterable, fillvalue=None):
>     args = [iter(iterable)] * n
>     return izip_longest(fillvalue=fillvalue, *args)
> 
> Now define:
> 
> def chunk_short(n, iterable):  # Case 1) above
>     sentinel = object()
>     for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
>         if sentinel not in chunk:
>             yield chunk
>         else:
>             i = chunk.index(sentinel)
>             yield chunk[:i]
> 
> def chunk_strict(n, iterable):  # Case 3) above
>     sentinel = object()
>     for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
>         if sentinel in chunk:
>             raise ValueError
>         yield chunk
> 

These are only trivial on the surface. I brought up this topic on
python-ideas just weeks ago and it turns out there's a surprising numbers of
alternate solutions that people use for these two cases. Yours is
straightforward and simple, but comes at the price of the if sentinel clause
being checked repeatedly. An optimized version suggested by Peter Otten
replaces your for loop by:

chunk_pad = zip_longest(*args, fillvalue=fillvalue)
prev = next(chunks)

for chunk in chunk_pad:
    yield prev
    prev = chunk

then doing the sentinel test only once at the end.

>> 1) Yield a shorter chunk
>> 2) Extend the chunk with fill values
>> 3) Raise an error
>> 4) Ignore the last chunk
>> 
>> Cases 2 and 4 can be achieved with current itertools primitives e.g.: 2)
>> izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n) 4)
>> zip(*[iter(iterable)] * n)

In my opinion, it would make sense to have the 4 cases suggested by Oscar
covered by itertools. As he says, cases 2 and 4 are already (and there is
the grouper recipe in itertools giving the solution for case 2). It would
prevent people from reinventing (often suboptimal) solutions to these common
problems and it would bring a speed-gain even compared to the best Python
implementations since things would be coded in C.

I would advocate for either of the following two solutions:

a) have an extra 'mode'-type argument for zip_longest() to control its
behavior (default mode could be the current fillvalue padding, 'strict' mode
would raise an error, and 'relaxed' mode would yield the shorter chunk.
or
b) have extra zip_strict and zip_relaxed (I'm also not too good at thinking
up names :)) functions in itertools

Either way, you could now very easily modify the existing grouper recipe in
itertools to implement the four different 'chunks' functions (I would keep
calling them 'grouper' functions in line with the current itertools version).

> I think the real reasons it's not in the standard library are:
> 
> - there's no consensus on what chunking should do;
> 
> - hence whatever gets added will disappoint some people;
> 
> - unless you add "all of them", in which case you've now got a 
> significantly harder API ("there are five chunk functions in itertools, 
> which should I use?");
>
> What does chunking (grouping) even mean? Given:
> 
> chunk("abcdef", 3)
> 
> should I get this?  [abc, def]
> 
> or this?  [abc, bcd, cde, def]

I guess this suggestion would not compromise the API too much, after all,
all these zip versions would still behave like a zip() function should.
Itertools users would also know what a 'grouper' recipe does, i.e., it
doesn't do the fancier alternative stuff you suggest like rejoining groups
of characters obtained from a string. So this would be a relatively
conservative addition.

What do you think?
Wolfgang

[toc] | [prev] | [next] | [standalone]

#44615

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-05-02 10:23 +0100
Message-ID	<mailman.1228.1367486653.3114.python-list@python.org>
In reply to	#44612

On 02/05/2013 09:53, Wolfgang Maier wrote:
>
> What do you think?
> Wolfgang
>

It ain't gonna happen, unless somebody puts Raymond Hettinger in the 
Comfy Chair :)

-- 
If you're using GoogleCrap™ please read this 
http://wiki.python.org/moin/GoogleGroupsPython.

Mark Lawrence

[toc] | [prev] | [next] | [standalone]

#44620

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-05-02 13:52 +0100
Message-ID	<mailman.1232.1367499164.3114.python-list@python.org>
In reply to	#44612

On 2 May 2013 06:15, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Wed, 01 May 2013 10:00:04 +0100, Oscar Benjamin wrote:
>>
>> I'm not sure that's a valid argument against it since a chunks function
>> could just do a different thing depending on the arguments given.
>
> Yes, but that's a rubbish API. I'd rather have five separate functions.
> Or maybe five methods on a single Chunker object.

Fair enough.

[snip]
> - and none of them are actually very hard to write.

They are all easy to write as generator functions but to me the point
of itertools is that you can do things more efficiently than a
generator function. Otherwise code that uses a combination of
itertools primitives is usually harder to understand than an
equivalent generator function so I'd probably avoid using itertools.

> So the prospect of adding chunks somewhere is unattractive: lots of angst
> for very little benefit. Yes, it could be done, but none of the Python
> developers, and especially not the itertools "owner", Raymond Hettinger,
> think that the added complication is worth the benefit.

It's not necessarily that chunks should be added but that it would be
good if itertools had the necessary primitives to be able to achieve
something like chunks without a generator function. As Wolfgang says
zip() and zip_longest() are only 2 of the 4 obvious types of zipping
and correspond to the 2 of the 4 obvious types of chunks() that can be
done with itertools primitives.

>> The issue is around how to deal with the last chunk if it isn't the same
>> length as the others and I can only think of 4 reasonable responses:
>
> That's not the only issue. What does chunking (grouping) even mean? Given:
>
> chunk("abcdef", 3)
>
> should I get this?  [abc, def]
>
> or this?  [abc, bcd, cde, def]

I don't think many people expect that. In any case that type of
chunking is already possible using itertools.

> There are good use-cases for both.
>
> If given a string, should chunking re-join the individual characters into
> strings, or leave them as lists of chars? Tuples of chars? E.g.
>
> chunk("abcdef", 3) => "abc" ...
>
> or ["a", "b", "c"] ...
>
> How about bytes?

It should be tuples or lists. Anything in itertools should just treat
everything as iterators/iterables and not alter its behaviour for
different types of sequences.

>> 1) Yield a shorter chunk
>> 2) Extend the chunk with fill values
>> 3) Raise an error
>> 4) Ignore the last chunk
>>
>> Cases 2 and 4 can be achieved with current itertools primitives e.g.: 2)
>> izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n) 4)
>> zip(*[iter(iterable)] * n)
>>
>> However I have only ever had use cases for 1 and 3 and these are not
>> currently possible without something additional (e.g. a generator
>> function).
>
> All of these are trivial. Start with the grouper recipe from the itertools
> documentation, which is your case 2) above, renaming if desired:
>
> http://docs.python.org/2/library/itertools.html#recipes
>
[snip]
>
> def chunk_short(n, iterable):  # Case 1) above
>     sentinel = object()
>     for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
>         if sentinel not in chunk:

The point is it would be good if you could avoid the check in the line
above and the overhead of having an if statement and a generator frame
between every next call. You can do better by writing the line above
as
    if chunk[-1] is not sentinel
but it is still a redundant check. zip_relaxed() would know when
StopIteration was raised and wouldn't need to retrospectively check
every chunk.

>             yield chunk
>         else:
>             i = chunk.index(sentinel)
>             yield chunk[:i]
>
>
> def chunk_strict(n, iterable):  # Case 3) above
>     sentinel = object()
>     for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
>         if sentinel in chunk:

if chunk[-1] is sentinel

>             raise ValueError
>         yield chunk
>
>
> def chunk(n, iterable):  # Case 4) above
>     args = [iter(iterable)]*n
>     return izip(*args)
>
>
> def chunk_other(n, iterable):  # I suck at thinking up names...
>     it = iter(iterable)
>     values = [next(it) for _ in range(n)]  # What if this is short?
>     while True:
>         yield tuple(values)
>         values.pop(0)
>         try:
>             values.append(next(it))
>         except StopIteration:
>             break

You can do this one somewhat wastefully with itertools instead of a
generator function:

def chunk_other(n, iterable):
    iterators = tee(iterable, n)
    for n, it in enumerate(iterators):
        for _ in range(n):
            try:
                next(it)
            except StopIteration:
                return iter([])
    return izip(*iterators)

I don't know which of those two would be faster but it's certainly
easier to understand the generator function.


Oscar

[toc] | [prev] | [next] | [standalone]

#44621

From	Chris Angelico <rosuav@gmail.com>
Date	2013-05-02 22:55 +1000
Message-ID	<mailman.1233.1367499359.3114.python-list@python.org>
In reply to	#44612

On Thu, May 2, 2013 at 10:52 PM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
> They are all easy to write as generator functions but to me the point
> of itertools is that you can do things more efficiently than a
> generator function. Otherwise code that uses a combination of
> itertools primitives is usually harder to understand than an
> equivalent generator function so I'd probably avoid using itertools.

Aren't most of the itertools primitives written in Python anyway? If
your code is harder to understand, just write the generator function!

ChrisA

[toc] | [prev] | [next] | [standalone]

#44623

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-05-02 14:23 +0100
Message-ID	<mailman.1235.1367501060.3114.python-list@python.org>
In reply to	#44612

On 2 May 2013 13:55, Chris Angelico <rosuav@gmail.com> wrote:
> On Thu, May 2, 2013 at 10:52 PM, Oscar Benjamin
> <oscar.j.benjamin@gmail.com> wrote:
>> They are all easy to write as generator functions but to me the point
>> of itertools is that you can do things more efficiently than a
>> generator function. Otherwise code that uses a combination of
>> itertools primitives is usually harder to understand than an
>> equivalent generator function so I'd probably avoid using itertools.
>
> Aren't most of the itertools primitives written in Python anyway? If
> your code is harder to understand, just write the generator function!

The documentation describes them by showing equivalent generator
functions and there may be a pure Python version of the module but if
you look here then you can see which are builtin for CPython:

http://hg.python.org/cpython/file/c3656dca65e7/Modules/itertoolsmodule.c#l4070

The list covers all of the documented itertools functions (actually I
now realise that they're mostly types not functions).


Oscar

[toc] | [prev] | [next] | [standalone]

#44625

From	Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date	2013-05-02 14:02 +0000
Message-ID	<mailman.1237.1367503348.3114.python-list@python.org>
In reply to	#44612

Oscar Benjamin <oscar.j.benjamin <at> gmail.com> writes:

> 
> On 2 May 2013 13:55, Chris Angelico <rosuav <at> gmail.com> wrote:
> > On Thu, May 2, 2013 at 10:52 PM, Oscar Benjamin
> > <oscar.j.benjamin <at> gmail.com> wrote:
> >> They are all easy to write as generator functions but to me the point
> >> of itertools is that you can do things more efficiently than a
> >> generator function. Otherwise code that uses a combination of
> >> itertools primitives is usually harder to understand than an
> >> equivalent generator function so I'd probably avoid using itertools.
> >
> > Aren't most of the itertools primitives written in Python anyway? If
> > your code is harder to understand, just write the generator function!
> 
> The documentation describes them by showing equivalent generator
> functions and there may be a pure Python version of the module but if
> you look here then you can see which are builtin for CPython:
> 
> http://hg.python.org/cpython/file/c3656dca65e7/Modules/itertoolsmodule.c#l4070
> 
> The list covers all of the documented itertools functions (actually I
> now realise that they're mostly types not functions).
> 
> Oscar
> 

Maybe it's helpful to agree on terminology here.
Looks like Oscar is using *itertools primitives* for things like zip_longest()
- (I'm referring to Python3 here, where the i in zip_longest has disappeared
and the former izip() has been replaced with the built-in zip()) -
, which the docs call "a core set of fast, memory efficient tools that are
useful by themselves or in combination" and that are listed in the doc's
subsection .1 ("Itertools functions"). I think these are written in C (and
explained by giving equivalent Python code).
Then there are the *recipes* (subsection .2), which consist of Python code
that combines and uses *primitves*. Among those is the *grouper recipe*,
which is a chunks() doing fillvalue padding on the last chunk. It's obvious
that you could write an analogous *grouper* that ignores a truncated last
chunk by using zip instead of zip_longest.
Now, Oscar's and my point is that by adding two more *primitives*, call them
zip_strict and zip_relaxed or whatever you like, written in C and behaving
analogous to zip_longest, you could write analogous *grouper recipes* that
raise an error or simply yield when the last chunk is truncated. Those
*recipes* would still have Python code of course, but they would return an
iterator implemented in C.

[toc] | [prev] | [standalone]

csiph-web

Re: Why chunks is not part of the python standard lib?

Contents

#44585 — Re: Why chunks is not part of the python standard lib?

#44612

#44613

#44615

#44620

#44621

#44623

#44625