itertools.groupby

Started by	Jason Friedman <jsf80238@gmail.com>
First post	2013-04-20 11:09 -0600
Last post	2013-04-23 01:14 +1000
Articles	7 — 6 participants

  itertools.groupby Jason Friedman <jsf80238@gmail.com> - 2013-04-20 11:09 -0600
    Re: itertools.groupby Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-21 00:13 +0000
      Re: itertools.groupby Joshua Landau <joshua.landau.ws@gmail.com> - 2013-04-22 04:09 +0100
    Re: itertools.groupby Neil Cerutti <neilc@norwich.edu> - 2013-04-22 14:24 +0000
      Re: itertools.groupby Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 15:49 +0100
        Re: itertools.groupby Neil Cerutti <neilc@norwich.edu> - 2013-04-22 15:04 +0000
      Re: itertools.groupby Chris Angelico <rosuav@gmail.com> - 2013-04-23 01:14 +1000

#43957 — itertools.groupby

From	Jason Friedman <jsf80238@gmail.com>
Date	2013-04-20 11:09 -0600
Subject	itertools.groupby
Message-ID	<mailman.855.1366477790.3114.python-list@python.org>

I have a file such as:

$ cat my_data
Starting a new group
a
b
c
Starting a new group
1
2
3
4
Starting a new group
X
Y
Z
Starting a new group

I am wanting a list of lists:
['a', 'b', 'c']
['1', '2', '3', '4']
['X', 'Y', 'Z']
[]

I wrote this:
------------------------------------
#!/usr/bin/python3
from itertools import groupby

def get_lines_from_file(file_name):
    with open(file_name) as reader:
        for line in reader:
            yield line.strip()

counter = 0
def key_func(x):
    if x.startswith("Starting a new group"):
        global counter
        counter += 1
    return counter

for key, group in groupby(get_lines_from_file("my_data"), key_func):
    print(list(group)[1:])
------------------------------------

I get the output I desire, but I'm wondering if there is a solution without
the global counter.
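
One way to keep groupby but drop the module-level counter is to hide the count inside a closure. This is only a sketch under assumptions: make_key and groups are made-up names, and it assumes every group begins with a header line (any lines before the first header would lose their first entry to the [1:] slice).

```python
from itertools import groupby

def make_key(header):
    """Return a key function that numbers groups; the counter stays private."""
    count = 0

    def key(line):
        nonlocal count
        if line.startswith(header):
            count += 1  # a header line starts the next group
        return count

    return key

def groups(lines, header="Starting a new group"):
    """Yield the lines of each group, without the header line itself."""
    stripped = (line.strip() for line in lines)
    for _, grp in groupby(stripped, make_key(header)):
        yield list(grp)[1:]  # the first item of every group is its header
```

Each call to make_key starts from zero, so the function can be reused on several files without resetting anything by hand.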

#43976

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-04-21 00:13 +0000
Message-ID	<51732f27$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to	#43957

On Sat, 20 Apr 2013 11:09:42 -0600, Jason Friedman wrote:

> I have a file such as:
> 
> $ cat my_data
> Starting a new group
> a
> b
> c
> Starting a new group
> 1
> 2
> 3
> 4
> Starting a new group
> X
> Y
> Z
> Starting a new group
> 
> I am wanting a list of lists:
> ['a', 'b', 'c']
> ['1', '2', '3', '4']
> ['X', 'Y', 'Z']
> []
> 
> I wrote this:
[...]
> I get the output I desire, but I'm wondering if there is a solution
> without the global counter.


I wouldn't use groupby. It's a hammer, not every grouping job is a nail.

Instead, use a simple accumulator:


def group(lines):
    accum = []
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum:  # Don't bother if there are no accumulated lines.
                yield accum
                accum = []
        else:
            accum.append(line)
    # Don't forget the last group of lines.
    if accum: yield accum




-- 
Steven

#44030

From	Joshua Landau <joshua.landau.ws@gmail.com>
Date	2013-04-22 04:09 +0100
Message-ID	<mailman.895.1366600191.3114.python-list@python.org>
In reply to	#43976

On 21 April 2013 01:13, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> I wouldn't use groupby. It's a hammer, not every grouping job is a nail.
>
> Instead, use a simple accumulator:
>
>
> def group(lines):
>     accum = []
>     for line in lines:
>         line = line.strip()
>         if line == 'Starting a new group':
>             if accum:  # Don't bother if there are no accumulated lines.
>                 yield accum
>                 accum = []
>         else:
>             accum.append(line)
>     # Don't forget the last group of lines.
>     if accum: yield accum
>

Whilst yours is the simplest (bar Dennis Lee Bieber's) and nicer in that it
yields, neither of them handles empty groups properly.
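
To make that concrete, here is a small check of the quoted accumulator against the sample data from the original post. The function body is copied from the quote; only the sample variable and the final print are added for illustration.

```python
def group(lines):
    # The accumulator as quoted above: it yields only non-empty groups.
    accum = []
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum:
                yield accum
                accum = []
        else:
            accum.append(line)
    if accum:
        yield accum

sample = """Starting a new group
a
b
c
Starting a new group
1
2
3
4
Starting a new group
X
Y
Z
Starting a new group
""".splitlines()

result = list(group(sample))
print(result)  # three groups; the trailing empty group is not produced
```

The requested output had a fourth, empty list for the last header, which is the case this version skips.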

I recommend the simple change:

def group(lines):
    accum = None
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum is not None:  # Nothing to yield before the first header.
                yield accum
            accum = []
        else:
            accum.append(line)
    # Don't forget the last group of lines.
    yield accum

But will recommend my own small twist (because I think it is clever):

def group(lines):
    lines = (line.strip() for line in lines)

    if next(lines) != "Starting a new group":
        raise ValueError("First line must be 'Starting a new group'")

    while True:
        acum = []

        for line in lines:
            if line == "Starting a new group":
                break

            acum.append(line)

        else:
            yield acum
            break

        yield acum
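
A caveat for anyone running this on a current Python (3.7 or later, where PEP 479 is in force): a bare next() on empty input raises StopIteration inside the generator, and that now surfaces as a RuntimeError. Below is a hedged sketch of the same idea that passes a default to next() instead. HEADER is a name introduced here for illustration, and treating empty input as "no groups" is an assumption, not something the thread settled.

```python
HEADER = "Starting a new group"

def group(lines):
    """Yield one list per header line, including empty groups."""
    lines = (line.strip() for line in lines)

    # next() with a default never lets StopIteration escape the generator.
    first = next(lines, None)
    if first is None:
        return  # assumption: empty input simply means no groups
    if first != HEADER:
        raise ValueError("First line must be 'Starting a new group'")

    accum = []
    for line in lines:
        if line == HEADER:
            yield accum  # close the current group, even if it is empty
            accum = []
        else:
            accum.append(line)
    yield accum  # the group after the last header
```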

#44075

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-04-22 14:24 +0000
Message-ID	<atkvgbFto6uU1@mid.individual.net>
In reply to	#43957

On 2013-04-20, Jason Friedman <jsf80238@gmail.com> wrote:
> I have a file such as:
>
> $ cat my_data
> Starting a new group
> a
> b
> c
> Starting a new group
> 1
> 2
> 3
> 4
> Starting a new group
> X
> Y
> Z
> Starting a new group
>
> I am wanting a list of lists:
> ['a', 'b', 'c']
> ['1', '2', '3', '4']
> ['X', 'Y', 'Z']
> []

Hrmmm, hoomm. Nobody cares for slicing any more.

def headered_groups(lst, header):
    b = lst.index(header) + 1
    while True:
        try:
            e = lst.index(header, b)
        except ValueError:
            yield lst[b:]
            break
        yield lst[b:e]
        b = e+1

for group in headered_groups([line.strip() for line in open('data.txt')],
        "Starting a new group"):
    print(group)

-- 
Neil Cerutti

#44080

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-04-22 15:49 +0100
Message-ID	<mailman.919.1366642212.3114.python-list@python.org>
In reply to	#44075

On 22 April 2013 15:24, Neil Cerutti <neilc@norwich.edu> wrote:
>
> Hrmmm, hoomm. Nobody cares for slicing any more.
>
> def headered_groups(lst, header):
>     b = lst.index(header) + 1
>     while True:
>         try:
>             e = lst.index(header, b)
>         except ValueError:
>             yield lst[b:]
>             break
>         yield lst[b:e]
>         b = e+1

This requires the whole file to be read into memory. Iterators are
typically preferred over list slicing for sequential text file access
since you can avoid loading the whole file at once. This means that
you can process a large file while only using a constant amount of
memory.

>
> for group in headered_groups([line.strip() for line in open('data.txt')],
>         "Starting a new group"):
>     print(group)

The list comprehension above loads the entire file into memory.
Assuming that .strip() is just being used to remove the newline at the
end, it would be simpler to use read().splitlines(), since that loads
everything into memory and removes the newlines (readlines() keeps
them). To remove them without reading everything you can use map (or
imap in Python 2):

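with open('data.txt') as inputfile:
    # headered_groups must accept any iterable here; the slicing version
    # quoted above calls lst.index and so would need a list instead.
    for group in headered_groups(map(str.strip, inputfile),
                                 "Starting a new group"):
        print(group)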
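
Since the slicing version relies on lst.index, it cannot take a map object directly. A minimal sketch of a variant that works on any iterable follows. The name iter_headered_groups is hypothetical, and unlike the slicing version (which raises ValueError when no header is found) this one just yields nothing in that case. The io.StringIO object stands in for an open file.

```python
import io

def iter_headered_groups(lines, header):
    """Yield the lines following each header, reading lazily from any iterable."""
    it = iter(lines)

    # Skip everything up to and including the first header.
    for line in it:
        if line == header:
            break
    else:
        return  # no header found: nothing to yield

    group = []
    for line in it:
        if line == header:
            yield group  # only the current group is held in memory
            group = []
        else:
            group.append(line)
    yield group  # lines after the final header (possibly none)

# Usage with a file-like object in place of open('data.txt').
inputfile = io.StringIO("Starting a new group\na\nb\nStarting a new group\n")
for group in iter_headered_groups(map(str.strip, inputfile),
                                  "Starting a new group"):
    print(group)
```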

Oscar

#44085

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-04-22 15:04 +0000
Message-ID	<atl1skFto6uU3@mid.individual.net>
In reply to	#44080

On 2013-04-22, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> On 22 April 2013 15:24, Neil Cerutti <neilc@norwich.edu> wrote:
>>
>> Hrmmm, hoomm. Nobody cares for slicing any more.
>>
>> def headered_groups(lst, header):
>>     b = lst.index(header) + 1
>>     while True:
>>         try:
>>             e = lst.index(header, b)
>>         except ValueError:
>>             yield lst[b:]
>>             break
>>         yield lst[b:e]
>>         b = e+1
>
> This requires the whole file to be read into memory. Iterators
> are typically preferred over list slicing for sequential text
> file access since you can avoid loading the whole file at once.
> This means that you can process a large file while only using a
> constant amount of memory.

I agree, but this application processes unknown-sized slices, so
you have to build lists anyhow. I find slicing much more
convenient than accumulating in this case, but it's possibly a
tradeoff.

> with open('data.txt') as inputfile:
>     for group in headered_groups(map(str.strip, inputfile)):
>         print(group)

Thanks, that's a nice improvement.

-- 
Neil Cerutti

#44087

From	Chris Angelico <rosuav@gmail.com>
Date	2013-04-23 01:14 +1000
Message-ID	<mailman.925.1366643689.3114.python-list@python.org>
In reply to	#44075

On Tue, Apr 23, 2013 at 12:49 AM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
> Iterators are
> typically preferred over list slicing for sequential text file access
> since you can avoid loading the whole file at once. This means that
> you can process a large file while only using a constant amount of
> memory.

And, perhaps even more importantly, it allows you to pipe text in and
out. Obviously some operations (e.g. grep) lend themselves better to
this than others (e.g. sort), but with this approach it ought at least
to output each group as it comes.
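
As a rough illustration of that, here is a sketch of a filter that reads lines from a stream and prints each group as soon as the next header (or end of input) arrives. stream_groups and main are illustrative names, and ignoring any lines before the first header is an assumption.

```python
import sys

HEADER = "Starting a new group"

def stream_groups(lines, header=HEADER):
    """Yield each group as soon as the following header or end of input is seen."""
    group = None  # None until the first header has been read
    for line in lines:
        line = line.rstrip("\n")
        if line == header:
            if group is not None:
                yield group
            group = []
        elif group is not None:  # lines before the first header are ignored
            group.append(line)
    if group is not None:
        yield group

def main(infile=None, outfile=None):
    """Copy groups from infile to outfile, one printed list per line."""
    infile = sys.stdin if infile is None else infile
    outfile = sys.stdout if outfile is None else outfile
    for group in stream_groups(infile):
        # flush so a downstream process sees each group without delay
        print(group, file=outfile, flush=True)
```

Saved in a script that calls main(), this could sit in the middle of a shell pipeline, reading from stdin and writing to stdout.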

ChrisA
