Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #51405 > unrolled thread

collections.Counter surprisingly slow

Started byRoy Smith <roy@panix.com>
First post2013-07-28 15:59 -0400
Last post2013-07-30 16:04 +0300
Articles 12 — 6 participants

Back to article view | Back to comp.lang.python


Contents

  collections.Counter surprisingly slow Roy Smith <roy@panix.com> - 2013-07-28 15:59 -0400
    Re: collections.Counter surprisingly slow Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-28 20:51 +0000
      Re: collections.Counter surprisingly slow Roy Smith <roy@panix.com> - 2013-07-28 17:57 -0400
      Re: collections.Counter surprisingly slow Stefan Behnel <stefan_ml@behnel.de> - 2013-07-29 13:46 +0200
      Re: collections.Counter surprisingly slow Joshua Landau <joshua@landau.ws> - 2013-07-29 13:07 +0100
    Re: collections.Counter surprisingly slow Serhiy Storchaka <storchaka@gmail.com> - 2013-07-29 09:25 +0300
    Re: collections.Counter surprisingly slow Joshua Landau <joshua@landau.ws> - 2013-07-29 12:49 +0100
    Re: collections.Counter surprisingly slow Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-29 11:19 -0600
    Re: collections.Counter surprisingly slow Serhiy Storchaka <storchaka@gmail.com> - 2013-07-29 22:37 +0300
    Re: collections.Counter surprisingly slow Stefan Behnel <stefan_ml@behnel.de> - 2013-07-30 08:39 +0200
    Re: collections.Counter surprisingly slow Stefan Behnel <stefan_ml@behnel.de> - 2013-07-30 08:51 +0200
    Re: collections.Counter surprisingly slow Serhiy Storchaka <storchaka@gmail.com> - 2013-07-30 16:04 +0300

#51405 — collections.Counter surprisingly slow

FromRoy Smith <roy@panix.com>
Date2013-07-28 15:59 -0400
Subjectcollections.Counter surprisingly slow
Message-ID<roy-8C60F5.15590428072013@news.panix.com>
I've been doing an informal "intro to Python" lunchtime series for some 
co-workers (who are all experienced programmers, in other languages).  
This week I was going to cover list comprehensions, exceptions, and 
profiling. So, I did a little demo showing different ways to build a 
dictionary counting how many times a string appears in some input:

   test() implements a "look before you leap" python loop

   exception() uses a try/catch construct in a similar python loop

   default() uses a defaultdict

   count() uses a Counter

I profiled it, to show how the profiler works.  The full code is below. 
 The input is an 8.8 Mbyte file containing about 570,000 lines (11,000 
unique strings).  Python 2.7.3 on Ubuntu Precise.

As I expected, test() is slower than exception(), which is slower than 
default().  I'm rather shocked to discover that count() is the slowest 
of all!  I expected it to be the fastest.  Or, certainly, no slower 
than default().

The full profiler dump is at the end of this message, but the gist of 
it is:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.322    0.322 ./stations.py:42(count)
    1    0.159    0.159    0.159    0.159 ./stations.py:17(test)
    1    0.114    0.114    0.114    0.114 ./stations.py:27(exception)
    1    0.097    0.097    0.097    0.097 ./stations.py:36(default)

Why is count() [i.e. collections.Counter] so slow?

-------------------------------------------------------------
from collections import defaultdict, Counter

def main():
    lines = open('stations.txt').readlines()

    d1 = test(lines)
    d2 = exception(lines)
    d3 = default(lines)
    d4 = count(lines)

    print d1 == d2
    print d1 == d3
    print d1 == d4

def test(lines):
    d = {}
    for station in lines:
        if station in d:
            d[station] += 1
        else:
            d[station] = 1
    return d


def exception(lines):
    d = {}
    for station in lines:
        try:
            d[station] += 1
        except KeyError:
            d[station] = 1
    return d

def default(lines):
    d = defaultdict(int)
    for station in lines:
        d[station] += 1
    return d

def count(lines):
    d = Counter(lines)
    return d


if __name__ == '__main__':
    import cProfile
    import pstats
    cProfile.run('main()', 'stations.stats')
    p = pstats.Stats('stations.stats')
    p.sort_stats('cumulative').print_stats()
-------------------------------------------------------------

         570335 function calls (570332 primitive calls) in 0.776 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1    0.023    0.023    0.776    0.776 <string>:1(<module>)
   1    0.006    0.006    0.753    0.753 ./stations.py:5(main)
   1    0.000    0.000    0.322    0.322 ./stations.py:42(count)
   1    0.000    0.000    0.322    0.322 /usr/lib/python2.7/collections.py:407(__init__)
   1    0.242    0.242    0.322    0.322 /usr/lib/python2.7/collections.py:470(update)
   1    0.159    0.159    0.159    0.159 ./stations.py:17(test)
   1    0.114    0.114    0.114    0.114 ./stations.py:27(exception)
   1    0.097    0.097    0.097    0.097 ./stations.py:36(default)
   570285    0.080    0.000    0.080    0.000 {method 'get' of 'dict' objects}
   1    0.055    0.055    0.055    0.055 {method 'readlines' of 'file' objects}
   1    0.000    0.000    0.000    0.000 {isinstance}
   1    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/abc.py:128(__instancecheck__)
      2/1    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/abc.py:148(__subclasscheck__)
      3/1    0.000    0.000    0.000    0.000 {issubclass}
        4    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:58(__iter__)
        1    0.000    0.000    0.000    0.000 {open}
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:26(__exit__)
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:20(__enter__)
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:36(__init__)
        3    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:68(__contains__)
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:52(_commit_removals)
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:81(add)
        3    0.000    0.000    0.000    0.000 {getattr}
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:16(__init__)
        2    0.000    0.000    0.000    0.000 {method 'remove' of 'set' objects}
        2    0.000    0.000    0.000    0.000 {method '__subclasses__' of 'type' objects}
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_abcoll.py:97(__subclasshook__)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        4    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}

[toc] | [next] | [standalone]


#51407

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2013-07-28 20:51 +0000
Message-ID<51f5843f$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to#51405
On Sun, 28 Jul 2013 15:59:04 -0400, Roy Smith wrote:

[...]
> I'm rather shocked to discover that count() is the slowest
> of all!  I expected it to be the fastest.  Or, certainly, no slower than
> default().
> 
> The full profiler dump is at the end of this message, but the gist of it
> is:
> 
> ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>     1    0.000    0.000    0.322    0.322 ./stations.py:42(count)
>     1    0.159    0.159    0.159    0.159 ./stations.py:17(test)
>     1    0.114    0.114    0.114    0.114 ./stations.py:27(exception)
>     1    0.097    0.097    0.097    0.097 ./stations.py:36(default)
> 
> Why is count() [i.e. collections.Counter] so slow?

It's within a factor of 2 of test, and 3 of exception or default (give or 
take). I don't think that's surprisingly slow. In 2.7, Counter is written 
in Python, while defaultdict has an accelerated C version. I expect that 
has something to do with it.

Calling Counter ends up calling essentially this code:

for elem in iterable:
    self[elem] = self.get(elem, 0) + 1

(although micro-optimized), where "iterable" is your data (lines). 
Calling the get method has higher overhead than dict[key], that will also 
contribute.


-- 
Steven

[toc] | [prev] | [next] | [standalone]


#51409

FromRoy Smith <roy@panix.com>
Date2013-07-28 17:57 -0400
Message-ID<roy-4B9DE0.17571228072013@news.panix.com>
In reply to#51407
In article <51f5843f$0$29971$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> > Why is count() [i.e. collections.Counter] so slow?
> 
> It's within a factor of 2 of test, and 3 of exception or default (give or 
> take). I don't think that's surprisingly slow.

It is for a module which describes itself as "High-performance container 
datatypes" :-)

[toc] | [prev] | [next] | [standalone]


#51435

FromStefan Behnel <stefan_ml@behnel.de>
Date2013-07-29 13:46 +0200
Message-ID<mailman.5225.1375098437.3114.python-list@python.org>
In reply to#51407
Steven D'Aprano, 28.07.2013 22:51:
> Calling Counter ends up calling essentially this code:
> 
> for elem in iterable:
>     self[elem] = self.get(elem, 0) + 1
> 
> (although micro-optimized), where "iterable" is your data (lines). 
> Calling the get method has higher overhead than dict[key], that will also 
> contribute.

It comes with a C accelerator (at least in Py3.4dev), but it seems like
that stumbles a bit over its own feet. The accelerator function special
cases the (exact) dict type, but the Counter class is a subtype of dict and
thus takes the generic path, which makes it benefit a bit less than possible.

Look for _count_elements() in

http://hg.python.org/cpython/file/tip/Modules/_collectionsmodule.c

Nevertheless, even the generic C code path looks fast enough in general. I
think the problem is just that the OP used Python 2.7, which doesn't have
this accelerator function.

Stefan

[toc] | [prev] | [next] | [standalone]


#51438

FromJoshua Landau <joshua@landau.ws>
Date2013-07-29 13:07 +0100
Message-ID<mailman.5228.1375099712.3114.python-list@python.org>
In reply to#51407

[Multipart message — attachments visible in raw view] — view raw

On 29 July 2013 12:46, Stefan Behnel <stefan_ml@behnel.de> wrote:

> Steven D'Aprano, 28.07.2013 22:51:
> > Calling Counter ends up calling essentially this code:
> >
> > for elem in iterable:
> >     self[elem] = self.get(elem, 0) + 1
> >
> > (although micro-optimized), where "iterable" is your data (lines).
> > Calling the get method has higher overhead than dict[key], that will also
> > contribute.
>
> It comes with a C accelerator (at least in Py3.4dev), but it seems like
> that stumbles a bit over its own feet. The accelerator function special
> cases the (exact) dict type, but the Counter class is a subtype of dict and
> thus takes the generic path, which makes it benefit a bit less than
> possible.
>
> Look for _count_elements() in
>
> http://hg.python.org/cpython/file/tip/Modules/_collectionsmodule.c
>
> Nevertheless, even the generic C code path looks fast enough in general. I
> think the problem is just that the OP used Python 2.7, which doesn't have
> this accelerator function.
>

# _count_elements({}, items), _count_elements(dict_subclass(), items),
Counter(items), defaultdict(int) loop with exception handling
# "items" is always 1m long with varying levels of repetition

>>> for items in randoms:
... helper.timeit(1), helper_subclass.timeit(1), counter.timeit(1),
default.timeit(1)
...
(0.18816172199876746, 0.4679023139997298, 0.9684444869999425,
0.33518486200046027)
(0.2936601179990248, 0.6056111739999324, 1.1316078849995392,
0.46283868699902087)
(0.35396358400066674, 0.685048443998312, 1.2120939880005608,
0.5497965239992482)
(0.5337620789996436, 0.8658702100001392, 1.4507492869997805,
0.7772859329998028)
(0.745282343999861, 1.1455801379997865, 2.116569702000561,
1.3293145009993168)

:(

I have the helper but Counter is still slow. Is it not getting used for
some reason? It's not even as fast as helper on a dict's (direct, no
overridden methods) subclass.

[toc] | [prev] | [next] | [standalone]


#51418

FromSerhiy Storchaka <storchaka@gmail.com>
Date2013-07-29 09:25 +0300
Message-ID<mailman.5210.1375079167.3114.python-list@python.org>
In reply to#51405
28.07.13 22:59, Roy Smith написав(ла):
>   The input is an 8.8 Mbyte file containing about 570,000 lines (11,000
> unique strings).

Repeat you tests with totally unique lines.

> The full profiler dump is at the end of this message, but the gist of
> it is:

Profiler affects execution time. In particular it slowdown Counter 
implementation which uses more function calls. For real world 
measurement use different approach.

> Why is count() [i.e. collections.Counter] so slow?

Feel free to contribute a patch which fixes this "wart". Note that 
Counter shouldn't be slowdowned on mostly unique data.

[toc] | [prev] | [next] | [standalone]


#51436

FromJoshua Landau <joshua@landau.ws>
Date2013-07-29 12:49 +0100
Message-ID<mailman.5226.1375098641.3114.python-list@python.org>
In reply to#51405

[Multipart message — attachments visible in raw view] — view raw

On 29 July 2013 07:25, Serhiy Storchaka <storchaka@gmail.com> wrote:

> 28.07.13 22:59, Roy Smith написав(ла):
>
>    The input is an 8.8 Mbyte file containing about 570,000 lines (11,000
>> unique strings).
>>
>
> Repeat you tests with totally unique lines.


Counter is about ½ the speed of defaultdict in that case (as opposed to ⅓).


>  The full profiler dump is at the end of this message, but the gist of
>> it is:
>>
>
> Profiler affects execution time. In particular it slowdown Counter
> implementation which uses more function calls. For real world measurement
> use different approach.


Doing some re-times, it seems that his originals for defaultdict, exception
and Counter were about right. I haven't timed the other.


>  Why is count() [i.e. collections.Counter] so slow?
>>
>
> Feel free to contribute a patch which fixes this "wart". Note that Counter
> shouldn't be slowdowned on mostly unique data.


I find it hard to agree that counter should be optimised for the
unique-data case, as surely it's much more oft used when there's a point to
counting?

Also, couldn't Counter just extend from defaultdict?

[toc] | [prev] | [next] | [standalone]


#51466

FromIan Kelly <ian.g.kelly@gmail.com>
Date2013-07-29 11:19 -0600
Message-ID<mailman.5249.1375118411.3114.python-list@python.org>
In reply to#51405
On Mon, Jul 29, 2013 at 5:49 AM, Joshua Landau <joshua@landau.ws> wrote:
> Also, couldn't Counter just extend from defaultdict?

It could, but I expect the C helper function in 3.4 will be faster
since it doesn't even need to call __missing__ in the first place.
And the cost (both in terms of maintenance and run-time) of adding
defaultdict to the Counter MRO just to reuse one tiny __missing__
method seems questionable, especially after taking into account that
the constructor signature is incompatible.

[toc] | [prev] | [next] | [standalone]


#51480

FromSerhiy Storchaka <storchaka@gmail.com>
Date2013-07-29 22:37 +0300
Message-ID<mailman.5256.1375126808.3114.python-list@python.org>
In reply to#51405
29.07.13 20:19, Ian Kelly написав(ла):
> On Mon, Jul 29, 2013 at 5:49 AM, Joshua Landau <joshua@landau.ws> wrote:
>> Also, couldn't Counter just extend from defaultdict?
>
> It could, but I expect the C helper function in 3.4 will be faster
> since it doesn't even need to call __missing__ in the first place.

I'm surprised, but the Counter constructor with commented out import of 
this accelerator is faster (at least for some data).

[toc] | [prev] | [next] | [standalone]


#51545

FromStefan Behnel <stefan_ml@behnel.de>
Date2013-07-30 08:39 +0200
Message-ID<mailman.5302.1375166352.3114.python-list@python.org>
In reply to#51405
Serhiy Storchaka, 29.07.2013 21:37:
> 29.07.13 20:19, Ian Kelly написав(ла):
>> On Mon, Jul 29, 2013 at 5:49 AM, Joshua Landau wrote:
>>> Also, couldn't Counter just extend from defaultdict?
>>
>> It could, but I expect the C helper function in 3.4 will be faster
>> since it doesn't even need to call __missing__ in the first place.
> 
> I'm surprised, but the Counter constructor with commented out import of
> this accelerator is faster (at least for some data).

Read my post. The accelerator doesn't take the fast path for dicts as
Counter is only a subtype of dict, not exactly a dict. That means that it
raises and catches a KeyError exception for each new value that it finds,
and that is apparently more costly than the overhead of calling get().

So, my expectation is that it's faster for highly repetitive data and
slower for mostly unique data.

Maybe a "fast_dict_lookup" option for the accelerator that forces the fast
path would fix this. The Counter class, just like many (most?) other
subtypes of dict, definitely doesn't need the fallback behaviour.

Stefan

[toc] | [prev] | [next] | [standalone]


#51546

FromStefan Behnel <stefan_ml@behnel.de>
Date2013-07-30 08:51 +0200
Message-ID<mailman.5303.1375167121.3114.python-list@python.org>
In reply to#51405
Stefan Behnel, 30.07.2013 08:39:
> Serhiy Storchaka, 29.07.2013 21:37:
>> 29.07.13 20:19, Ian Kelly написав(ла):
>>> On Mon, Jul 29, 2013 at 5:49 AM, Joshua Landau wrote:
>>>> Also, couldn't Counter just extend from defaultdict?
>>>
>>> It could, but I expect the C helper function in 3.4 will be faster
>>> since it doesn't even need to call __missing__ in the first place.
>>
>> I'm surprised, but the Counter constructor with commented out import of
>> this accelerator is faster (at least for some data).
> 
> Read my post. The accelerator doesn't take the fast path for dicts as
> Counter is only a subtype of dict, not exactly a dict. That means that it
> raises and catches a KeyError exception for each new value that it finds,
> and that is apparently more costly than the overhead of calling get().
> 
> So, my expectation is that it's faster for highly repetitive data and
> slower for mostly unique data.
> 
> Maybe a "fast_dict_lookup" option for the accelerator that forces the fast
> path would fix this. The Counter class, just like many (most?) other
> subtypes of dict, definitely doesn't need the fallback behaviour.

Or rather drop the fallback path completely. It's not worth having code
duplication if it's not predictable up-front (before looking at the data)
if it will help or not.

http://bugs.python.org/issue18594

Stefan

[toc] | [prev] | [next] | [standalone]


#51554

FromSerhiy Storchaka <storchaka@gmail.com>
Date2013-07-30 16:04 +0300
Message-ID<mailman.5305.1375189475.3114.python-list@python.org>
In reply to#51405
29.07.13 14:49, Joshua Landau написав(ла):
> I find it hard to agree that counter should be optimised for the
> unique-data case, as surely it's much more oft used when there's a point
> to counting?

Different methods are faster for different data. LBYL approach is best 
for the mostly unique data case, while EAFP approach is best for the 
mostly repeated data case. In general case a performance of particular 
method is a function of its performances in this two extreme cases. When 
it slow for one of extreme case it can be slow in a number of 
intermediate cases.

> Also, couldn't Counter just extend from defaultdict?

Unfortunately this only will slowdown it.

[toc] | [prev] | [standalone]


Back to top | Article view | comp.lang.python


csiph-web