From: Joshua Landau
Date: Mon, 29 Jul 2013 13:07:48 +0100
Subject: Re: collections.Counter surprisingly slow
To: Stefan Behnel
Cc: python-list
Newsgroups: comp.lang.python
References: <51f5843f$0$29971$c3e8da3$5496439d@news.astraweb.com>

On 29 July 2013 12:46, Stefan Behnel wrote:
> Steven D'Aprano, 28.07.2013 22:51:
> > Calling Counter ends up calling essentially this code:
> >
> > for elem in iterable:
> >     self[elem] = self.get(elem, 0) + 1
> >
> > (although micro-optimized), where "iterable" is your data (lines).
> > Calling the get method has higher overhead than dict[key], that will also
> > contribute.
>
> It comes with a C accelerator (at least in Py3.4dev), but it seems like
> that stumbles a bit over its own feet. The accelerator function special
> cases the (exact) dict type, but the Counter class is a subtype of dict and
> thus takes the generic path, which makes it benefit a bit less than
> possible.
>
> Look for _count_elements() in
>
> http://hg.python.org/cpython/file/tip/Modules/_collectionsmodule.c
>
> Nevertheless, even the generic C code path looks fast enough in general. I
> think the problem is just that the OP used Python 2.7, which doesn't have
> this accelerator function.

# _count_elements({}, items), _count_elements(dict_subclass(), items), Counter(items), defaultdict(int) loop with exception handling
# "items" is always 1m long with varying levels of repetition

>>> for items in randoms:
...     helper.timeit(1), helper_subclass.timeit(1), counter.timeit(1), default.timeit(1)
...
(0.18816172199876746, 0.4679023139997298, 0.9684444869999425, 0.33518486200046027)
(0.2936601179990248, 0.6056111739999324, 1.1316078849995392, 0.46283868699902087)
(0.35396358400066674, 0.685048443998312, 1.2120939880005608, 0.5497965239992482)
(0.5337620789996436, 0.8658702100001392, 1.4507492869997805, 0.7772859329998028)
(0.745282343999861, 1.1455801379997865, 2.116569702000561, 1.3293145009993168)

:(

I have the helper but Counter is still slow. Is it not getting used for some reason? It's not even as fast as helper on a dict's (direct, no overridden methods) subclass.
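
For anyone who wants to reproduce the comparison, here is a rough sketch of how the four timings could be set up. It is not the exact script behind the numbers above. The names DictSubclass, make_items, count_with_helper, count_with_defaultdict and time_all are made up for illustration, the defaultdict loop is a plain "+= 1" loop, and _count_elements is a private CPython helper imported from collections, so it may change or be missing on other versions or implementations.

```python
import random
import timeit
from collections import Counter, defaultdict

# Private CPython helper (assumption: present in this interpreter).
# It is not part of the public API and may change without notice.
from collections import _count_elements


class DictSubclass(dict):
    """Direct dict subclass with no overridden methods (hypothetical name)."""


def count_with_helper(mapping, items):
    """Count items into the given mapping using the helper, then return it."""
    _count_elements(mapping, items)
    return mapping


def count_with_defaultdict(items):
    """Plain defaultdict(int) counting loop, used as a baseline."""
    counts = defaultdict(int)
    for item in items:
        counts[item] += 1
    return counts


def make_items(n, distinct, seed=0):
    """Build n integers drawn from range(distinct); smaller 'distinct' means more repetition."""
    rng = random.Random(seed)
    return [rng.randrange(distinct) for _ in range(n)]


def time_all(items):
    """Return one-shot timings: helper on dict, helper on subclass, Counter, defaultdict."""
    return (
        timeit.timeit(lambda: count_with_helper({}, items), number=1),
        timeit.timeit(lambda: count_with_helper(DictSubclass(), items), number=1),
        timeit.timeit(lambda: Counter(items), number=1),
        timeit.timeit(lambda: count_with_defaultdict(items), number=1),
    )
```

Calling time_all(make_items(10**6, 1000)) gives a 4-tuple in the same column order as the table above. Varying the second argument of make_items changes the level of repetition. Absolute numbers will differ by machine and Python version.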