Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder3.xlned.com!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
Sender: joshua.landau.ws@gmail.com
In-Reply-To: <kt51t3$r61$1@ger.gmane.org>
References: <roy-8C60F5.15590428072013@news.panix.com> <kt51t3$r61$1@ger.gmane.org>
From: Joshua Landau <joshua@landau.ws>
Date: Mon, 29 Jul 2013 12:49:53 +0100
Subject: Re: collections.Counter surprisingly slow
To: Serhiy Storchaka <storchaka@gmail.com>
Content-Type: multipart/alternative; boundary=001a11c3643894271f04e2a51987
Cc: python-list <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5226.1375098641.3114.python-list@python.org>
Lines: 109
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:51436

--001a11c3643894271f04e2a51987
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 29 July 2013 07:25, Serhiy Storchaka <storchaka@gmail.com> wrote:

> 28.07.13 22:59, Roy Smith wrote:
>
>    The input is an 8.8 Mbyte file containing about 570,000 lines (11,000
>> unique strings).
>>
>
> Repeat your tests with totally unique lines.


Counter is about ½ the speed of defaultdict in that case (as opposed to ⅓).
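
For anyone who wants to reproduce the comparison, here is a minimal sketch
of the kind of timing I mean. It is not Roy's original script: the helper
names (make_lines, count_with_defaultdict, count_with_counter, compare) and
the synthetic "line N" strings are my own assumptions, sized to match the
570,000 lines and 11,000 unique strings he described. It uses timeit rather
than the profiler, and the ratios will vary with the Python version.

```python
import timeit
from collections import Counter, defaultdict


def make_lines(total, unique):
    # Synthetic stand-in for the input file (assumption, not the real data):
    # `total` strings, of which `unique` are distinct.
    return ["line %d" % (i % unique) for i in range(total)]


def count_with_defaultdict(lines):
    counts = defaultdict(int)
    for line in lines:
        counts[line] += 1
    return counts


def count_with_counter(lines):
    counts = Counter()
    for line in lines:
        counts[line] += 1
    return counts


def compare(lines, repeat=3):
    # Best-of-`repeat` wall-clock time in seconds for each approach.
    results = {}
    for func in (count_with_defaultdict, count_with_counter):
        timings = timeit.repeat(lambda: func(lines), number=1, repeat=repeat)
        results[func.__name__] = min(timings)
    return results


if __name__ == "__main__":
    for label, unique in (("repeated", 11000), ("unique", 570000)):
        lines = make_lines(570000, unique)
        for name, seconds in compare(lines).items():
            print("%-9s %-24s %.3f s" % (label, name, seconds))
```

Taking the minimum of a few repeats is the usual way to reduce noise from
other processes on the machine.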


>  The full profiler dump is at the end of this message, but the gist of
>> it is:
>>
>
> The profiler affects execution time. In particular, it slows down the
> Counter implementation, which uses more function calls. For real-world
> measurements, use a different approach.


Having re-timed them, it seems that his original figures for the defaultdict,
exception and Counter versions were about right. I haven't timed the others.


>  Why is count() [i.e. collections.Counter] so slow?
>>
>
> Feel free to contribute a patch which fixes this "wart". Note that Counter
> shouldn't be slowed down on mostly unique data.


I find it hard to agree that Counter should be optimised for the
unique-data case, as surely it's used far more often when there are
repeats worth counting?

Also, couldn't Counter just extend from defaultdict?
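
To make the question concrete, here is a rough sketch of what I have in
mind. DefaultDictCounter and update_counts are hypothetical names of my own,
not anything in the standard library, and this covers only a small slice of
what collections.Counter offers.

```python
from collections import defaultdict


class DefaultDictCounter(defaultdict):
    """Hypothetical Counter-like class built on defaultdict(int)."""

    def __init__(self, iterable=()):
        # int() returns 0, so missing keys start counting from zero.
        super().__init__(int)
        self.update_counts(iterable)

    def update_counts(self, iterable):
        # Increment the count of every element in `iterable`.
        for item in iterable:
            self[item] += 1

    def most_common(self, n=None):
        # Items sorted by descending count; ties keep insertion order.
        ranked = sorted(self.items(), key=lambda kv: kv[1], reverse=True)
        return ranked if n is None else ranked[:n]


if __name__ == "__main__":
    counts = DefaultDictCounter("abracadabra")
    print(counts.most_common(2))
```

One behavioural difference to be aware of: looking up a missing key in a
defaultdict inserts it, whereas Counter returns 0 without storing the key,
so the two are not drop-in replacements for each other.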

--001a11c3643894271f04e2a51987--