Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.mixmin.net!newsreader4.netcologne.de!news.netcologne.de!xlned.com!feeder1.xlned.com!newsfeed.xs4all.nl!newsfeed4a.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <20140827162157.72e20091@bigbox.christie.dr>
References: <CABRP1o-X3wadeNh-7sf7XvEFGAus-QLQ=FTEab3hRT+y2EzyWA@mail.gmail.com> <20140827162157.72e20091@bigbox.christie.dr>
From: Chris Kaynor <ckaynor@zindagigames.com>
Date: Wed, 27 Aug 2014 14:50:59 -0700
Subject: Re: iterating over strings seems to be really slow?
To: Tim Chase <python.list@tim.thechases.com>
Content-Type: multipart/alternative; boundary=f46d0438edd79057be0501a36b35
Cc: "python-list@python.org" <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.13526.1409176281.18130.python-list@python.org>
Lines: 262
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:77163

--f46d0438edd79057be0501a36b35
Content-Type: text/plain; charset=UTF-8

On Wed, Aug 27, 2014 at 1:53 PM, Rodrick Brown <rodrick.brown@gmail.com>
wrote:

> def wc1():
>     word=""
>     m={}
>     for c in s:
>         if c != " ":
>             word += c
>         else:
>             if m.has_key(word):
>                 m[word] += 1
>             else:
>                 m[word] = 1
>             word=""
>     return(m)
>
> def wc2():
>     m={}
>     for c in s.split():
>         if m.has_key(c):
>             m[c] += 1
>         else:
>             m[c] = 1
>     return(m)


On Wed, Aug 27, 2014 at 2:21 PM, Tim Chase <python.list@tim.thechases.com>
wrote:
>
> The thing that surprises me is that using collections.Counter() and
> collections.defaultdict(int) are also slower than your wc2():
>
> from collections import defaultdict, Counter
>
> def wc3():
>     return Counter(s.split())
>
> def wc4():
>     m = defaultdict(int)
>     for c in s.split():
>         m[c] += 1
>     return m
>

I ran a couple more experiments and, at least on my machine, the difference
has to do with the number of duplicate words found. I also added two more
variations of my own:

def wc5(s):  # I expect this one to be slow.
    m = {}
    for c in s.split():
        m.setdefault(c, 0)
        m[c] += 1
    return m

def wc6(s):  # This one might be better than any other option presented so far.
    m = {}
    for c in s.split():
        try:
            m[c] += 1
        except KeyError:
            m[c] = 1
    return m

I also switched the OP's versions to use the "in" operator rather than
has_key, as I am running Python 3.4.1.
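
A minimal sketch of what that change might look like, assuming the
functions also take s as a parameter (as the timeit calls below suggest).
This is a reconstruction, not necessarily the exact code that was timed:

```python
def wc1(s):
    # Character-by-character scan; a word ends at each space, so the
    # last word is only counted if the string ends with a space.
    word = ""
    m = {}
    for c in s:
        if c != " ":
            word += c
        else:
            if word in m:  # replaces m.has_key(word), which Python 3 removed
                m[word] += 1
            else:
                m[word] = 1
            word = ""
    return m


def wc2(s):
    # Let str.split() find the words, then count them.
    m = {}
    for c in s.split():
        if c in m:  # replaces m.has_key(c)
            m[c] += 1
        else:
            m[c] = 1
    return m
```

The two agree on single-space-separated input with a trailing space;
with repeated spaces wc1 would also count empty strings as words.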

With the same dataset the OP provided (plus a trailing space), here are my
times (s = "The black cat jump over the bigger black cat "):

>>> timeit.timeit("wc1(s)", setup=setup, number=1000000)
6.076951338314008
>>> timeit.timeit("wc2(s)", setup=setup, number=1000000)  # fastest
2.451220378346954
>>> timeit.timeit("wc3(s)", setup=setup, number=1000000)
5.249674617410577
>>> timeit.timeit("wc4(s)", setup=setup, number=1000000)
3.531042215121076
>>> timeit.timeit("wc5(s)", setup=setup, number=1000000)
3.4734603842861205
>>> timeit.timeit("wc6(s)", setup=setup, number=1000000)
4.322543365103378
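
The setup string is not shown in this post. A hedged sketch of one way it
could be written, assuming it simply defines s and the function under test
inline (the contents of setup here are an assumption, and only wc2 is
included to keep it short):

```python
import timeit

# Assumed setup: defines the test string and one function inline so the
# timed statement "wc2(s)" can refer to both by name.
setup = '''
s = "The black cat jump over the bigger black cat "

def wc2(s):
    m = {}
    for c in s.split():
        if c in m:
            m[c] += 1
        else:
            m[c] = 1
    return m
'''

# A small repetition count keeps this quick; the runs above used 1000000.
elapsed = timeit.timeit("wc2(s)", setup=setup, number=1000)
print(elapsed)
```

timeit runs the setup code once and excludes it from the measurement, so
only the repeated wc2(s) calls are timed.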


When I increase the data set by multiplying the OP's string 1000 times (s =
"The black cat jump over the bigger black cat "*1000), here are the times
(I reduced the number of repetitions to keep the time reasonable):

>>> timeit.timeit("wc1(s)", setup=setup, number=1000)
5.807871555058417
>>> timeit.timeit("wc2(s)", setup=setup, number=1000)
2.3245083748933535
>>> timeit.timeit("wc3(s)", setup=setup, number=1000)  # fastest
1.5722138905703211
>>> timeit.timeit("wc4(s)", setup=setup, number=1000)
1.901478857657942
>>> timeit.timeit("wc5(s)", setup=setup, number=1000)
3.065888476414475
>>> timeit.timeit("wc6(s)", setup=setup, number=1000)
2.0125233934956217


It seems that with a large number of duplicate words, the Counter version
(wc3) is the fastest; with fewer duplicates, the containment check (wc2) is
faster.
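
One way to probe this further is sketched below. The two data sets above
differ in length as well as in how often words repeat, so this version
holds the word count at 9000 and varies only the number of distinct words.
The compare() helper and both test strings are made up for illustration,
and no timings are claimed here since they will vary by machine and
Python version:

```python
import timeit
from collections import Counter


def wc2(s):
    # Containment check, as in the OP's version adapted to "in".
    m = {}
    for c in s.split():
        if c in m:
            m[c] += 1
        else:
            m[c] = 1
    return m


def wc3(s):
    return Counter(s.split())


def compare(s, number):
    """Return (wc2_seconds, wc3_seconds) for the given input (hypothetical helper)."""
    t2 = timeit.timeit(lambda: wc2(s), number=number)
    t3 = timeit.timeit(lambda: wc3(s), number=number)
    return t2, t3


# Both inputs contain 9000 words; only the amount of repetition differs.
few_dupes = " ".join(str(i) for i in range(9000))  # every word unique
many_dupes = "The black cat jump over the bigger black cat " * 1000

for label, text in (("unique words", few_dupes), ("many duplicates", many_dupes)):
    t2, t3 = compare(text, number=20)
    print(label, "wc2:", t2, "wc3:", t3)
```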

Chris

--f46d0438edd79057be0501a36b35--