Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed2a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <CANc-5UzCLT8ZPF2PCDMQyOqr0gqrDjFGnsRAoFBfGVmCh5-0jg@mail.gmail.com>
References: <CALyJZZXMmSk8L7+P7erQSp2W4EpkbxiXmJkY-+5svfK5C3Z0kw@mail.gmail.com> <CANc-5UzCLT8ZPF2PCDMQyOqr0gqrDjFGnsRAoFBfGVmCh5-0jg@mail.gmail.com>
From: Vincent Davis <vincent@vincentdavis.net>
Date: Wed, 30 Jul 2014 18:28:28 -0600
Subject: Re: speed up pandas calculation
To: Skip Montanaro <skip.montanaro@gmail.com>
Content-Type: multipart/alternative; boundary=089e0158aba23fb24004ff725b5e
Cc: Python <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.12458.1406785998.18130.python-list@python.org>
Lines: 118
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:75404

--089e0158aba23fb24004ff725b5e
Content-Type: text/plain; charset=UTF-8

On Wed, Jul 30, 2014 at 5:57 PM, Skip Montanaro <skip.montanaro@gmail.com>
wrote:

> > df = pd.read_csv('nhamcsopd2010.csv' , index_col='PATCODE',
> low_memory=False)
> > col_init = list(df.columns.values)
> > keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
> 'MED2', 'MED3', 'MED4', 'MED5']
> > for col in col_init:
> >     if col not in keep_col:
> >         del df[col]
>
> I'm no pandas expert, but a couple things come to mind. First, where is
> your code slow (profile it, even with a few well-placed prints)? If it's in
> read_csv there might be little you can do unless you load those data
> repeatedly, and can save a pickled data frame as a caching measure. Second,
> you loop over columns deciding one by one whether to keep or toss a column.
> Instead try
>
> df = df[keep_col]
>
> Third, if deleting those other columns is costly, can you perhaps just
> ignore them?
>
> Can't be more investigative right now. I don't have pandas on Android. :-)
>

So df = df[keep_col] is not fast, but it is not that slow either. You made me
think of a solution to that part: just slice and copy. The only gotcha is
that every name in keep_col has to actually exist in df:
keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH', 'MED1', 'MED2', 'MED3',
'MED4', 'MED5']
df = df[keep_col]
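
One way to guard against a name in keep_col that is not in the frame is to
filter the list against df.columns before slicing. This is only a minimal
sketch: the tiny frame below stands in for the real CSV, and the helper name
select_existing is made up for illustration.

```python
import pandas as pd

def select_existing(df, wanted):
    """Return (copy of df limited to the wanted names that exist, missing names)."""
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    # Slice once with the names that are really there, then copy so later
    # assignments do not touch the original frame.
    return df[present].copy(), missing

# Made-up stand-in data, not the real nhamcsopd2010.csv contents.
df = pd.DataFrame({'PATWT': [1.0, 2.0], 'VMONTH': [1, 2],
                   'MED1': [10, 20], 'OTHER': [0, 0]})
keep_col = ['PATWT', 'VDAY', 'VMONTH', 'MED1']

small, missing = select_existing(df, keep_col)
print(list(small.columns), missing)
```

Reporting the missing names rather than silently dropping them makes a typo
in keep_col easy to spot.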

The real slow part seems to be:
for n in drugs:
    df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(axis=1)
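
A possible way to cut the cost of that loop is to slice the five MED columns
once, outside the loop, and compare the resulting NumPy array against each
code. This is a sketch under assumptions, not a measured fix: it assumes
drugs is a dict mapping a new column name to a single code, that the codes
have the same type as the MED values, and it uses made-up data. Whether it is
actually faster on the real file would need timing.

```python
import pandas as pd

MED_COLS = ['MED1', 'MED2', 'MED3', 'MED4', 'MED5']

def flag_drugs(df, drugs):
    """Add one boolean column per entry in drugs (column name -> code)."""
    meds = df[MED_COLS].to_numpy()  # slice the MED columns a single time
    for name, code in drugs.items():
        # True where any of the five MED columns equals this code.
        df[name] = (meds == code).any(axis=1)
    return df

# Made-up stand-in data and codes (assumptions, not from the real survey file).
df = pd.DataFrame({'MED1': [100, 0, 200], 'MED2': [0, 0, 0],
                   'MED3': [0, 100, 0], 'MED4': [0, 0, 0],
                   'MED5': [0, 0, 0]})
drugs = {'aspirin': 100, 'ibuprofen': 200}

flag_drugs(df, drugs)
print(df[['aspirin', 'ibuprofen']])
```

Note that == does not match NaN against NaN the way isin does, so the two
versions can differ if a code is itself missing.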



Vincent Davis
720-301-3003

--089e0158aba23fb24004ff725b5e--