Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed4a.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <CALyJZZXMmSk8L7+P7erQSp2W4EpkbxiXmJkY-+5svfK5C3Z0kw@mail.gmail.com>
References: <CALyJZZXMmSk8L7+P7erQSp2W4EpkbxiXmJkY-+5svfK5C3Z0kw@mail.gmail.com>
Date: Wed, 30 Jul 2014 18:57:46 -0500
Subject: Re: speed up pandas calculation
From: Skip Montanaro <skip.montanaro@gmail.com>
To: Vincent Davis <vincent@vincentdavis.net>
Content-Type: multipart/alternative; boundary=485b397dd70141177804ff71ecad
Cc: Python <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.12448.1406764670.18130.python-list@python.org>
Lines: 57
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:75391

--485b397dd70141177804ff71ecad
Content-Type: text/plain; charset=UTF-8

> df = pd.read_csv('nhamcsopd2010.csv' , index_col='PATCODE',
>                  low_memory=False)
> col_init = list(df.columns.values)
> keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
>             'MED2', 'MED3', 'MED4', 'MED5']
> for col in col_init:
>     if col not in keep_col:
>         del df[col]

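I'm no pandas expert, but a few things come to mind. First, where is your
code slow? Profile it, even if only with a few well-placed prints. If the
time goes to read_csv, there might be little you can do, unless you load
the same data repeatedly, in which case you can save a pickled data frame
as a cache. Second, you loop over the columns, deciding one by one whether
to keep or toss each. Instead, try selecting the ones you want in a single
step (take 'PATCODE' out of keep_col first; with index_col='PATCODE' it is
the index, not a column, so selecting it by name raises a KeyError):

df = df[keep_col]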
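
To see whether the column loop matters at all, something like the
following rough sketch could time both approaches. It does not use the
real file: the frame is synthetic, and the column names (C0, C1, ...) and
the helpers drop_by_loop, select_once and timed are made up for
illustration. Treat the numbers it prints as a hint, not a benchmark.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 200 integer columns, 10,000 rows.
rng = np.random.default_rng(0)
all_cols = [f"C{i}" for i in range(200)]
df = pd.DataFrame(rng.integers(0, 100, size=(10_000, 200)), columns=all_cols)
keep_col = ["C1", "C5", "C9"]  # hypothetical names of the columns to keep


def drop_by_loop(frame, keep):
    """Original approach: delete unwanted columns one at a time (in place)."""
    for col in list(frame.columns):
        if col not in keep:
            del frame[col]
    return frame


def select_once(frame, keep):
    """Suggested approach: pick the wanted columns in a single step."""
    return frame[keep]


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# The copy is made before the clock starts, so only the deletes are timed.
looped, t_loop = timed(drop_by_loop, df.copy(), keep_col)
selected, t_select = timed(select_once, df, keep_col)
print(f"loop: {t_loop:.4f}s  select: {t_select:.4f}s")
```

Wrapping the read_csv call in the same timed() helper would show whether
the read or the column handling dominates.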

Third, if deleting those other columns is costly, can you perhaps just
ignore them?
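
One way to ignore the unwanted columns is never to read them: read_csv
takes a usecols argument. The sketch below combines that with the pickle
cache mentioned earlier. The load_frame helper and the .pkl file name are
assumptions of mine, not anything from your code, and the cache is naive:
it does not notice when the CSV changes.

```python
import os

import pandas as pd

keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
            'MED1', 'MED2', 'MED3', 'MED4', 'MED5']


def load_frame(csv_path, cache_path, columns, index_col):
    """Read only `columns` from the CSV, caching the result as a pickle.

    Hypothetical helper: if cache_path exists it is trusted as-is, so
    delete it by hand whenever the CSV changes.
    """
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    # index_col must be one of the names listed in usecols.
    frame = pd.read_csv(csv_path, usecols=columns, index_col=index_col)
    frame.to_pickle(cache_path)
    return frame


# Example call (file names assumed):
# df = load_frame('nhamcsopd2010.csv', 'nhamcsopd2010.pkl', keep_col, 'PATCODE')
```

Only unpickle files you wrote yourself; a pickle from an untrusted source
can run arbitrary code when loaded.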

Can't be more investigative right now. I don't have pandas on Android. :-)

Skip

--485b397dd70141177804ff71ecad--