Groups > comp.lang.python > #75389 > unrolled thread

speed up pandas calculation

Started by	Vincent Davis <vincent@vincentdavis.net>
First post	2014-07-30 17:04 -0600
Last post	2014-07-31 00:57 +0000
Articles	2 — 2 participants

Back to article view | Back to comp.lang.python

  speed up pandas calculation Vincent Davis <vincent@vincentdavis.net> - 2014-07-30 17:04 -0600
    Re: speed up pandas calculation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-07-31 00:57 +0000

#75389 — speed up pandas calculation

From	Vincent Davis <vincent@vincentdavis.net>
Date	2014-07-30 17:04 -0600
Subject	speed up pandas calculation
Message-ID	<mailman.12446.1406761473.18130.python-list@python.org>

[Multipart message — attachments visible in raw view] — view raw

I know this is a general python list and I am asking about pandas but this
question is probably not great for asking on stackoverflow.
I have a list of files (~80 files, ~30,000 rows) I need to process with my
current code it is take minutes for each file. Any suggestions of a fast
way. I am try to stick with pandas for educational purposes. Any
suggestions would be great. If you are curious the can find the data file I
am using below here. http://www.nber.org/nhamcs/data/nhamcsopd2010.csv

drugs_current = {'CITALOPRAM': 4332,
         'ESCITALOPRAM': 4812,
         'FLUOXETINE': 236,
         'FLUVOXAMINE': 3804,
         'PAROXETINE': 3157,
         'SERTRALINE': 880,
         'METHYLPHENIDATE': 900,
         'DEXMETHYLPHENIDATE': 4777,
         'AMPHETAMINE-DEXTROAMPHETAMINE': 4035,
         'DEXTROAMPHETAMINE': 804,
         'LISDEXAMFETAMINE': 6663,
         'METHAMPHETAMINE': 805,
         'ATOMOXETINE': 4827,
         'CLONIDINE': 44,
         'GUANFACINE': 717}

drugs_98_05 = { 'SERTRALINE': 56635,
                'CITALOPRAM': 59829,
                'FLUOXETINE': 80006,
                'PAROXETINE_HCL': 57150,
                'FLUVOXAMINE': 57064,
                'ESCITALOPRAM': 70466,
                'DEXMETHYLPHENIDATE': 70427,
                'METHYLPHENIDATE': 70374,
                'METHAMPHETAMINE': 53485,
                'AMPHETAMINE1': 70257,
                'AMPHETAMINE2': 70258,
                'AMPHETAMINE3': 50265,
                'DEXTROAMPHETAMINE1': 70259,
                'DEXTROAMPHETAMINE2': 70260,
                'DEXTROAMPHETAMINE3': 51665,
                'COMBINATION_PRODUCT': 51380,
                'FIXED_COMBINATION': 51381,
                'ATOMOXETINE': 70687,
                'CLONIDINE1': 51275,
                'CLONIDINE2': 70357,
                'GUANFACINE': 52498
               }

df = pd.read_csv('nhamcsopd2010.csv' , index_col='PATCODE',
low_memory=False)
col_init = list(df.columns.values)
keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1', 'MED2',
'MED3', 'MED4', 'MED5']
for col in col_init:
    if col not in keep_col:
        del df[col]
if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98',
'99', '00', '91', '02', '03', '04', '05'):
    drugs = drugs_98_05
elif f[-3:]  == 'csv' and f[-6:-4] in ('06', '08', '09', '10'):
    drugs = drugs_current
for n in drugs:
    df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)


Vincent Davis
720-301-3003

[toc] | [next] | [standalone]

#75392

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-07-31 00:57 +0000
Message-ID	<53d99497$0$29974$c3e8da3$5496439d@news.astraweb.com>
In reply to	#75389

On Wed, 30 Jul 2014 17:04:04 -0600, Vincent Davis wrote:

> I know this is a general python list and I am asking about pandas but
> this question is probably not great for asking on stackoverflow. I have
> a list of files (~80 files, ~30,000 rows) I need to process with my
> current code it is take minutes for each file. 

Hmmm, is that 30,000 rows per file, or 30,000 files in total?

Not that it really matters, I shouldn't expect that it makes that much 
difference either way.

> Any suggestions of a fast
> way. I am try to stick with pandas for educational purposes. Any
> suggestions would be great. If you are curious the can find the data
> file I am using below here.
> http://www.nber.org/nhamcs/data/nhamcsopd2010.csv

For brevity, I've trimmed back the dictionaries to something smaller. 
That's just for ease of reading.

> drugs_current = {'CITALOPRAM': 4332,
>          'ESCITALOPRAM': 4812,
           [...]
>          'CLONIDINE': 44,
>          'GUANFACINE': 717}
> 
> drugs_98_05 = { 'SERTRALINE': 56635,
>                 'CITALOPRAM': 59829,
                  [...]
>                 'CLONIDINE2': 70357,
>                 'GUANFACINE': 52498
>                }
> 
> df = pd.read_csv('nhamcsopd2010.csv' , index_col='PATCODE',
>                  low_memory=False)
> col_init = list(df.columns.values)
> keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
>             'MED2', 'MED3', 'MED4', 'MED5']
> for col in col_init:
>     if col not in keep_col:
>         del df[col]

I expect that this could be your problem here. Deleting from the start or 
middle of lists is slow, and df may be a list or at least have list-like 
performance. Suppose you have a list like this:

['a', 'b', 'c', 'd', 'e', ..., 'zzz']

that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you 
delete item 0, 'a':

=> ['b', 'c', 'd', 'e', ..., 'zzz']

Python has to move the remaining 18278 items across one space. Then you 
delete 'b':

=> ['c', 'd', 'e', ..., 'zzz']

Python has to move the remaining 18276 items across one space, making a 
total of 36559 moves. And that's just to delete two items. Roughly 
speaking, if you end up deleting N items from a list starting from the 
front, Python may have to move as many as N**2 items into their final 
positions. If you have 5 or 10 columns, that's not too bad, but if you 
have (say) 80 columns, and delete 70 of them, that could be *horribly* 
expensive.

If you must *repeatedly* use del on lists, it's best to ensure you're 
deleting from the end, not the start. But even better, and this applies 
to anything not just lists, is not to delete at all, but to create a new 
list, copying the columns you want, rather than deleting the columns you 
don't want.

I'm not familiar with pandas and am not sure about the exact syntax 
needed, but something like:

new_df = []  # Assuming df is a list.
for col in df:
    if col.value in keep_col:
        new_df.append(col)

> if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98',
> '99', '00', '91', '02', '03', '04', '05'):

Where does f come from? You haven't shown the definition of that.

-- 
Steven

[toc] | [prev] | [standalone]

csiph-web

speed up pandas calculation

Contents

#75389 — speed up pandas calculation

#75392