From: Vincent Davis
Date: Wed, 30 Jul 2014 17:04:04 -0600
Subject: speed up pandas calculation
To: "python-list@python.org"
Newsgroups: comp.lang.python

I know this is a general Python list and I am asking about pandas, but this question is probably not a good fit for Stack Overflow.

I have a list of files (~80 files, ~30,000 rows) that I need to process, and with my current code it takes minutes for each file. Any suggestions for a faster way would be great. I am trying to stick with pandas for educational purposes. If you are curious, you can find the data file I am using here:
http://www.nber.org/nhamcs/data/nhamcsopd2010.csv

import pandas as pd

# Drug name -> drug code, as used in the 2006 and later files.
drugs_current = {'CITALOPRAM': 4332,
                 'ESCITALOPRAM': 4812,
                 'FLUOXETINE': 236,
                 'FLUVOXAMINE': 3804,
                 'PAROXETINE': 3157,
                 'SERTRALINE': 880,
                 'METHYLPHENIDATE': 900,
                 'DEXMETHYLPHENIDATE': 4777,
                 'AMPHETAMINE-DEXTROAMPHETAMINE': 4035,
                 'DEXTROAMPHETAMINE': 804,
                 'LISDEXAMFETAMINE': 6663,
                 'METHAMPHETAMINE': 805,
                 'ATOMOXETINE': 4827,
                 'CLONIDINE': 44,
                 'GUANFACINE': 717}

# Drug name -> drug code, as used in the 2005 and earlier files.
drugs_98_05 = {'SERTRALINE': 56635,
               'CITALOPRAM': 59829,
               'FLUOXETINE': 80006,
               'PAROXETINE_HCL': 57150,
               'FLUVOXAMINE': 57064,
               'ESCITALOPRAM': 70466,
               'DEXMETHYLPHENIDATE': 70427,
               'METHYLPHENIDATE': 70374,
               'METHAMPHETAMINE': 53485,
               'AMPHETAMINE1': 70257,
               'AMPHETAMINE2': 70258,
               'AMPHETAMINE3': 50265,
               'DEXTROAMPHETAMINE1': 70259,
               'DEXTROAMPHETAMINE2': 70260,
               'DEXTROAMPHETAMINE3': 51665,
               'COMBINATION_PRODUCT': 51380,
               'FIXED_COMBINATION': 51381,
               'ATOMOXETINE': 70687,
               'CLONIDINE1': 51275,
               'CLONIDINE2': 70357,
               'GUANFACINE': 52498}

# f is the name of the file being processed; in the full script it comes
# from the loop over the ~80 files.
f = 'nhamcsopd2010.csv'
df = pd.read_csv(f, index_col='PATCODE', low_memory=False)

# Drop every column that is not needed.
col_init = list(df.columns.values)
keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
            'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
for col in col_init:
    if col not in keep_col:
        del df[col]

# Pick the code table from the two-digit year at the end of the file name.
if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98', '99',
                                    '00', '01', '02', '03', '04', '05'):
    drugs = drugs_98_05
elif f[-3:] == 'csv' and f[-6:-4] in ('06', '08', '09', '10'):
    drugs = drugs_current

# One boolean column per drug: True if any MED column holds that drug's code.
for n in drugs:
    df[n] = df[['MED1', 'MED2', 'MED3', 'MED4', 'MED5']].isin([drugs[n]]).any(axis=1)

Vincent Davis
720-301-3003
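One direction that might be worth trying, written as a rough sketch rather than a measured fix: read only the needed columns with `usecols` instead of deleting the rest one by one, and compute all the drug flags in a single NumPy comparison instead of one `isin` call per drug. The names `MED_COLS`, `KEEP_COLS`, `load_visits` and `add_drug_flags` are made up for this illustration, and the sketch assumes the MED columns hold numeric codes. Whether it is actually faster on the real files is untested.

```python
import numpy as np
import pandas as pd

# Hypothetical names for this sketch; the column lists come from the post.
MED_COLS = ['MED1', 'MED2', 'MED3', 'MED4', 'MED5']
KEEP_COLS = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR'] + MED_COLS


def load_visits(path):
    """Read only the needed columns instead of deleting the others afterwards."""
    return pd.read_csv(path, usecols=KEEP_COLS, index_col='PATCODE')


def add_drug_flags(df, drugs):
    """Return df plus one boolean column per drug name in `drugs`.

    A flag is True when any of the MED columns equals that drug's code.
    Assumes the MED columns are numeric.
    """
    meds = df[MED_COLS].to_numpy()          # shape (n_rows, 5)
    codes = np.array(list(drugs.values()))  # shape (n_drugs,)
    # Broadcast to (n_rows, 5, n_drugs), then collapse the MED axis.
    hits = (meds[:, :, None] == codes).any(axis=1)
    flags = pd.DataFrame(hits, index=df.index, columns=list(drugs))
    return pd.concat([df, flags], axis=1)
```

With the dictionaries from the post, usage would look like `df = add_drug_flags(load_visits(f), drugs)`. The intermediate boolean array has n_rows x 5 x n_drugs entries, which is small for ~30,000 rows but would need rethinking for much larger files.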