MIME-Version: 1.0
References: <mailman.465.1431559626.12865.python-list@python.org> <5553f8fe$0$13012$c3e8da3$5496439d@news.astraweb.com> <5554D40C.9090505@pacbell.net> <20150514121759.73f98b76@bigbox.christie.dr>
In-Reply-To: <20150514121759.73f98b76@bigbox.christie.dr>
From: Ziqi Xiong <xiongziqi84@gmail.com>
Date: Fri, 15 May 2015 03:31:34 +0000
Subject: Re: Looking for direction
To: Tim Chase <python.list@tim.thechases.com>, "20/20 Lab" <lab@pacbell.net>
Cc: python-list@python.org
Content-Type: multipart/alternative; boundary=001a1138165829165a0516167b37
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.29.1431674927.17265.python-list@python.org>

--001a1138165829165a0516167b37
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Maybe we can change this list to a dict, using (item[0], item[1]) as the key
and the whole item as the value. Then you can update rows that share the same
key, I think.
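
A minimal sketch of that idea, assuming the quantity is numeric and sits at
index 3. The sample rows and the name merge_rows are made up for illustration:

```python
def merge_rows(rows):
    """Collapse rows that share (row[0], row[1]), summing row[3]."""
    merged = {}
    for item in rows:
        key = (item[0], item[1])       # account and staff together form the key
        if key in merged:
            merged[key][3] += item[3]  # same key: add the quantity to the kept row
        else:
            merged[key] = list(item)   # first time: store a copy of the whole item
    return list(merged.values())


# Illustrative data only; real quantities would come from the CSV file.
rows = [
    [123, "XXX", "Item", 2, "Noise"],
    [72976, "YYY", "Item", 5, "Noise"],
    [123, "XXX", "ItemTypo", 3, "Noise"],
]
print(merge_rows(rows))
```

Storing a copy of each first-seen row keeps the original list untouched, which
may or may not matter for your use.
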
On Fri, 15 May 2015 at 01:17, Tim Chase <python.list@tim.thechases.com> wrote:

> On 2015-05-14 09:57, 20/20 Lab wrote:
> > On 05/13/2015 06:23 PM, Steven D'Aprano wrote:
> >>> I have a LARGE csv file that I need to process.  110+ columns,
> >>> 72k rows.  I managed to write enough to reduce it to a few
> >>> hundred rows, and the five columns I'm interested in.
> > I actually stumbled across the csv module after coding enough to
> > make a list of lists.  So that is more the reason I approached the
> > list; Nothing like spending hours (or days) coding something that
> > already exists and you just don't know about.
> >>> Now is where I have my problem:
> >>>
> >>> myList = [ [123, "XXX", "Item", "Qty", "Noise"],
> >>>              [72976, "YYY", "Item", "Qty", "Noise"],
> >>>              [123, "XXX", "ItemTypo", "Qty", "Noise"]    ]
> >>>
> >>> Basically, I need to check for rows with duplicate account
> >>> (row[0]) and staff (row[1]), and if so, remove that row, and add
> >>> its Qty to the original row. I really don't have a clue how to
> >>> go about this.
> >>
> >> processed = {}  # hold the processed data in a dict
> >>
> >> for row in myList:
> >>      account, staff = row[0:2]
> >>      key = (account, staff)  # Put them in a tuple.
> >>      if key in processed:
> >>          # We've already seen this combination.
> >>          processed[key][3] += row[3]  # Add the quantities.
> >>      else:
> >>          # Never seen this combination before.
> >>          processed[key] = row
> >>
> >> newlist = list(processed.values())
> >>
> > It does, immensely.  I'll make this work.  Thank you again for the
> > link from yesterday and apologies for hitting the wrong reply
> > button.  I'll have to study more on the usage and implementations
> > of dictionaries and tuples.
>
> In processing the initial CSV file, I suspect that using a
> csv.DictReader would make the code a bit cleaner.  Additionally,
> as you're processing through the initial file, unless you need
> the intermediate data, you should be able to do it in one pass.
> Something like
>
>   import csv
>
>   HEADER_ACCOUNT = "account"
>   HEADER_STAFF = "staff"
>   HEADER_QTY = "Qty"
>
>   processed = {}
>   with open("data.csv", newline="") as f:
>     reader = csv.DictReader(f)
>     for row in reader:
>       if should_process_row(row):
>         account = row[HEADER_ACCOUNT]
>         staff = row[HEADER_STAFF]
>         qty = row[HEADER_QTY]
>         try:
>           row[HEADER_QTY] = qty = int(qty)
>         except (TypeError, ValueError):
>           # not a numeric quantity?
>           continue
>         # from Steven's code
>         key = (account, staff)
>         if key in processed:
>           processed[key][HEADER_QTY] += qty
>         else:
>           # first row for this key: keep the whole row
>           processed[key] = row
>   do_something_with(processed.values())
>
> I find that using names is a lot clearer than using arbitrary
> indexing.  Barring that, using indexes-as-constants still would
> add further clarity.
>
> -tkc
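
If the data stays a list of lists, Tim's last point about index constants
could look something like the sketch below. The constant names, the function
name and the sample quantities are my own assumptions, not taken from the
real file:

```python
# Hypothetical column positions in the reduced five-column rows.
ACCOUNT, STAFF, ITEM, QTY, NOISE = range(5)


def total_qty_by_key(rows):
    """Return {(account, staff): summed quantity} for the given rows."""
    totals = {}
    for row in rows:
        key = (row[ACCOUNT], row[STAFF])
        # Quantities read from a CSV file arrive as strings, so convert them.
        totals[key] = totals.get(key, 0) + int(row[QTY])
    return totals


# Illustrative data only.
sample = [
    [123, "XXX", "Item", "2", "Noise"],
    [72976, "YYY", "Item", "5", "Noise"],
    [123, "XXX", "ItemTypo", "3", "Noise"],
]
print(total_qty_by_key(sample))
```

Here row[QTY] says what is being read, where row[3] leaves the reader to
count columns.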

--001a1138165829165a0516167b37--