Groups > comp.lang.python > #90589 > unrolled thread
| Started by | 20/20 Lab <lab@pacbell.net> |
|---|---|
| First post | 2015-05-13 16:24 -0700 |
| Last post | 2015-05-20 14:18 -0700 |
| Articles | 7 — 5 participants |
Looking for direction 20/20 Lab <lab@pacbell.net> - 2015-05-13 16:24 -0700
Re: Looking for direction Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-05-14 11:23 +1000
Re: Looking for direction 20/20 Lab <lab@pacbell.net> - 2015-05-14 09:57 -0700
Re: Looking for direction Tim Chase <python.list@tim.thechases.com> - 2015-05-14 12:17 -0500
Re: Looking for direction Ziqi Xiong <xiongziqi84@gmail.com> - 2015-05-15 03:31 +0000
Re: Looking for direction darnold <darnold992000@yahoo.com> - 2015-05-20 05:50 -0700
Re: Looking for direction 20/20 Lab <lab@pacbell.net> - 2015-05-20 14:18 -0700
| From | 20/20 Lab <lab@pacbell.net> |
|---|---|
| Date | 2015-05-13 16:24 -0700 |
| Subject | Looking for direction |
| Message-ID | <mailman.465.1431559626.12865.python-list@python.org> |
I'm a beginner to python. Reading here and there. Written a couple of
short and simple programs to make life easier around the office.
That being said, I'm not even sure what I need to ask for. I've never
worked with external data before.
I have a LARGE csv file that I need to process. 110+ columns, 72k
rows. I managed to write enough to reduce it to a few hundred rows, and
the five columns I'm interested in.
Now is where I have my problem:
myList = [ [123, "XXX", "Item", "Qty", "Noise"],
           [72976, "YYY", "Item", "Qty", "Noise"],
           [123, "XXX", "ItemTypo", "Qty", "Noise"] ]
Basically, I need to check for rows with duplicate accounts (row[0]) and
staff (row[1]), and if so, remove that row, and add its Qty to the
original row. I really don't have a clue how to go about this. The
number of rows changes based on which run it is, so I couldn't even get
away with using hundreds of compare loops.
If someone could point me to some documentation on the functions I would
need, or a tutorial, it would be a great help.
Thank you.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-05-14 11:23 +1000 |
| Message-ID | <5553f8fe$0$13012$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #90589 |
On Thu, 14 May 2015 09:24 am, 20/20 Lab wrote:
> I'm a beginner to python. Reading here and there. Written a couple of
> short and simple programs to make life easier around the office.
>
> That being said, I'm not even sure what I need to ask for. I've never
> worked with external data before.
>
> I have a LARGE csv file that I need to process. 110+ columns, 72k
> rows. I managed to write enough to reduce it to a few hundred rows, and
> the five columns I'm interested in.
That's not large. Large is millions of rows, or tens of millions if you have
enough memory. What's large to you and me is usually small to the computer.
You should use the csv module for handling the CSV file, if you aren't
already doing so. Do you need a url to the docs?
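For readers who have not used the csv module yet, here is a minimal sketch of reading a file and keeping only a handful of columns. The sample data, the column positions in KEEP, the helper name read_columns, and the in-memory io.StringIO standing in for a real file are all assumptions made for illustration; none of them come from the original post.

```python
import csv
import io

# Hypothetical sample standing in for the real file (which has 110+ columns).
SAMPLE = """account,staff,item,qty,noise,extra
123,XXX,Widget,2,a,ignored
72976,YYY,Gadget,5,b,ignored
"""

# Assumed positions of the columns of interest.
KEEP = [0, 1, 2, 3, 4]


def read_columns(fileobj, keep):
    """Return the data rows of a CSV file, keeping only the columns in `keep`."""
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row
    return [[row[i] for i in keep] for row in reader]


rows = read_columns(io.StringIO(SAMPLE), KEEP)
print(rows)
```

With a real file you would pass an open file object instead, for example `with open("data.csv", newline="") as f: rows = read_columns(f, KEEP)` on Python 3. Note that every field comes back as a string, so numeric columns still need converting.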
> Now is where I have my problem:
>
> myList = [ [123, "XXX", "Item", "Qty", "Noise"],
>            [72976, "YYY", "Item", "Qty", "Noise"],
>            [123, "XXX", "ItemTypo", "Qty", "Noise"] ]
>
> Basically, I need to check for rows with duplicate accounts (row[0]) and
> staff (row[1]), and if so, remove that row, and add its Qty to the
> original row. I really don't have a clue how to go about this.
Is the order of the rows important? If not, the problem is simpler.

processed = {}  # hold the processed data in a dict

for row in myList:
    account, staff = row[0:2]
    key = (account, staff)  # Put them in a tuple.
    if key in processed:
        # We've already seen this combination.
        processed[key][3] += row[3]  # Add the quantities.
    else:
        # Never seen this combination before.
        processed[key] = row

newlist = list(processed.values())

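The same idea as a self-contained sketch, under two assumptions that are not stated in the thread: the Qty column holds text (as it would straight out of a CSV file) and needs converting with int(), and each row is copied so the input list is left unmodified. The sample quantities and the name merge_duplicates are made up for illustration.

```python
QTY = 3  # index of the quantity column in each row


def merge_duplicates(rows):
    """Combine rows sharing (account, staff), summing their quantities."""
    processed = {}
    for row in rows:
        key = (row[0], row[1])
        qty = int(row[QTY])  # assumed: quantities arrive as text
        if key in processed:
            # Seen before: add this row's quantity to the first row kept.
            processed[key][QTY] += qty
        else:
            merged = list(row)  # copy so the caller's row is not modified
            merged[QTY] = qty
            processed[key] = merged
    return list(processed.values())


my_list = [
    [123, "XXX", "Item", "2", "Noise"],
    [72976, "YYY", "Item", "5", "Noise"],
    [123, "XXX", "ItemTypo", "3", "Noise"],
]
new_list = merge_duplicates(my_list)
print(new_list)
# [[123, 'XXX', 'Item', 5, 'Noise'], [72976, 'YYY', 'Item', 5, 'Noise']]
```

On Python 3.7 and later, dicts keep insertion order, so the result follows the first appearance of each (account, staff) pair. On older versions the order of the merged rows is not guaranteed.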
Does that help?
--
Steven
| From | 20/20 Lab <lab@pacbell.net> |
|---|---|
| Date | 2015-05-14 09:57 -0700 |
| Message-ID | <mailman.5.1431622814.17265.python-list@python.org> |
| In reply to | #90598 |
On 05/13/2015 06:23 PM, Steven D'Aprano wrote:
> You should use the csv module for handling the CSV file, if you aren't
> already doing so. Do you need a url to the docs?
>
I actually stumbled across the csv module after coding enough to make a
list of lists. So that is more the reason I approached the list;
nothing like spending hours (or days) coding something that already
exists and you just don't know about.
> Does that help?
It does, immensely. I'll make this work. Thank you again for the link
from yesterday and apologies for hitting the wrong reply button. I'll
have to study more on the usage and implementations of dictionaries and
tuples.
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2015-05-14 12:17 -0500 |
| Message-ID | <mailman.9.1431623856.17265.python-list@python.org> |
| In reply to | #90598 |
In processing the initial CSV file, I suspect that using a
csv.DictReader would make the code a bit cleaner. Additionally,
as you're processing through the initial file, unless you need
the intermediate data, you should be able to do it in one pass.
Something like

import csv

HEADER_ACCOUNT = "account"
HEADER_STAFF = "staff"
HEADER_QTY = "Qty"

processed = {}
with open("data.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # should_process_row() is a placeholder for your own filtering
        if should_process_row(row):
            account = row[HEADER_ACCOUNT]
            staff = row[HEADER_STAFF]
            qty = row[HEADER_QTY]
            try:
                row[HEADER_QTY] = qty = int(qty)
            except Exception:
                # not a numeric quantity?
                continue
            # from Steven's code
            key = (account, staff)
            if key in processed:
                processed[key][HEADER_QTY] += qty
            else:
                # first time we see this key: store the whole row
                processed[key] = row
# do_something_with() is a placeholder for whatever comes next
do_something_with(processed.values())

I find that using names is a lot clearer than using arbitrary
indexing. Barring that, using indexes-as-constants still would
add further clarity.
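A runnable variation on the one-pass idea, offered as a sketch only: the sample data, the header names, and the function name summarize are assumptions, the row filter is left out, and io.StringIO stands in for the real file.

```python
import csv
import io

# Hypothetical data; the header names are assumptions.
SAMPLE = """account,staff,Item,Qty
123,XXX,Widget,2
72976,YYY,Gadget,5
123,XXX,WidgetTypo,3
123,ZZZ,Widget,n/a
"""


def summarize(fileobj):
    """Merge rows by (account, staff) in one pass, summing Qty."""
    processed = {}
    for row in csv.DictReader(fileobj):
        try:
            qty = int(row["Qty"])
        except ValueError:
            continue  # skip rows without a numeric quantity
        key = (row["account"], row["staff"])
        if key in processed:
            processed[key]["Qty"] += qty
        else:
            row["Qty"] = qty  # store the converted number, not the text
            processed[key] = row
    return processed


result = summarize(io.StringIO(SAMPLE))
for (account, staff), row in result.items():
    print(account, staff, row["Qty"])
```

Each value in the result is the first row seen for that key, so fields such as Item keep their first spelling while Qty holds the running total.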
-tkc
| From | Ziqi Xiong <xiongziqi84@gmail.com> |
|---|---|
| Date | 2015-05-15 03:31 +0000 |
| Message-ID | <mailman.29.1431674927.17265.python-list@python.org> |
| In reply to | #90598 |
Maybe we can change this list to a dict, using item[0] and item[1] as the key
and the whole item as the value. Then you can update by the same key, I think.
| From | darnold <darnold992000@yahoo.com> |
|---|---|
| Date | 2015-05-20 05:50 -0700 |
| Message-ID | <9abf87a2-a98b-470b-9f94-a76d4ef1b34e@googlegroups.com> |
| In reply to | #90589 |
I recommend getting your hands on "Automate The Boring Stuff With Python" from No Starch Press:

http://www.nostarch.com/automatestuff

I've not read it in its entirety, but it's very beginner-friendly and is targeted at just the sort of processing you appear to be doing.

HTH,
Don
| From | 20/20 Lab <lab@pacbell.net> |
|---|---|
| Date | 2015-05-20 14:18 -0700 |
| Message-ID | <mailman.193.1432193421.17265.python-list@python.org> |
| In reply to | #90952 |
You're the second to recommend this to me. I ended up picking it up last week, so I need to sit down with it. I was able to get a working project. However, I don't fully grasp the details of how it works, so the book will help, I'm sure. Thank you.

On 05/20/2015 05:50 AM, darnold via Python-list wrote:
> I recommend getting your hands on "Automate The Boring Stuff With Python" from No Starch Press:
>
> http://www.nostarch.com/automatestuff
>
> I've not read it in its entirety, but it's very beginner-friendly and is targeted at just the sort of processing you appear to be doing.
>
> HTH,
> Don