
Looking for direction

Started by	20/20 Lab <lab@pacbell.net>
First post	2015-05-13 16:24 -0700
Last post	2015-05-20 14:18 -0700
Articles	7 — 5 participants


  Looking for direction 20/20 Lab <lab@pacbell.net> - 2015-05-13 16:24 -0700
    Re: Looking for direction Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-05-14 11:23 +1000
      Re: Looking for direction 20/20 Lab <lab@pacbell.net> - 2015-05-14 09:57 -0700
      Re: Looking for direction Tim Chase <python.list@tim.thechases.com> - 2015-05-14 12:17 -0500
      Re: Looking for direction Ziqi Xiong <xiongziqi84@gmail.com> - 2015-05-15 03:31 +0000
    Re: Looking for direction darnold <darnold992000@yahoo.com> - 2015-05-20 05:50 -0700
      Re: Looking for direction 20/20 Lab <lab@pacbell.net> - 2015-05-20 14:18 -0700

#90589 — Looking for direction

From	20/20 Lab <lab@pacbell.net>
Date	2015-05-13 16:24 -0700
Subject	Looking for direction
Message-ID	<mailman.465.1431559626.12865.python-list@python.org>

I'm a beginner to Python.  I've been reading here and there, and have
written a couple of short and simple programs to make life easier around
the office.

That being said, I'm not even sure what I need to ask for. I've never 
worked with external data before.

I have a LARGE csv file that I need to process.  110+ columns, 72k 
rows.  I managed to write enough to reduce it to a few hundred rows, and 
the five columns I'm interested in.

Now here is where I have my problem:

myList = [ [123, "XXX", "Item", "Qty", "Noise"],
            [72976, "YYY", "Item", "Qty", "Noise"],
            [123, "XXX", "ItemTypo", "Qty", "Noise"]    ]

Basically, I need to check for rows with duplicate accounts (row[0]) and 
staff (row[1]), and if so, remove that row, and add its Qty to the 
original row. I really don't have a clue how to go about this.  The 
number of rows changes based on which run it is, so I couldn't even get 
away with using hundreds of compare loops.

If someone could point me to some documentation on the functions I would 
need, or a tutorial it would be a great help.

Thank you.


#90598

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-05-14 11:23 +1000
Message-ID	<5553f8fe$0$13012$c3e8da3$5496439d@news.astraweb.com>
In reply to	#90589

On Thu, 14 May 2015 09:24 am, 20/20 Lab wrote:

> I'm a beginner to python.  Reading here and there.  Written a couple of
> short and simple programs to make life easier around the office.
> 
> That being said, I'm not even sure what I need to ask for. I've never
> worked with external data before.
> 
> I have a LARGE csv file that I need to process.  110+ columns, 72k
> rows.  I managed to write enough to reduce it to a few hundred rows, and
> the five columns I'm interested in.

That's not large. Large is millions of rows, or tens of millions if you have
enough memory. What's large to you and me is usually small to the computer.

You should use the csv module for handling the CSV file, if you aren't
already doing so. Do you need a url to the docs?
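
For reference, a minimal sketch of reading only the wanted columns with the csv module might look like the following. The column names and sample data are invented for illustration (the real header names are not given in this thread), and an in-memory string stands in for the actual file; with a real file you would pass the object returned by open() instead.

```python
import csv
import io

# Hypothetical sample data; the real file has 110+ columns and unknown header names.
SAMPLE = """account,staff,item,qty,noise,extra
123,XXX,Widget,2,n/a,ignored
72976,YYY,Gadget,5,n/a,ignored
"""

# Assumed names of the five columns of interest.
WANTED = ["account", "staff", "item", "qty", "noise"]


def read_columns(fileobj, wanted):
    """Return a list of rows that contain only the wanted columns."""
    reader = csv.reader(fileobj)
    header = next(reader)  # the first line holds the column names
    indexes = [header.index(name) for name in wanted]
    return [[row[i] for i in indexes] for row in reader]


rows = read_columns(io.StringIO(SAMPLE), WANTED)
print(rows)
```

Note that csv.reader returns every field as a string, so quantities need converting with int() before they can be added up.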


> Now is were I have my problem:
> 
> myList = [ [123, "XXX", "Item", "Qty", "Noise"],
>             [72976, "YYY", "Item", "Qty", "Noise"],
>             [123, "XXX" "ItemTypo", "Qty", "Noise"]    ]
> 
> Basically, I need to check for rows with duplicate accounts row[0] and
> staff (row[1]), and if so, remove that row, and add it's Qty to the
> original row. I really dont have a clue how to go about this.

Is the order of the rows important? If not, the problem is simpler.


processed = {}  # hold the processed data in a dict

for row in myList:
    account, staff = row[0:2]
    key = (account, staff)  # Put them in a tuple.
    if key in processed:
        # We've already seen this combination.
        processed[key][3] += row[3]  # Add the quantities.
    else:
        # Never seen this combination before.
        processed[key] = row

newlist = list(processed.values())
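
A self-contained variant of the loop above, as a sketch under a few assumptions: the quantities are already numbers (the sample values are made up), each stored row is copied so the input list is left untouched, and the function name merge_duplicates is only illustrative. On Python 3.7 and later, dicts keep insertion order, so the result lists each account/staff pair in first-seen order; older versions give no such guarantee.

```python
def merge_duplicates(rows, qty_index=3):
    """Combine rows sharing (account, staff), summing their quantities."""
    processed = {}
    for row in rows:
        key = (row[0], row[1])  # (account, staff)
        if key in processed:
            # Seen before: add this row's quantity to the stored row.
            processed[key][qty_index] += row[qty_index]
        else:
            # First occurrence: store a copy so the input is not modified.
            processed[key] = list(row)
    return list(processed.values())


# Made-up sample data with numeric quantities in column 3.
sample = [
    [123, "XXX", "Item", 2, "Noise"],
    [72976, "YYY", "Item", 5, "Noise"],
    [123, "XXX", "ItemTypo", 3, "Noise"],
]
merged = merge_duplicates(sample)
print(merged)
```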


Does that help?



-- 
Steven


#90613

From	20/20 Lab <lab@pacbell.net>
Date	2015-05-14 09:57 -0700
Message-ID	<mailman.5.1431622814.17265.python-list@python.org>
In reply to	#90598


On 05/13/2015 06:23 PM, Steven D'Aprano wrote:
> On Thu, 14 May 2015 09:24 am, 20/20 Lab wrote:
>
>> I'm a beginner to python.  Reading here and there.  Written a couple of
>> short and simple programs to make life easier around the office.
>>
>> That being said, I'm not even sure what I need to ask for. I've never
>> worked with external data before.
>>
>> I have a LARGE csv file that I need to process.  110+ columns, 72k
>> rows.  I managed to write enough to reduce it to a few hundred rows, and
>> the five columns I'm interested in.
> That's not large. Large is millions of rows, or tens of millions if you have
> enough memory. What's large to you and me is usually small to the computer.
>
> You should use the csv module for handling the CSV file, if you aren't
> already doing so. Do you need a url to the docs?
>
I actually stumbled across the csv module after coding enough to make a 
list of lists.  So that is more the reason I approached the list:  
nothing like spending hours (or days) coding something that already 
exists and that you just don't know about.
>> Now is were I have my problem:
>>
>> myList = [ [123, "XXX", "Item", "Qty", "Noise"],
>>              [72976, "YYY", "Item", "Qty", "Noise"],
>>              [123, "XXX" "ItemTypo", "Qty", "Noise"]    ]
>>
>> Basically, I need to check for rows with duplicate accounts row[0] and
>> staff (row[1]), and if so, remove that row, and add it's Qty to the
>> original row. I really dont have a clue how to go about this.
> Is the order of the rows important? If not, the problem is simpler.
>
>
> processed = {}  # hold the processed data in a dict
>
> for row in myList:
>      account, staff = row[0:2]
>      key = (account, staff)  # Put them in a tuple.
>      if key in processed:
>          # We've already seen this combination.
>          processed[key][3] += row[3]  # Add the quantities.
>      else:
>          # Never seen this combination before.
>          processed[key] = row
>
> newlist = list(processed.values())
>
>
> Does that help?
>
>
>
It does, immensely.  I'll make this work.  Thank you again for the link 
from yesterday and apologies for hitting the wrong reply button.  I'll 
have to study more on the usage and implementations of dictionaries and 
tuples.


#90618

From	Tim Chase <python.list@tim.thechases.com>
Date	2015-05-14 12:17 -0500
Message-ID	<mailman.9.1431623856.17265.python-list@python.org>
In reply to	#90598

On 2015-05-14 09:57, 20/20 Lab wrote:
> On 05/13/2015 06:23 PM, Steven D'Aprano wrote:
>>> I have a LARGE csv file that I need to process.  110+ columns,
>>> 72k rows.  I managed to write enough to reduce it to a few
>>> hundred rows, and the five columns I'm interested in.
> I actually stumbled across the csv module after coding enough to
> make a list of lists.  So that is more the reason I approached the
> list; Nothing like spending hours (or days) coding something that
> already exists and just dont know about.
>>> Now is were I have my problem:
>>>
>>> myList = [ [123, "XXX", "Item", "Qty", "Noise"],
>>>              [72976, "YYY", "Item", "Qty", "Noise"],
>>>              [123, "XXX" "ItemTypo", "Qty", "Noise"]    ]
>>>
>>> Basically, I need to check for rows with duplicate accounts
>>> row[0] and staff (row[1]), and if so, remove that row, and add
>>> it's Qty to the original row. I really dont have a clue how to
>>> go about this.
>>
>> processed = {}  # hold the processed data in a dict
>>
>> for row in myList:
>>      account, staff = row[0:2]
>>      key = (account, staff)  # Put them in a tuple.
>>      if key in processed:
>>          # We've already seen this combination.
>>          processed[key][3] += row[3]  # Add the quantities.
>>      else:
>>          # Never seen this combination before.
>>          processed[key] = row
>>
>> newlist = list(processed.values())
>>
> It does, immensely.  I'll make this work.  Thank you again for the
> link from yesterday and apologies for hitting the wrong reply
> button.  I'll have to study more on the usage and implementations
> of dictionaries and tuples.

In processing the initial CSV file, I suspect that using a
csv.DictReader would make the code a bit cleaner.  Additionally,
as you're processing through the initial file, unless you need
the intermediate data, you should be able to do it in one pass.
Something like

  import csv

  HEADER_ACCOUNT = "account"
  HEADER_STAFF = "staff"
  HEADER_QTY = "Qty"

  processed = {}
  with open("data.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
      # should_process_row() is a placeholder for your own filter
      if should_process_row(row):
        account = row[HEADER_ACCOUNT]
        staff = row[HEADER_STAFF]
        qty = row[HEADER_QTY]
        try:
          row[HEADER_QTY] = qty = int(qty)
        except (TypeError, ValueError):
          # not a numeric quantity?
          continue
        # from Steven's code
        key = (account, staff)
        if key in processed:
          processed[key][HEADER_QTY] += qty
        else:
          # first time we see this key: store the whole row
          processed[key] = row
  # do_something_with() is a placeholder for whatever comes next
  do_something_with(processed.values())
          
I find that using names is a lot clearer than using arbitrary
indexing.  Barring that, using indexes-as-constants still would
add further clarity.
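
As a rough sketch of the indexes-as-constants idea for list-based rows: the constant names below are assumptions, chosen to match the five-column layout from the original post, and the quantity value is made up.

```python
# Column positions for the five-column rows; names are illustrative only.
ACCOUNT, STAFF, ITEM, QTY, NOISE = range(5)

row = [123, "XXX", "Item", 4, "Noise"]  # made-up example row

key = (row[ACCOUNT], row[STAFF])  # reads better than (row[0], row[1])
row[QTY] += 1                     # reads better than row[3] += 1
print(key, row[QTY])
```

If the column layout ever changes, only the line defining the constants needs updating.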

-tkc



#90654

From	Ziqi Xiong <xiongziqi84@gmail.com>
Date	2015-05-15 03:31 +0000
Message-ID	<mailman.29.1431674927.17265.python-list@python.org>
In reply to	#90598


Maybe we can change this list to a dict, using (item[0], item[1]) as the
key and the whole item as the value. Then you can update by the same key,
I think.

On Fri, May 15, 2015 at 01:17, Tim Chase <python.list@tim.thechases.com> wrote:

> On 2015-05-14 09:57, 20/20 Lab wrote:
> > On 05/13/2015 06:23 PM, Steven D'Aprano wrote:
> >>> I have a LARGE csv file that I need to process.  110+ columns,
> >>> 72k rows.  I managed to write enough to reduce it to a few
> >>> hundred rows, and the five columns I'm interested in.
> > I actually stumbled across the csv module after coding enough to
> > make a list of lists.  So that is more the reason I approached the
> > list; Nothing like spending hours (or days) coding something that
> > already exists and just dont know about.
> >>> Now is were I have my problem:
> >>>
> >>> myList = [ [123, "XXX", "Item", "Qty", "Noise"],
> >>>              [72976, "YYY", "Item", "Qty", "Noise"],
> >>>              [123, "XXX" "ItemTypo", "Qty", "Noise"]    ]
> >>>
> >>> Basically, I need to check for rows with duplicate accounts
> >>> row[0] and staff (row[1]), and if so, remove that row, and add
> >>> it's Qty to the original row. I really dont have a clue how to
> >>> go about this.
> >>
> >> processed = {}  # hold the processed data in a dict
> >>
> >> for row in myList:
> >>      account, staff = row[0:2]
> >>      key = (account, staff)  # Put them in a tuple.
> >>      if key in processed:
> >>          # We've already seen this combination.
> >>          processed[key][3] += row[3]  # Add the quantities.
> >>      else:
> >>          # Never seen this combination before.
> >>          processed[key] = row
> >>
> >> newlist = list(processed.values())
> >>
> > It does, immensely.  I'll make this work.  Thank you again for the
> > link from yesterday and apologies for hitting the wrong reply
> > button.  I'll have to study more on the usage and implementations
> > of dictionaries and tuples.
>
> In processing the initial CSV file, I suspect that using a
> csv.DictReader would make the code a bit cleaner.  Additionally,
> as you're processing through the initial file, unless you need
> the intermediate data, you should be able to do it in one pass.
> Something like
>
>   HEADER_ACCOUNT = "account"
>   HEADER_STAFF = "staff"
>   HEADER_QTY = "Qty"
>
>   processed = {}
>   with open("data.csv") as f:
>     reader = csv.DictReader(f)
>     for row in reader:
>       if should_process_row(row):
>         account = row[HEADER_ACCOUNT]
>         staff = row[HEADER_STAFF]
>         qty = row[HEADER_QTY]
>         try:
>           row[HEADER_QTY] = qty = int(qty)
>         except Exception:
>           # not a numeric quantity?
>           continue
>         # from Steven's code
>         key = (account, staff)
>         if key in processed:
>           processed[key][HEADER_QTY] += qty
>         else:
>           processed[key][HEADER_QTY] = row
>   so_something_with(processed.values())
>
> I find that using names is a lot clearer than using arbitrary
> indexing.  Barring that, using indexes-as-constants still would
> add further clarity.
>
> -tkc
>
>
>
>
> .
> --
> https://mail.python.org/mailman/listinfo/python-list
>


#90952

From	darnold <darnold992000@yahoo.com>
Date	2015-05-20 05:50 -0700
Message-ID	<9abf87a2-a98b-470b-9f94-a76d4ef1b34e@googlegroups.com>
In reply to	#90589

I recommend getting your hands on "Automate The Boring Stuff With Python" from No Starch Press:

http://www.nostarch.com/automatestuff

I've not read it in its entirety, but it's very beginner-friendly and is targeted at just the sort of processing you appear to be doing.

HTH,
Don


#90999

From	20/20 Lab <lab@pacbell.net>
Date	2015-05-20 14:18 -0700
Message-ID	<mailman.193.1432193421.17265.python-list@python.org>
In reply to	#90952

You're the second to recommend this to me.  I ended up picking it up last 
week, so I need to sit down with it.  I was able to get a working 
project.  However, I don't fully grasp the details of how it works, so 
the book will help, I'm sure.

Thank you.

On 05/20/2015 05:50 AM, darnold via Python-list wrote:
> I recommend getting your hands on "Automate The Boring Stuff With Python" from no starch press:
>
> http://www.nostarch.com/automatestuff
>
> I've not read it in its entirety, but it's very beginner-friendly and is targeted at just the sort of processing you appear to be doing.
>
> HTH,
> Don

