Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Tue, 23 Apr 2013 09:30:05 -0500
From: Tim Chase <python.list@tim.thechases.com>
To: Neil Cerutti <neilc@norwich.edu>
Subject: Re: There must be a better way
In-Reply-To: <atnh2jFgv8iU1@mid.individual.net>
References: <kkv9bt$9bm$1@theodyn.ncf.ca> <51732d81$0$29977$c3e8da3$5496439d@news.astraweb.com> <20130420193422.25255e98@bigbox.christie.dr> <mailman.869.1366506610.3114.python-list@python.org> <kl0opb$pcr$1@theodyn.ncf.ca> <atl0i2Fto6uU2@mid.individual.net> <kl3stb$5ck$1@theodyn.ncf.ca> <atnh2jFgv8iU1@mid.individual.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.973.1366727321.3114.python-list@python.org>
Lines: 58
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:44181

On 2013-04-23 13:36, Neil Cerutti wrote:
> On 2013-04-22, Colin J. Williams <cjw@ncf.ca> wrote:
> > Since I'm only interested in one or two columns, the simpler
> > approach is probably better.
> 
> Here's a sketch of how one of my projects handles that situation.
> I think the index variables are invaluable documentation, and
> make it a bit more robust. (Python 3, so not every bit is
> relevant to you).
> 
> with open("today.csv", encoding='UTF-8', newline='') as today_file:
>     reader = csv.reader(today_file)
>     header = next(reader)
>     majr_index = header.index('MAJR')
>     div_index = header.index('DIV')
>     for rec in reader:
>         major = rec[majr_index]
>         rec[div_index] = DIVISION_TABLE[major]
> 
> But a csv.DictReader might still be more efficient. I never
> tested. This is the only place I've used this "optimization".
> It's fast enough. ;)

I believe the csv module does all the work at the C level, rather than
in pure Python, so it should be notably faster.  The only times I've
had to do things by hand like that are when there are header
peculiarities that I can't control, such as mismatched case or
added/removed punctuation (client files are notorious for this).  So I
often end up doing something like

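  import csv

  def normalize(header):
    return header.strip().upper() # other cleanup as needed

  reader = csv.reader(f) # f is an already-open CSV file
  headers = next(reader)
  # map each normalized header name to its column index
  header_map = dict(
    (normalize(header), i)
    for i, header
    in enumerate(headers)
    )
  # "row" is looked up when item() is called, so this
  # picks up the current row inside the loop below
  item = lambda col: row[header_map[col]].strip()
  for row in reader:
    major = item("MAJR").upper()
    division = item("DIV")
    # ...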

The function calls might add overhead, in which case one could
just use explicit indirect indexing for each value assignment:

  major = row[header_map["MAJR"]].strip().upper()

but I usually find that processing CSV files leaves me I/O bound
rather than CPU bound.
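
For anyone who wants to try this without a client file handy, here is
a rough, self-contained sketch of the same idea.  The sample data, the
extra punctuation stripping in normalize(), and the helper names
build_header_map() and extract() are my own assumptions for
illustration, not anything from a real project.  It looks the indexes
up once, outside the loop, so no function call happens per field:

```python
import csv
import io

def normalize(header):
    # Assumed cleanup: trim whitespace, drop stray leading/trailing
    # punctuation, and upper-case the name.
    return header.strip().strip(".:#").upper()

def build_header_map(headers):
    # Map each cleaned-up header name to its column index.
    return {normalize(h): i for i, h in enumerate(headers)}

# Hypothetical sample data with inconsistent header case and spacing.
SAMPLE = " majr ,Div.,Name\nmath , A1,Alice\nCHEM,B2 ,Bob\n"

def extract(f):
    reader = csv.reader(f)
    header_map = build_header_map(next(reader))
    # Resolve the column positions once, before the loop.
    majr_index = header_map["MAJR"]
    div_index = header_map["DIV"]
    return [
        (row[majr_index].strip().upper(), row[div_index].strip())
        for row in reader
    ]

results = extract(io.StringIO(SAMPLE, newline=""))
```

With a real file, one would pass the object returned by open() to
extract() in place of the StringIO.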

-tkc