Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #57487

Processing large CSV files - how to maximise throughput?

Newsgroups comp.lang.python
Date 2013-10-24 18:38 -0700
Message-ID <b4737555-cb4f-457b-aed7-a1e6553fe6a5@googlegroups.com> (permalink)
Subject Processing large CSV files - how to maximise throughput?
From Victor Hooi <victorhooi@gmail.com>

Show all headers | View raw


Hi,

We have a directory of large CSV files that we'd like to process in Python.

We process each input CSV, then generate a corresponding output CSV file.

input CSV -> munging text, lookups etc. -> output CSV

My question is, what's the most Pythonic way of handling this? (Which I'm assuming 

For the reading, I'd

    with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
        csv_writer = DictWriter(output)
        for line in DictReader(input):
            # Do some processing for that line...
            output = process_line(line)
            # Write output to file
            csv_writer.writerow(output)
            
So for the reading, it'll iterates over the lines one by one, and won't read it into memory which is good.

For the writing - my understanding is that it writes a line to the file object each loop iteration, however, this will only get flushed to disk every now and then, based on my system default buffer size, right?

So if the output file is going to get large, there isn't anything I need to take into account for conserving memory?

Also, if I'm trying to maximise throughput of the above, is there anything I could try? The processing in process_line is quite line - just a bunch of string splits and regexes.

If I have multiple large CSV files to deal with, and I'm on a multi-core machine, is there anything else I can do to boost throughput?

Cheers,
Victor

Back to comp.lang.python | Previous | NextNext in thread | Find similar | Unroll thread


Thread

Processing large CSV files - how to maximise throughput? Victor Hooi <victorhooi@gmail.com> - 2013-10-24 18:38 -0700
  Re: Processing large CSV files - how to maximise throughput? Dave Angel <davea@davea.name> - 2013-10-25 02:10 +0000
    Re: Processing large CSV files - how to maximise throughput? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-25 03:35 +0000
      Re: Processing large CSV files - how to maximise throughput? Dave Angel <davea@davea.name> - 2013-10-25 03:57 +0000
      Re: Processing large CSV files - how to maximise throughput? Chris Angelico <rosuav@gmail.com> - 2013-10-25 17:13 +1100
      Re: Processing large CSV files - how to maximise throughput? Stefan Behnel <stefan_ml@behnel.de> - 2013-10-25 08:39 +0200
      Re: Processing large CSV files - how to maximise throughput? Chris Angelico <rosuav@gmail.com> - 2013-10-25 18:26 +1100
      Re: Processing large CSV files - how to maximise throughput? Dave Angel <davea@davea.name> - 2013-10-25 11:24 +0000
      Re: Processing large CSV files - how to maximise throughput? Chris Angelico <rosuav@gmail.com> - 2013-10-25 22:42 +1100
  Re: Processing large CSV files - how to maximise throughput? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-25 03:19 +0000
  Re: Processing large CSV files - how to maximise throughput? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-25 04:46 +0100
  Re: Processing large CSV files - how to maximise throughput? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-10-25 19:44 -0400
    Re: Processing large CSV files - how to maximise throughput? Roy Smith <roy@panix.com> - 2013-10-25 20:22 -0400
  Re: Processing large CSV files - how to maximise throughput? Walter Hurry <walterhurry@lavabit.com> - 2013-10-26 08:53 +0000

csiph-web