To: python-list@python.org
From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Subject: Re: Processing large CSV files - how to maximise throughput?
Date: Fri, 25 Oct 2013 19:44:43 -0400
Organization: IISS Elusive Unicorn
References: <b4737555-cb4f-457b-aed7-a1e6553fe6a5@googlegroups.com>
Newsgroups: comp.lang.python
Message-ID: <mailman.1560.1382744694.18130.python-list@python.org>

On Thu, 24 Oct 2013 18:38:21 -0700 (PDT), Victor Hooi
<victorhooi@gmail.com> declaimed the following:

>
>For the reading, I'd
>
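>    from csv import DictReader, DictWriter
>
>    with open('input.csv', 'r') as infile, open('output.csv', 'w') as outfile:
>        reader = DictReader(infile)
>        # DictWriter needs the column names; this assumes process_line keeps them
>        csv_writer = DictWriter(outfile, fieldnames=reader.fieldnames)
>        for line in reader:
>            # Do some processing for that line...
>            result = process_line(line)
>            # Write output to file (not rebinding the file handle's name)
>            csv_writer.writerow(result)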
>            
>So for the reading, it'll iterate over the lines one by one and won't read the whole file into memory, which is good.
>
	My first comment would be: do you really need the overhead of the
/dictionary/ reader/writer mode? Surely this processing isn't taking in
ad-hoc CSV files, is it? If all the inputs have the same column layout, the
regular CSV reader may be faster (same for the writer), and your processing
would use direct positional indexing rather than dictionary hashing/lookups.
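
	As a rough sketch of the positional approach, assuming every input has
the same columns and a header row: process_row and convert are made-up
names, and the upper-casing of column 1 merely stands in for whatever the
real processing does.

```python
import csv


def process_row(row):
    # Hypothetical processing: upper-case the second column, found by
    # position rather than by a dictionary key.
    row[1] = row[1].upper()
    return row


def convert(infile, outfile):
    """Copy CSV rows from infile to outfile, processing each one."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    # Assumes the first row is a header; pass it through unchanged.
    writer.writerow(next(reader))
    count = 0
    for row in reader:
        writer.writerow(process_row(row))
        count += 1
    return count
```

	With real files, open both with newline='' as the csv module
documentation recommends, e.g. open('input.csv', newline='').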

>So if the output file is going to get large, there isn't anything I need to take into account for conserving memory?
>
	Memory is cheap -- I/O is slow. <G> Just how massive are these CSV
files?

>Also, if I'm trying to maximise throughput of the above, is there anything I could try? The processing in process_line is quite light - just a bunch of string splits and regexes.
>
>If I have multiple large CSV files to deal with, and I'm on a multi-core machine, is there anything else I can do to boost throughput?

	You are likely I/O bound, not CPU bound -- so multi-core doesn't really
affect things (neither would the GIL).

	You could maybe try using three Threads and a pair of Queues: reader,
process, writer. Limit the Queues to maybe 50-100 entries (tweak to liking
-- I run a task where using 100 entries results in a long "load" phase
before processing begins).

	Reader thread just reads entries from the input file and adds them to
the input queue. Processing thread takes entries from the input queue,
mashes them, and puts them onto the output queue. Writer thread, obviously,
takes items from the output queue and writes them to the file.

	Ideally, what should happen is this: after a few seconds of the reader
monopolizing the CPU, there will be a backlog of records in the input
queue, and the reader will block on "queue full", letting the processing
thread crunch some entries. At some quantum the reader gets control again,
finds the queue is no longer full, initiates the next read, and blocks until
the I/O completes; the processing thread then gets control and continues
number crunching. Hopefully, the reader completes the I/O and adds the next
entry before the queue goes empty. The same applies to the output queue.

	The main idea is that you don't block crunching while waiting for the
next record to be read or written. Crunching only blocks if the input queue
runs empty or the output queue fills up -- conditions in which the reader or
writer then gets control.
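
	A minimal sketch of the reader/process/writer arrangement described
above, written for Python 3 (where the module is named queue). The names
run_pipeline, _DONE and the three thread functions are invented for
illustration, and error handling is left out: if process_row raises, the
other threads would wait forever.

```python
import csv
import queue
import threading

_DONE = object()  # sentinel passed down the pipeline to signal end of input


def reader_thread(infile, in_q):
    for row in csv.reader(infile):
        in_q.put(row)  # blocks while the input queue is full
    in_q.put(_DONE)


def process_thread(in_q, out_q, process_row):
    while True:
        row = in_q.get()  # blocks while the input queue is empty
        if row is _DONE:
            out_q.put(_DONE)
            break
        out_q.put(process_row(row))  # blocks while the output queue is full


def writer_thread(out_q, outfile):
    writer = csv.writer(outfile)
    while True:
        row = out_q.get()  # blocks while the output queue is empty
        if row is _DONE:
            break
        writer.writerow(row)


def run_pipeline(infile, outfile, process_row, maxsize=100):
    # Bounded queues keep the reader from running far ahead of the cruncher.
    in_q = queue.Queue(maxsize=maxsize)
    out_q = queue.Queue(maxsize=maxsize)
    threads = [
        threading.Thread(target=reader_thread, args=(infile, in_q)),
        threading.Thread(target=process_thread,
                         args=(in_q, out_q, process_row)),
        threading.Thread(target=writer_thread, args=(out_q, outfile)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

	Rows come out in input order because each stage is a single thread
feeding a FIFO queue. Whether this beats the plain loop depends on how much
time is really spent waiting on I/O, so time both before keeping it.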

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/