From: Dennis Lee Bieber
Newsgroups: comp.lang.python
Subject: Re: Processing large CSV files - how to maximise throughput?
Date: Fri, 25 Oct 2013 19:44:43 -0400

On Thu, 24 Oct 2013 18:38:21 -0700 (PDT), Victor Hooi declaimed the following:

>
>For the reading, I'd
>
>    with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
>        csv_writer = DictWriter(output)
>        for line in DictReader(input):
>            # Do some processing for that line...
>            output = process_line(line)
>            # Write output to file
>            csv_writer.writerow(output)
>
>So for the reading, it'll iterates over the lines one by one, and won't read it into memory which is good.
>

My first comment would be: Do you really need the overhead of using /dictionary/ reader/writer mode? Surely this processing isn't taking in ad-hoc CSV files, is it?
If all the inputs look the same, using the regular CSV reader may be faster (same for the writer), and your processing would be using direct positional indexing rather than dictionary hashing/lookups.

>So if the output file is going to get large, there isn't anything I need to take into account for conserving memory?
>

Memory is cheap -- I/O is slow. Just how massive are these CSV files?

>Also, if I'm trying to maximise throughput of the above, is there anything I could try? The processing in process_line is quite line - just a bunch of string splits and regexes.
>
>If I have multiple large CSV files to deal with, and I'm on a multi-core machine, is there anything else I can do to boost throughput?

You are likely I/O bound, not CPU bound -- so multi-core doesn't really affect things (neither would the GIL).

You could maybe try using three Threads and a pair of Queues: reader, process, writer. Limit the Queues to maybe 50-100 entries (tweak to liking -- I run a task where using 100 entries results in a long "load" phase before processing begins).

The reader thread just reads entries from the input file and adds them to the input queue. The processing thread takes entries from the input queue, mashes them, and puts them onto the output queue. The writer thread, obviously, takes items from the output queue and writes them to the file.

Ideally, what should happen is this: after a few seconds of the reader monopolizing the CPU, you will have a backlog of records in the input queue, and the reader will block on "queue full", letting the processing thread crunch some entries. At some quantum the reader will get control, find the queue is not full, initiate the next read, and block for the I/O to complete; the processing thread gets control again and continues number crunching. Hopefully, before the queue goes empty, the reader will complete the I/O and add the next entry to the queue. Same for the output queue.
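
A minimal sketch of the reader/process/writer arrangement described above, in case it helps to see it laid out. It assumes Python 3 (where the module is spelled `queue`; in Python 2 it is `Queue`) and uses the plain csv reader/writer with positional rows rather than the dictionary variants. The names `process_row`, `run_pipeline`, and `_DONE`, the default queue size, and the stand-in processing step are all made up for illustration; none of them come from the original code.

```python
import csv
import queue
import threading

# Hypothetical sentinel object used to signal "no more rows".
_DONE = object()


def process_row(row):
    # Stand-in for the real per-line work (string splits, regexes, ...).
    # Rows are plain lists, so fields are reached by position.
    return [field.strip().upper() for field in row]


def reader(path, in_q):
    with open(path, newline='') as src:
        for row in csv.reader(src):
            in_q.put(row)          # blocks while the input queue is full
    in_q.put(_DONE)


def worker(in_q, out_q):
    while True:
        row = in_q.get()           # blocks while the input queue is empty
        if row is _DONE:
            out_q.put(_DONE)       # pass the end marker downstream
            break
        out_q.put(process_row(row))  # blocks while the output queue is full


def writer(path, out_q):
    with open(path, 'w', newline='') as dst:
        out = csv.writer(dst)
        while True:
            row = out_q.get()
            if row is _DONE:
                break
            out.writerow(row)


def run_pipeline(in_path, out_path, maxsize=100):
    # Bounded queues keep the reader from racing far ahead of the processing.
    in_q = queue.Queue(maxsize)
    out_q = queue.Queue(maxsize)
    threads = [
        threading.Thread(target=reader, args=(in_path, in_q)),
        threading.Thread(target=worker, args=(in_q, out_q)),
        threading.Thread(target=writer, args=(out_path, out_q)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling something like `run_pipeline('input.csv', 'output.csv', maxsize=50)` would run all three stages to completion. The sketch has no error handling: if one thread raises, the others can wait forever on their queues, so a real version would need to deal with that. Whether it beats the single loop at all depends on the data and the disk, so time both.
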

The main idea is that you don't block crunching while waiting for the next record to be read or written. Crunching only blocks if the input runs empty or the output fills up -- conditions in which the reader or writer then gets control.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
	wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/