From: Dave Angel
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: Processing large CSV files - how to maximise throughput?
Date: Fri, 25 Oct 2013 02:10:07 +0000 (UTC)

On 24/10/2013 21:38, Victor Hooi wrote:
> Hi,
>
> We have a directory of large CSV files that we'd like to process in Python.
>
> We process each input CSV, then generate a corresponding output CSV file.
>
> input CSV -> munging text, lookups etc. -> output CSV
>
> My question is, what's the most Pythonic way of handling this? (Which I'm assuming
>
> For the reading, I'd
>
>     from csv import DictReader, DictWriter
>
>     with open('input.csv', 'r') as infile, open('output.csv', 'w') as outfile:
>         reader = DictReader(infile)
>         # DictWriter needs the column names; this assumes the output keeps the input columns
>         csv_writer = DictWriter(outfile, fieldnames=reader.fieldnames)
>         csv_writer.writeheader()
>         for line in reader:
>             # Do some processing for that line...
>             result = process_line(line)
>             # Write the processed row to the output file
>             csv_writer.writerow(result)
>
> So for the reading, it'll iterate over the lines one by one, and won't read the whole file into memory, which is good.
>
> For the writing - my understanding is that it writes a line to the file object each loop iteration, however, this will only get flushed to disk every now and then, based on my system default buffer size, right?
>
> So if the output file is going to get large, there isn't anything I need to take into account for conserving memory?

No, the system will flush the buffer so often that you'll never use much memory, however large the output file gets.

> Also, if I'm trying to maximise throughput of the above, is there anything I could try? The processing in process_line is quite light - just a bunch of string splits and regexes.

If you want help optimizing process_line(), you'd have to show us the source. For the regex, you can precompile it so it doesn't have to be built each time. Or just write the equivalent code with plain string methods, which is often faster than a regex.

> If I have multiple large CSV files to deal with, and I'm on a multi-core machine, is there anything else I can do to boost throughput?

Start multiple processes. For what you're doing, there's probably no point in multithreading.

And as always in performance tuning, you never know till you measure.

-- 
DaveA
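
A minimal sketch of the two suggestions above (precompile the regex, and run one process per input file), assuming Python 3. The names WHITESPACE_RE, process_file() and process_directory() are hypothetical, and this process_line() is only a stand-in that collapses whitespace, since the real one was never posted. It also assumes every input file has a header row, every row has the same number of fields as the header, and the output keeps the input columns.

```python
import csv
import re
import sys
from multiprocessing import Pool
from pathlib import Path

# Compiled once at import time rather than on every row.
# Hypothetical pattern: matches runs of whitespace.
WHITESPACE_RE = re.compile(r"\s+")


def process_line(row):
    """Stand-in for the real munging: normalise whitespace in each field."""
    return {
        key: WHITESPACE_RE.sub(" ", value or "").strip()
        for key, value in row.items()
    }


def process_file(paths):
    """Stream one input CSV to one output CSV, a row at a time."""
    in_path, out_path = paths
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(process_line(row))
    return out_path


def process_directory(in_dir, out_dir, workers=None):
    """Hand each *.csv file in in_dir to a pool of worker processes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    jobs = [
        (str(path), str(out_dir / path.name))
        for path in sorted(Path(in_dir).glob("*.csv"))
    ]
    # workers=None lets Pool default to the number of CPUs.
    with Pool(processes=workers) as pool:
        return pool.map(process_file, jobs)


if __name__ == "__main__":
    # Usage: python script.py INPUT_DIR OUTPUT_DIR
    if len(sys.argv) == 3:
        process_directory(sys.argv[1], sys.argv[2])
```

Splitting the work per file keeps each worker independent, so nothing has to be shared between processes. Whether it actually helps depends on whether the job is CPU-bound or disk-bound, so time it with one worker and with several before settling on a number.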