From: Abhishek Pratap
Date: Mon, 26 Mar 2012 23:08:08 -0700
Subject: Re: concurrent file reading/writing using python
To: Steve Howell
Cc: python-list@python.org
Newsgroups: comp.lang.python

Thanks for the advice, Dennis.

@Steve: I haven't actually written the code. I was thinking more on the
generic side and wanted to check whether what I had in mind made sense.
I now realize it can depend on the I/O.

For starters I was just thinking about counting lines in a file without
doing any computation, so this can be strictly I/O bound.

I guess what I needed to ask was: can we improve on the existing disk
I/O performance by reading different portions of the file using threads
or processes? I am kind of pointing towards a MapReduce task on a file
in a shared file system such as GPFS (from IBM). I realize this can be
more suited to HDFS, but I wanted to know if people have implemented
something similar on a normal Linux-based NFS.

-Abhi

On Mon, Mar 26, 2012 at 6:44 PM, Steve Howell wrote:
> On Mar 26, 3:56 pm, Abhishek Pratap wrote:
>> Hi Guys
>>
>> I am fwding this question from the python tutor list in the hope of
>> reaching more people experienced in concurrent disk access in python.
>>
>> I am trying to see if there are ways in which I can read a big file
>> concurrently on a multi core server and process data and write the
>> output to a single file as the data is processed.
>>
>> For example if I have a 50Gb file, I would like to read it in parallel
>> with 10 process/thread, each working on a 10Gb data and perform the
>> same data parallel computation on each chunk of file, collating the
>> output to a single file.
>>
>> I will appreciate your feedback. I did find some threads about this on
>> stackoverflow but it was not clear to me what would be a good way to
>> go about implementing this.
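
For illustration only, here is a rough and unbenchmarked sketch of the
byte-range idea from the question: split the file into equal byte ranges
and count newlines in each range concurrently. The names count_newlines
and parallel_line_count, the worker count, and the 1 MiB read size are
all assumptions made up for this sketch. Whether it beats one sequential
pass depends entirely on the storage underneath.

```python
import os
from concurrent.futures import ThreadPoolExecutor

READ_SIZE = 1 << 20  # assumed read size of 1 MiB per call


def count_newlines(args):
    """Count newline bytes in the byte range [start, end) of a file."""
    path, start, end = args
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(READ_SIZE, remaining))
            if not block:  # file shrank or hit EOF early
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count


def parallel_line_count(path, workers=4):
    """Split the file into about `workers` byte ranges and sum the counts."""
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))  # ceiling division, never zero
    ranges = [(path, s, min(s + step, size)) for s in range(0, size, step)]
    # Threads are used here because the work is mostly waiting on reads;
    # a process pool is the other option mentioned in the question.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_newlines, ranges))
```

Counting newline bytes works with arbitrary range boundaries because
every newline falls in exactly one range. Anything that needs whole
records would have to realign each range to the next line start first.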
>
> Have you written a single-core solution to your problem?  If so, can
> you post the code here?
>
> If CPU isn't your primary bottleneck, then you need to be careful not
> to overly complicate your solution by getting multiple cores
> involved.  All the coordination might make your program slower and
> more buggy.
>
> If CPU is the primary bottleneck, then you might want to consider an
> approach where you only have a single thread that's reading records
> from the file, 10 at a time, and then dispatching out the calculations
> to different threads, then writing results back to disk.
>
> My approach would be something like this:
>
>  1) Take a small sample of your dataset so that you can process it
> within 10 seconds or so using a simple, single-core program.
>  2) Figure out whether you're CPU bound.  A simple way to do this is
> to comment out the actual computation or replace it with a trivial
> stub.  If you're CPU bound, the program will run much faster.  If
> you're IO-bound, the program won't run much faster (since all the work
> is actually just reading from disk).
>  3) Figure out how to read 10 records at a time and farm out the
> records to threads.  Hopefully, your program will take significantly
> less time.  At this point, don't obsess over collating data.  It might
> not be 10 times as fast, but it should be somewhat faster to be worth
> your while.
>  4) If the threaded approach shows promise, make sure that you can
> still generate correct output with that approach (in other words,
> figure out synchronization and collating).
>
> At the end of that experiment, you should have a better feel for where
> to go next.
>
> What is the nature of your computation?  Maybe it would be easier to
> tune the algorithm than figure out the multi-core optimization.
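
Steps 2 and 3 above can be sketched roughly as follows. This is only a
sketch under stated assumptions: real_work stands in for the actual
per-record computation, which was never posted, and the helper names,
the batch size of 10, and the worker count are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def real_work(line):
    # Hypothetical placeholder for the actual per-record computation.
    return sum(ord(c) * ord(c) for c in line)


def stub_work(line):
    # Trivial stub (step 2): with this, only the reading cost remains.
    return None


def timed_pass(records, work):
    """Apply `work` to every record; return (elapsed_seconds, record_count)."""
    start = time.perf_counter()
    n = 0
    for record in records:
        work(record)
        n += 1
    return time.perf_counter() - start, n


def batches(records, size=10):
    """Yield lists of up to `size` records (step 3: 10 records at a time)."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def threaded_pass(records, work, workers=4, size=10):
    """Farm batches out to threads; results come back in input order."""
    def run(batch):
        return [work(r) for r in batch]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map submits every batch up front, so on a large file
        # this would need a bounded queue instead.
        return [r for out in pool.map(run, batches(records, size)) for r in out]
```

Timing timed_pass(f, real_work) against timed_pass(f, stub_work) on the
small sample gives the CPU-bound versus IO-bound answer. Run each pass
more than once, since the OS page cache can make a later pass look
faster. Note also that in CPython the GIL limits how much threads can
speed up pure-Python computation.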