Re: concurrent file reading/writing using python

Return-Path	<abhishek.vit@gmail.com>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
MIME-Version	1.0
In-Reply-To	<67283b47-9d44-403e-a3df-73ade83a2c0e@z3g2000pbn.googlegroups.com>
References	<mailman.1023.1332802612.3037.python-list@python.org> <67283b47-9d44-403e-a3df-73ade83a2c0e@z3g2000pbn.googlegroups.com>
From	Abhishek Pratap <abhishek.vit@gmail.com>
Date	Mon, 26 Mar 2012 23:08:08 -0700
Subject	Re: concurrent file reading/writing using python
To	Steve Howell <showell30@yahoo.com>
Content-Type	text/plain; charset=ISO-8859-1
Content-Transfer-Encoding	quoted-printable
Cc	python-list@python.org
Newsgroups	comp.lang.python
Message-ID	<mailman.1030.1332828511.3037.python-list@python.org> (permalink)
Lines	82
NNTP-Posting-Host	2001:888:2000:d::a6
X-Trace	1332828511 news.xs4all.nl 6947 [2001:888:2000:d::a6]:37522
X-Complaints-To	abuse@xs4all.nl
Xref	csiph.com comp.lang.python:22233


Thanks for the advice Dennis.

@Steve: I haven't actually written the code yet. I was thinking more on
the generic side and wanted to check whether what I had in mind made
sense. I now realize the answer can depend on the I/O. For starters I
was just thinking about counting lines in a file without doing any
computation, so the task would be strictly I/O bound.
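
For reference, a minimal single-threaded baseline for that line-counting
case could look like the sketch below. The names count_lines and
timed_count and the 1 MiB block size are my own placeholders, not
something from this thread; the idea is only to get a number to compare
any concurrent version against.

```python
import time


def count_lines(path, block_size=1024 * 1024):
    """Count newline bytes by reading the file in fixed-size binary blocks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)  # assumed block size: 1 MiB
            if not block:
                break
            count += block.count(b"\n")
    return count


def timed_count(path):
    """Return (line_count, elapsed_seconds) for one sequential pass."""
    start = time.perf_counter()
    n = count_lines(path)
    return n, time.perf_counter() - start
```

Running timed_count twice on the same file is worth doing, since the
second pass may be served from the OS page cache rather than from disk.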

I guess what I needed to ask was whether we can improve on the existing
disk I/O performance by reading different portions of the file using
threads or processes. I am kind of pointing towards a MapReduce-style
task on a file in a shared file system such as GPFS (from IBM). I
realize this may be better suited to HDFS, but I wanted to know if
people have implemented something similar on a normal Linux-based NFS.
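
To make the question concrete, here is a rough sketch of what I mean by
reading different portions of the file: split it into byte ranges and
count newlines in each range with a pool of threads. The function names
and the worker count are assumptions for illustration. I use threads
here because CPython releases the GIL during file reads; a
ProcessPoolExecutor could be swapped in, under an
`if __name__ == "__main__":` guard. Whether this beats a sequential
read at all will depend on the file system.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_newlines_in_range(args):
    """Count newline bytes in the half-open byte range [start, end) of path."""
    path, start, end = args
    count = 0
    remaining = end - start
    with open(path, "rb") as f:
        f.seek(start)
        while remaining > 0:
            block = f.read(min(1024 * 1024, remaining))
            if not block:
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count


def parallel_count_lines(path, workers=4):
    """Split the file into at most `workers` byte ranges and sum the counts."""
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))  # ceiling division
    ranges = [(path, s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_newlines_in_range, ranges))
```

Counting newline bytes is safe at arbitrary range boundaries. Processing
whole records would additionally require aligning each range to the next
newline, which this sketch does not do.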

-Abhi


On Mon, Mar 26, 2012 at 6:44 PM, Steve Howell <showell30@yahoo.com> wrote:
> On Mar 26, 3:56 pm, Abhishek Pratap <abhishek....@gmail.com> wrote:
>> Hi Guys
>>
>> I am forwarding this question from the Python tutor list in the hope
>> of reaching more people experienced in concurrent disk access in
>> Python.
>>
>> I am trying to see if there are ways in which I can read a big file
>> concurrently on a multi-core server, process the data, and write the
>> output to a single file as the data is processed.
>>
>> For example, if I have a 50Gb file, I would like to read it in parallel
>> with 10 processes/threads, each working on 10Gb of data, and perform
>> the same data-parallel computation on each chunk of the file,
>> collating the output to a single file.
>>
>> I will appreciate your feedback. I did find some threads about this on
>> Stack Overflow, but it was not clear to me what would be a good way to
>> go about implementing this.
>>
>
> Have you written a single-core solution to your problem?  If so, can
> you post the code here?
>
> If CPU isn't your primary bottleneck, then you need to be careful not
> to overly complicate your solution by getting multiple cores
> involved.  All the coordination might make your program slower and
> more buggy.
>
> If CPU is the primary bottleneck, then you might want to consider an
> approach where you only have a single thread that's reading records
> from the file, 10 at a time, and then dispatching out the calculations
> to different threads, then writing results back to disk.
>
> My approach would be something like this:
>
>  1) Take a small sample of your dataset so that you can process it
> within 10 seconds or so using a simple, single-core program.
>  2) Figure out whether you're CPU bound.  A simple way to do this is
> to comment out the actual computation or replace it with a trivial
> stub.  If you're CPU bound, the program will run much faster.  If
> you're IO-bound, the program won't run much faster (since all the work
> is actually just reading from disk).
>  3) Figure out how to read 10 records at a time and farm out the
> records to threads.  Hopefully, your program will take significantly
> less time.  At this point, don't obsess over collating data.  It might
> not be 10 times as fast, but it should be somewhat faster to be worth
> your while.
>  4) If the threaded approach shows promise, make sure that you can
> still generate correct output with that approach (in other words,
> figure out synchronization and collating).
>
> At the end of that experiment, you should have a better feel for where
> to go next.
>
> What is the nature of your computation?  Maybe it would be easier to
> tune the algorithm than figure out the multi-core optimization.
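
As a rough sketch of the approach quoted above (one reader handing out
batches of 10 records to worker threads, with results written back in
order), something like the following could be a starting point. Here
process_batch is a hypothetical stand-in for the real computation, and
max_pending is an assumed cap that keeps the whole file from being
queued in memory. Note that in CPython, threads will not speed up
pure-Python CPU-bound work because of the GIL, so a process pool may be
needed for that case.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(lines, size=10):
    """Yield lists of up to `size` consecutive lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Hypothetical placeholder computation: upper-case each record."""
    return [line.rstrip("\n").upper() + "\n" for line in batch]


def run(in_path, out_path, workers=4, batch_size=10, max_pending=8):
    """Single reader, pooled workers, results written in input order."""
    pending = deque()  # futures in submission (FIFO) order
    with open(in_path) as src, open(out_path, "w") as dst:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for batch in batches(src, batch_size):
                pending.append(pool.submit(process_batch, batch))
                if len(pending) >= max_pending:
                    # Wait for the oldest batch so output stays collated.
                    dst.writelines(pending.popleft().result())
            while pending:
                dst.writelines(pending.popleft().result())
```

Draining the oldest future first is what keeps the output in input
order; it trades some parallelism for not having to sort results later.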

Thread

concurrent file reading/writing using python Abhishek Pratap <abhishek.vit@gmail.com> - 2012-03-26 15:56 -0700
  Re: concurrent file reading/writing using python Steve Howell <showell30@yahoo.com> - 2012-03-26 18:44 -0700
    Re: concurrent file reading/writing using python Abhishek Pratap <abhishek.vit@gmail.com> - 2012-03-26 23:08 -0700
    Re: concurrent file reading/writing using python Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-03-27 09:17 -0400
