Path: csiph.com!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder2.enfer-du-nord.net!tudelft.nl!txtfeed1.tudelft.nl!multikabel.net!newsfeed20.multikabel.net!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <67283b47-9d44-403e-a3df-73ade83a2c0e@z3g2000pbn.googlegroups.com>
References: <mailman.1023.1332802612.3037.python-list@python.org> <67283b47-9d44-403e-a3df-73ade83a2c0e@z3g2000pbn.googlegroups.com>
From: Abhishek Pratap <abhishek.vit@gmail.com>
Date: Mon, 26 Mar 2012 23:08:08 -0700
Subject: Re: concurrent file reading/writing using python
To: Steve Howell <showell30@yahoo.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1030.1332828511.3037.python-list@python.org>
Lines: 82
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:22233

Thanks for the advice Dennis.

@Steve: I haven't actually written the code. I was thinking more on
the generic side and wanted to check whether what I thought made sense,
and I now realize it can depend on the I/O. For starters, I was just
thinking about counting lines in a file without doing any computation,
so this can be strictly I/O bound.

I guess what I needed to ask was whether we can improve on the existing
disk I/O performance by reading different portions of the file using
threads or processes. I am kind of pointing towards a MapReduce task on
a file in a shared file system such as GPFS (from IBM). I realize this
may be better suited to HDFS, but I wanted to know if people have
implemented something similar on a normal Linux-based NFS.
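
A rough sketch of the line-counting experiment, assuming the file can
be opened several times and split into byte ranges. The names
count_newlines, parallel_line_count and BLOCK_SIZE are made up for
illustration, and whether this beats one sequential read depends
entirely on the storage underneath:

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024 * 1024  # read 1 MiB at a time; an arbitrary choice


def count_newlines(path, start, end):
    """Count newline bytes in the byte range [start, end) of the file."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(BLOCK_SIZE, remaining))
            if not block:  # file is shorter than expected
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count


def parallel_line_count(path, workers=4):
    """Split the file into byte ranges and sum the per-range counts."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    step = -(-size // workers)  # ceiling division
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(count_newlines, path, start, min(start + step, size))
            for start in range(0, size, step)
        ]
        return sum(f.result() for f in futures)
```

Because every byte falls in exactly one range, counting newlines needs
no special handling at chunk boundaries. Real record processing would
have to deal with lines that straddle two ranges.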

-Abhi


On Mon, Mar 26, 2012 at 6:44 PM, Steve Howell <showell30@yahoo.com> wrote:
> On Mar 26, 3:56 pm, Abhishek Pratap <abhishek....@gmail.com> wrote:
>> Hi Guys
>>
>> I am fwding this question from the python tutor list in the hope of
>> reaching more people experienced in concurrent disk access in python.
>>
>> I am trying to see if there are ways in which I can read a big file
>> concurrently on a multi core server and process data and write the
>> output to a single file as the data is processed.
>>
>> For example, if I have a 50 GB file, I would like to read it in parallel
>> with 10 processes/threads, each working on 5 GB of data and performing
>> the same data-parallel computation on its chunk of the file, collating
>> the output to a single file.
>>
>> I will appreciate your feedback. I did find some threads about this on
>> Stack Overflow but it was not clear to me what would be a good way to
>> go about implementing this.
>>
>
> Have you written a single-core solution to your problem? If so, can
> you post the code here?
>
> If CPU isn't your primary bottleneck, then you need to be careful not
> to overly complicate your solution by getting multiple cores
> involved. All the coordination might make your program slower and
> more buggy.
>
> If CPU is the primary bottleneck, then you might want to consider an
> approach where you only have a single thread that's reading records
> from the file, 10 at a time, and then dispatching out the calculations
> to different threads, then writing results back to disk.
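
A minimal sketch of that single-reader idea, assuming line-oriented
records. compute() is a hypothetical stand-in for the real
calculation, and batches(), process_batch() and process_file() are
names invented here:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def compute(record):
    """Hypothetical stand-in for the real per-record computation."""
    return record.strip().upper()


def batches(lines, size=10):
    """Yield successive lists of up to `size` items from an iterable."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Run the computation on every record in one batch."""
    return [compute(record) for record in batch]


def process_file(in_path, out_path, workers=4, batch_size=10):
    """One reader feeds batches to worker threads; one writer collates."""
    with open(in_path) as src, open(out_path, "w") as dst, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the output
        # stays collated without extra bookkeeping.
        for results in pool.map(process_batch, batches(src, batch_size)):
            for result in results:
                dst.write(result + "\n")
```

Note that Executor.map submits every batch up front, so this suits a
small sample rather than the full file. Also, in CPython, threads will
not speed up pure-Python CPU-bound work because of the global
interpreter lock; ProcessPoolExecutor from the same module is the usual
thing to try in that case.
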
>
> My approach would be something like this:
>
> 1) Take a small sample of your dataset so that you can process it
> within 10 seconds or so using a simple, single-core program.
> 2) Figure out whether you're CPU bound. A simple way to do this is
> to comment out the actual computation or replace it with a trivial
> stub. If you're CPU bound, the program will run much faster. If
> you're IO-bound, the program won't run much faster (since all the work
> is actually just reading from disk).
> 3) Figure out how to read 10 records at a time and farm out the
> records to threads. Hopefully, your program will take significantly
> less time. At this point, don't obsess over collating data. It might
> not be 10 times as fast, but it should be somewhat faster to be worth
> your while.
> 4) If the threaded approach shows promise, make sure that you can
> still generate correct output with that approach (in other words,
> figure out synchronization and collating).
>
> At the end of that experiment, you should have a better feel for where
> to go next.
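
The CPU-bound versus I/O-bound check from step 2 could be sketched
roughly like this. real_work() is a hypothetical placeholder for the
actual computation, and timed_pass() and compute_share() are names
invented for the example:

```python
import time


def real_work(line):
    """Hypothetical placeholder for the actual computation."""
    return sum(ord(ch) for ch in line)


def stub_work(line):
    """Trivial stub: does nothing, so only the read cost remains."""
    return 0


def timed_pass(path, work):
    """Read every line of `path`, apply `work`, return elapsed seconds."""
    start = time.perf_counter()
    with open(path) as f:
        for line in f:
            work(line)
    return time.perf_counter() - start


def compute_share(path):
    """Estimate the fraction of run time spent in the computation."""
    total = timed_pass(path, real_work)
    io_only = timed_pass(path, stub_work)
    if total <= 0:
        return 0.0
    return max(0.0, 1.0 - io_only / total)
```

A share near 1 suggests the run is CPU bound; a share near 0 suggests
it is dominated by reading. The first pass warms the OS page cache, so
the second pass can look cheaper than a cold read would be; on a sample
that fits in memory it helps to read the file once before timing.
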
>
> What is the nature of your computation? Maybe it would be easier to
> tune the algorithm than figure out the multi-core optimization.