Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!feeder.news-service.com!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <roy-77E2CD.09190709092011@news.panix.com>
References: <mailman.885.1315522214.27778.python-list@python.org> <c6cbd486-7e5e-4d26-93b9-088d48a25dea@g9g2000yqb.googlegroups.com> <roy-77E2CD.09190709092011@news.panix.com>
From: Abhishek Pratap <abhishek.vit@gmail.com>
Date: Fri, 9 Sep 2011 10:07:41 -0700
Subject: Re: Processing a file using multithreads
To: Roy Smith <roy@panix.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.908.1315588083.27778.python-list@python.org>
Lines: 58
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:13030

Hi All

@Roy : split in unix sounds good but will it be as efficient as
opening 10 different file handles on a file.  I haven't tried it so
just wondering if you have any experience with it.

Thanks for your input. Also I was not aware of the python's GIL limitation.

My application is not I/O bound as far as I can understand it. Each
line is read and then processed independently of each other. May be
this might sound I/O intensive as #N files will be read but I think if
I have 10 processes running under a parent then it might not be a
bottle neck.

Best,
-Abhi


On Fri, Sep 9, 2011 at 6:19 AM, Roy Smith <roy@panix.com> wrote:
> In article
> <c6cbd486-7e5e-4d26-93b9-088d48a25dea@g9g2000yqb.googlegroups.com>,
> =A0aspineux <aspineux@gmail.com> wrote:
>
>> On Sep 9, 12:49=A0am, Abhishek Pratap <abhishek....@gmail.com> wrote:
>> > 1. My input file is 10 GB.
>> > 2. I want to open 10 file handles each handling 1 GB of the file
>> > 3. Each file handle is processed in by an individual thread using the
>> > same function ( so total 10 cores are assumed to be available on the
>> > machine)
>> > 4. There will be 10 different output files
>> > 5. once the 10 jobs are complete a reduce kind of function will
>> > combine the output.
>> >
>> > Could you give some ideas ?
>>
>> You can use "multiprocessing" module instead of thread to bypass the
>> GIL limitation.
>
> I agree with this.
>
>> First cut your file in 10 "equal" parts. If it is line based search
>> for the first line close to the cut. Be sure to have "start" and
>> "end" for each parts, start is the address of the first character of
>> the first line and end is one line too much (=3D=3D start of the next
>> block)
>
> How much of the total time will be I/O and how much actual processing?
> Unless your processing is trivial, the I/O time will be relatively
> small. =A0In that case, you might do well to just use the unix
> command-line "split" utility to split the file into pieces first, then
> process the pieces in parallel. =A0Why waste effort getting the
> file-splitting-at-line-boundaries logic correct when somebody has done
> it for you?
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>