Processing a file using multithreads

Started by	Abhishek Pratap <abhishek.vit@gmail.com>
First post	2011-09-08 15:49 -0700
Last post	2011-09-09 22:43 -0700
Articles	6 — 5 participants

  Processing a file using multithreads Abhishek Pratap <abhishek.vit@gmail.com> - 2011-09-08 15:49 -0700
    Re: Processing a file using multithreads Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2011-09-09 12:03 +1200
    Re: Processing a file using multithreads aspineux <aspineux@gmail.com> - 2011-09-08 21:44 -0700
      Re: Processing a file using multithreads Roy Smith <roy@panix.com> - 2011-09-09 09:19 -0400
        Re: Processing a file using multithreads Abhishek Pratap <abhishek.vit@gmail.com> - 2011-09-09 10:07 -0700
          Re: Processing a file using multithreads Tim Roberts <timr@probo.com> - 2011-09-09 22:43 -0700

#12980 — Processing a file using multithreads

From	Abhishek Pratap <abhishek.vit@gmail.com>
Date	2011-09-08 15:49 -0700
Subject	Processing a file using multithreads
Message-ID	<mailman.885.1315522214.27778.python-list@python.org>

Hi Guys

I have been using Python for 2 days and I am looking for a slick way
to use multithreading to process a file. Here is what I would like to
do, which is somewhat similar to MapReduce in concept.

# test case

1. My input file is 10 GB.
2. I want to open 10 file handles, each handling 1 GB of the file.
3. Each file handle is processed by an individual thread using the
same function (so a total of 10 cores are assumed to be available on
the machine).
4. There will be 10 different output files.
5. Once the 10 jobs are complete, a reduce-style function will
combine the output.

Could you give me some ideas?

So, given a file, I would like to read it in #N chunks through #N file
handles and process each chunk separately.

Best,
-Abhi


#12983

From	Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date	2011-09-09 12:03 +1200
Message-ID	<9ct3f4FuvnU1@mid.individual.net>
In reply to	#12980

Abhishek Pratap wrote:

> 3. Each file handle is processed by an individual thread using the
> same function ( so total 10 cores are assumed to be available on the
> machine)

Are you expecting the processing to be CPU bound or
I/O bound?

If it's I/O bound, multiple cores won't help you, and
neither will threading, because it's the disk doing the
work, not the CPU.

If it's CPU bound, multiple threads in one Python process
won't help, because of the GIL. You'll have to fork
multiple OS processes in order to get Python code running
in parallel on different cores.
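
A rough way to see this for yourself, assuming standard CPython with
the GIL, is to run the same pure-Python CPU-bound function through a
thread pool and a process pool and compare wall-clock times. The names
busy, timed and compare below are made up for this sketch; it is an
illustration, not a benchmark.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def busy(n):
    """Pure-Python CPU-bound work: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(executor_cls, jobs, workers=4):
    """Run busy() over jobs with the given executor; return (results, seconds)."""
    t0 = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        results = list(ex.map(busy, jobs))
    return results, time.perf_counter() - t0


def compare(jobs=(2_000_000,) * 4):
    """Print wall-clock time for threads and for processes on the same jobs."""
    _, t_threads = timed(ThreadPoolExecutor, jobs)
    _, t_procs = timed(ProcessPoolExecutor, jobs)
    print("threads:   %.2f s" % t_threads)
    print("processes: %.2f s" % t_procs)
```

Call compare() from under an `if __name__ == "__main__":` guard, since
the process pool may re-import the main module on some platforms. On a
multi-core machine the process version would be expected to finish
sooner, but the actual numbers depend on the machine.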

-- 
Greg


#13000

From	aspineux <aspineux@gmail.com>
Date	2011-09-08 21:44 -0700
Message-ID	<c6cbd486-7e5e-4d26-93b9-088d48a25dea@g9g2000yqb.googlegroups.com>
In reply to	#12980

On Sep 9, 12:49 am, Abhishek Pratap <abhishek....@gmail.com> wrote:
> Hi Guys
>
> My experience with python is 2 days and I am looking for a slick way
> to use multi-threading to process a file. Here is what I would like to
> do which is somewhat similar to MapReduce in concept.
>
> # test case
>
> 1. My input file is 10 GB.
> 2. I want to open 10 file handles each handling 1 GB of the file
> 3. Each file handle is processed by an individual thread using the
> same function ( so total 10 cores are assumed to be available on the
> machine)
> 4. There will be 10 different output files
> 5. once the 10 jobs are complete a reduce kind of function will
> combine the output.
>
> Could you give some ideas ?

You can use the "multiprocessing" module instead of threads to bypass
the GIL limitation.

First cut your file into 10 "equal" parts. If it is line based, search
for the first line boundary close to each cut. Be sure to have a
"start" and an "end" for each part: "start" is the offset of the first
character of the part's first line, and "end" is one line too far
(== the "start" of the next part).

Then use this function to handle each part.

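def handle(filename, start, end):
  # Binary mode, so len(l) is a byte count and matches seek() offsets.
  f = open(filename, 'rb')
  f.seek(start)
  pos = start
  for l in f:
    if pos >= end:
      break  # this line belongs to the next part
    pos += len(l)
    # handle line l here
    print(l)
  f.close()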

Do it first in a single process/thread to be sure it works (easier
to debug), then split the work across multiple processes.
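
One possible way to compute the "start"/"end" offsets and hand the
parts to worker processes is sketched below. It assumes a line-based
file with "\n" line endings; find_boundaries, process_part and run are
names invented for this sketch, and the line counting is only a
stand-in for the real per-line work.

```python
import os
from multiprocessing import Pool


def find_boundaries(filename, n_parts):
    """Return n_parts (start, end) byte ranges that begin and end at line starts."""
    size = os.path.getsize(filename)
    cuts = [0]
    with open(filename, "rb") as f:
        for i in range(1, n_parts):
            f.seek(size * i // n_parts)  # approximate cut point
            f.readline()                 # move on to the start of the next line
            cuts.append(min(f.tell(), size))
    cuts.append(size)
    # Pair consecutive cuts; a part can be empty if lines are very long.
    return list(zip(cuts[:-1], cuts[1:]))


def process_part(args):
    """Worker: count the lines in one byte range (placeholder for real work)."""
    filename, start, end = args
    count = 0
    with open(filename, "rb") as f:
        f.seek(start)
        pos = start
        for line in f:
            if pos >= end:
                break  # this line belongs to the next part
            pos += len(line)
            count += 1  # replace with the real per-line processing
    return count


def run(filename, n_parts=10):
    """Map the parts over a process pool, then reduce the partial results."""
    tasks = [(filename, s, e) for s, e in find_boundaries(filename, n_parts)]
    with Pool(processes=n_parts) as pool:
        partial = pool.map(process_part, tasks)
    return sum(partial)
```

Call run() from under an `if __name__ == "__main__":` guard. To debug
in a single process first, loop over find_boundaries() and call
process_part() directly instead of going through the pool.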


>
> So given a file I would like to read it in #N chunks through #N file
> handles and process each of them separately.
>
> Best,
> -Abhi


#13024

From	Roy Smith <roy@panix.com>
Date	2011-09-09 09:19 -0400
Message-ID	<roy-77E2CD.09190709092011@news.panix.com>
In reply to	#13000

In article 
<c6cbd486-7e5e-4d26-93b9-088d48a25dea@g9g2000yqb.googlegroups.com>,
 aspineux <aspineux@gmail.com> wrote:

> On Sep 9, 12:49 am, Abhishek Pratap <abhishek....@gmail.com> wrote:
> > 1. My input file is 10 GB.
> > 2. I want to open 10 file handles each handling 1 GB of the file
> > 3. Each file handle is processed by an individual thread using the
> > same function ( so total 10 cores are assumed to be available on the
> > machine)
> > 4. There will be 10 different output files
> > 5. once the 10 jobs are complete a reduce kind of function will
> > combine the output.
> >
> > Could you give some ideas ?
> 
> You can use "multiprocessing" module instead of thread to bypass the
> GIL limitation.

I agree with this.

> First cut your file into 10 "equal" parts. If it is line based, search
> for the first line boundary close to each cut. Be sure to have a "start"
> and an "end" for each part: "start" is the offset of the first character
> of the part's first line, and "end" is one line too far (== the "start"
> of the next part).

How much of the total time will be I/O and how much actual processing?
Unless your processing is trivial, the I/O time will be relatively
small. In that case, you might do well to just use the Unix
command-line "split" utility to split the file into pieces first, then
process the pieces in parallel. Why waste effort getting the
file-splitting-at-line-boundaries logic correct when somebody has done
it for you?
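
As a sketch of that approach: suppose the file has already been split
with something like `split -l 1000000 big.txt part_`, which writes
pieces named part_aa, part_ab, and so on. The file names, the line
count, and the functions process_piece, combine and run_pieces are all
assumptions made for illustration, and upper-casing each line is just
a placeholder for the real processing.

```python
import glob
from multiprocessing import Pool


def process_piece(path):
    """Process one piece; write results next to it and return the output path."""
    out_path = path + ".out"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())  # placeholder for the real per-line work
    return out_path


def combine(out_paths, final_path):
    """Reduce step: concatenate the per-piece outputs in the given order."""
    with open(final_path, "w") as final:
        for p in out_paths:
            with open(p) as part:
                for line in part:
                    final.write(line)


def run_pieces(pattern="part_*", final_path="combined.out", workers=10):
    """Process every piece matching pattern in parallel, then combine."""
    # Skip .out files left over from an earlier run.
    pieces = sorted(p for p in glob.glob(pattern) if not p.endswith(".out"))
    with Pool(processes=workers) as pool:
        outs = pool.map(process_piece, pieces)
    combine(outs, final_path)
```

Call run_pieces() from under an `if __name__ == "__main__":` guard.
Sorting the piece names keeps the combined output in the original
file order, because split's suffixes sort alphabetically.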


#13030

From	Abhishek Pratap <abhishek.vit@gmail.com>
Date	2011-09-09 10:07 -0700
Message-ID	<mailman.908.1315588083.27778.python-list@python.org>
In reply to	#13024

Hi All

@Roy: split in Unix sounds good, but will it be as efficient as
opening 10 different file handles on one file? I haven't tried it, so
I am just wondering if you have any experience with it.

Thanks for your input. Also, I was not aware of Python's GIL limitation.

My application is not I/O bound as far as I understand it. Each
line is read and then processed independently of the others. Maybe
this sounds I/O intensive, as #N files will be read, but I think if
I have 10 processes running under a parent then it might not be a
bottleneck.

Best,
-Abhi




#13049

From	Tim Roberts <timr@probo.com>
Date	2011-09-09 22:43 -0700
Message-ID	<n4ul67tv8aaktl2mqednke7bbdvmodvncu@4ax.com>
In reply to	#13030

Abhishek Pratap <abhishek.vit@gmail.com> wrote:
>
>My application is not I/O bound as far as I can understand it. Each
>line is read and then processed independently of each other. Maybe
>this might sound I/O intensive as #N files will be read but I think if
>I have 10 processes running under a parent then it might not be a
>bottleneck.

Your conclusion doesn't follow from your premise.  If you are only doing a
little bit of processing on each line, then you almost certainly WILL be
I/O bound.  You will spend most of your time waiting for the disk to
deliver more data.  In that case, multithreading is not a win.  The threads
will all compete with each other for the disk.
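
A crude way to check which case applies is to time one pass that only
reads the lines against one pass that reads and processes them. The
sketch below assumes the per-line work can be wrapped in a function;
time_pass and io_fraction are hypothetical names. The operating
system's file cache can make the second pass look faster than it
would be on a cold read, so treat the result as a rough hint only.

```python
import time


def time_pass(filename, work=None):
    """Read every line once; apply work(line) if given. Return elapsed seconds."""
    t0 = time.perf_counter()
    with open(filename, "rb") as f:
        for line in f:
            if work is not None:
                work(line)
    return time.perf_counter() - t0


def io_fraction(filename, work):
    """Rough share of the run time spent just reading the file."""
    read_only = time_pass(filename)
    read_and_work = time_pass(filename, work)
    return read_only / read_and_work if read_and_work > 0 else 1.0
```

A fraction close to 1 suggests the job is mostly waiting on the disk,
in which case extra workers are unlikely to help much.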
-- 
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
