| Started by | Abhishek Pratap <abhishek.vit@gmail.com> |
|---|---|
| First post | 2011-09-08 15:49 -0700 |
| Last post | 2011-09-09 22:43 -0700 |
| Articles | 6 — 5 participants |
Processing a file using multithreads Abhishek Pratap <abhishek.vit@gmail.com> - 2011-09-08 15:49 -0700
Re: Processing a file using multithreads Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2011-09-09 12:03 +1200
Re: Processing a file using multithreads aspineux <aspineux@gmail.com> - 2011-09-08 21:44 -0700
Re: Processing a file using multithreads Roy Smith <roy@panix.com> - 2011-09-09 09:19 -0400
Re: Processing a file using multithreads Abhishek Pratap <abhishek.vit@gmail.com> - 2011-09-09 10:07 -0700
Re: Processing a file using multithreads Tim Roberts <timr@probo.com> - 2011-09-09 22:43 -0700
| From | Abhishek Pratap <abhishek.vit@gmail.com> |
|---|---|
| Date | 2011-09-08 15:49 -0700 |
| Subject | Processing a file using multithreads |
| Message-ID | <mailman.885.1315522214.27778.python-list@python.org> |
Hi Guys

My experience with Python is 2 days, and I am looking for a slick way
to use multi-threading to process a file. Here is what I would like to
do, which is somewhat similar to MapReduce in concept.

# test case

1. My input file is 10 GB.
2. I want to open 10 file handles, each handling 1 GB of the file.
3. Each file handle is processed by an individual thread using the
same function (so a total of 10 cores are assumed to be available on
the machine).
4. There will be 10 different output files.
5. Once the 10 jobs are complete, a reduce kind of function will
combine the output.

Could you give some ideas?

So given a file, I would like to read it in #N chunks through #N file
handles and process each of them separately.

Best,
-Abhi
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
|---|---|
| Date | 2011-09-09 12:03 +1200 |
| Message-ID | <9ct3f4FuvnU1@mid.individual.net> |
| In reply to | #12980 |
Abhishek Pratap wrote:
> 3. Each file handle is processed in by an individual thread using the
> same function ( so total 10 cores are assumed to be available on the
> machine)

Are you expecting the processing to be CPU bound or I/O bound?

If it's I/O bound, multiple cores won't help you, and neither will
threading, because it's the disk doing the work, not the CPU.

If it's CPU bound, multiple threads in one Python process won't help,
because of the GIL. You'll have to fork multiple OS processes in order
to get Python code running in parallel on different cores.

--
Greg
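To make the CPU-bound case concrete, here is a minimal sketch using the standard library's multiprocessing.Pool. The count_primes function is a hypothetical stand-in for real per-chunk work, and run_in_processes is an illustrative name, not something from the thread. Each pool worker is a separate OS process with its own interpreter and its own GIL, which is why this can use several cores where threads cannot.

```python
import multiprocessing


def count_primes(limit):
    """CPU-bound placeholder work: count the primes below limit."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in [2, sqrt(n)] divides it
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_in_processes(limits, workers=2):
    """Evaluate count_primes for every limit, spread over worker processes."""
    # Every worker is its own OS process, so the GIL of one worker
    # does not block the others.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(count_primes, limits)
```

When run as a script, call run_in_processes from under an `if __name__ == "__main__":` guard, since some platforms start workers by re-importing the main module.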
| From | aspineux <aspineux@gmail.com> |
|---|---|
| Date | 2011-09-08 21:44 -0700 |
| Message-ID | <c6cbd486-7e5e-4d26-93b9-088d48a25dea@g9g2000yqb.googlegroups.com> |
| In reply to | #12980 |
On Sep 9, 12:49 am, Abhishek Pratap <abhishek....@gmail.com> wrote:
> Hi Guys
>
> My experience with python is 2 days and I am looking for a slick way
> to use multi-threading to process a file. Here is what I would like to
> do which is somewhat similar to MapReduce in concept.
>
> # test case
>
> 1. My input file is 10 GB.
> 2. I want to open 10 file handles each handling 1 GB of the file
> 3. Each file handle is processed in by an individual thread using the
> same function ( so total 10 cores are assumed to be available on the
> machine)
> 4. There will be 10 different output files
> 5. once the 10 jobs are complete a reduce kind of function will
> combine the output.
>
> Could you give some ideas ?
You can use the "multiprocessing" module instead of threads to bypass
the GIL limitation.
First cut your file into 10 "equal" parts. If it is line based, search
for the first line boundary close to each cut. Be sure to have a
"start" and an "end" for each part: start is the offset of the first
character of the part's first line, and end is the offset just past
its last line (== start of the next block).
Then use this function to handle each part:
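def handle(filename, start, end):
    # Binary mode, so len(line) counts bytes and matches the seek offsets.
    with open(filename, "rb") as f:
        f.seek(start)
        pos = start  # offset of the line about to be handled
        for line in f:
            if pos >= end:
                break  # this line belongs to the next block
            pos += len(line)
            # handle the line here
            print(line)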
Do it first in a single process/thread to make sure it works (easier
to debug), then split the work across multiple processes.
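The cutting step described above can be sketched as follows. This is only one possible approach under stated assumptions: chunk_offsets, count_lines and process_file are hypothetical names, the file is assumed to use "\n" line endings, and counting lines is a placeholder for the real per-line work.

```python
import multiprocessing
import os


def chunk_offsets(filename, n_chunks):
    """Return (start, end) byte offsets, each start aligned to a line start."""
    size = os.path.getsize(filename)
    cuts = [0]
    with open(filename, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)  # raw, unaligned cut
            f.readline()                  # skip ahead to the next line start
            cuts.append(min(f.tell(), size))
    cuts.append(size)
    # Drop empty chunks, which appear when a line is longer than a chunk.
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if a < b]


def count_lines(args):
    """Placeholder worker: count the lines in one (filename, start, end)."""
    filename, start, end = args
    n = 0
    with open(filename, "rb") as f:
        f.seek(start)
        pos = start
        for line in f:
            if pos >= end:
                break  # reached the next chunk
            pos += len(line)
            n += 1
    return n


def process_file(filename, n_chunks=10):
    """Map the worker over all chunks in parallel, then reduce by summing."""
    jobs = [(filename, s, e) for s, e in chunk_offsets(filename, n_chunks)]
    with multiprocessing.Pool(processes=n_chunks) as pool:
        return sum(pool.map(count_lines, jobs))
```

Running count_lines over the chunks in a plain loop first, as suggested above, gives a single-process baseline to compare against before calling process_file (from under an `if __name__ == "__main__":` guard).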
>
> So given a file I would like to read it in #N chunks through #N file
> handles and process each of them separately.
>
> Best,
> -Abhi
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2011-09-09 09:19 -0400 |
| Message-ID | <roy-77E2CD.09190709092011@news.panix.com> |
| In reply to | #13000 |
In article
<c6cbd486-7e5e-4d26-93b9-088d48a25dea@g9g2000yqb.googlegroups.com>,
aspineux <aspineux@gmail.com> wrote:

> On Sep 9, 12:49 am, Abhishek Pratap <abhishek....@gmail.com> wrote:
> > 1. My input file is 10 GB.
> > 2. I want to open 10 file handles each handling 1 GB of the file
> > 3. Each file handle is processed in by an individual thread using the
> > same function ( so total 10 cores are assumed to be available on the
> > machine)
> > 4. There will be 10 different output files
> > 5. once the 10 jobs are complete a reduce kind of function will
> > combine the output.
> >
> > Could you give some ideas ?
>
> You can use "multiprocessing" module instead of thread to bypass the
> GIL limitation.

I agree with this.

> First cut your file in 10 "equal" parts. If it is line based search
> for the first line close to the cut. Be sure to have "start" and
> "end" for each parts, start is the address of the first character of
> the first line and end is one line too much (== start of the next
> block)

How much of the total time will be I/O and how much actual processing?
Unless your processing is trivial, the I/O time will be relatively
small. In that case, you might do well to just use the unix
command-line "split" utility to split the file into pieces first, then
process the pieces in parallel. Why waste effort getting the
file-splitting-at-line-boundaries logic correct when somebody has done
it for you?
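A rough sketch of the "split first, then process in parallel" route might look like this. It assumes the pieces were already produced by something like `split -l 1000000 input.txt part_`, so that they are named part_aa, part_ab and so on; the prefix, the ".out" suffix, the function names and the upper-casing "work" are all illustrative placeholders rather than anything from the thread.

```python
import glob
import multiprocessing


def process_piece(path):
    """Map step: process one piece and write the result to <path>.out."""
    out_path = path + ".out"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())  # stand-in for the real processing
    return out_path


def combine(out_paths, final_path):
    """Reduce step: concatenate the per-piece outputs in order."""
    with open(final_path, "w") as final:
        for out_path in out_paths:
            with open(out_path) as part:
                for line in part:
                    final.write(line)


def run(prefix, final_path, workers=10):
    """Process every piece matching prefix in parallel, then combine."""
    # Skip outputs left over from an earlier run; final_path should not
    # itself start with prefix.
    pieces = sorted(p for p in glob.glob(prefix + "*")
                    if not p.endswith(".out"))
    with multiprocessing.Pool(processes=workers) as pool:
        out_paths = pool.map(process_piece, pieces)
    combine(out_paths, final_path)
```

Because pool.map returns results in input order and the piece names sort alphabetically, the combined file keeps the original line order.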
| From | Abhishek Pratap <abhishek.vit@gmail.com> |
|---|---|
| Date | 2011-09-09 10:07 -0700 |
| Message-ID | <mailman.908.1315588083.27778.python-list@python.org> |
| In reply to | #13024 |
Hi All

@Roy: split in unix sounds good, but will it be as efficient as
opening 10 different file handles on a file? I haven't tried it, so
just wondering if you have any experience with it. Thanks for your
input. Also, I was not aware of Python's GIL limitation.

My application is not I/O bound as far as I can understand it. Each
line is read and then processed independently of the others. Maybe
this sounds I/O intensive, as #N files will be read, but I think if
I have 10 processes running under a parent then it might not be a
bottleneck.

Best,
-Abhi

On Fri, Sep 9, 2011 at 6:19 AM, Roy Smith <roy@panix.com> wrote:
> How much of the total time will be I/O and how much actual processing?
> Unless your processing is trivial, the I/O time will be relatively
> small. In that case, you might do well to just use the unix
> command-line "split" utility to split the file into pieces first, then
> process the pieces in parallel. Why waste effort getting the
> file-splitting-at-line-boundaries logic correct when somebody has done
> it for you?
| From | Tim Roberts <timr@probo.com> |
|---|---|
| Date | 2011-09-09 22:43 -0700 |
| Message-ID | <n4ul67tv8aaktl2mqednke7bbdvmodvncu@4ax.com> |
| In reply to | #13030 |
Abhishek Pratap <abhishek.vit@gmail.com> wrote:
>
>My application is not I/O bound as far as I can understand it. Each
>line is read and then processed independently of each other. May be
>this might sound I/O intensive as #N files will be read but I think if
>I have 10 processes running under a parent then it might not be a
>bottle neck.

Your conclusion doesn't follow from your premise. If you are only
doing a little bit of processing on each line, then you almost
certainly WILL be I/O bound. You will spend most of your time waiting
for the disk to deliver more data. In that case, multithreading is not
a win. The threads will all compete with each other for the disk.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
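One way to check which case applies is to time a pass that only reads the file against a pass that reads and processes it. The sketch below is a rough measurement aid, not a benchmark: time_pass and io_fraction are hypothetical names, and the operating system's file cache can make the second pass look faster than a cold read, so it is best tried on a sample that has not just been read.

```python
import time


def time_pass(filename, work=None):
    """Return the seconds taken to read every line, optionally calling work."""
    started = time.perf_counter()
    with open(filename, "rb") as f:
        for line in f:
            if work is not None:
                work(line)
    return time.perf_counter() - started


def io_fraction(filename, work):
    """Estimate the share of the total time spent just reading the file."""
    read_only = time_pass(filename)
    read_and_work = time_pass(filename, work)
    # Guard against a zero timing on a tiny file.
    return read_only / read_and_work if read_and_work else 1.0
```

A result close to 1 suggests the job is dominated by reading, in which case extra processes mostly compete for the disk; a result well below 1 suggests the per-line work dominates and parallel processes are more likely to pay off.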