| References | <mailman.1023.1332802612.3037.python-list@python.org> <67283b47-9d44-403e-a3df-73ade83a2c0e@z3g2000pbn.googlegroups.com> |
|---|---|
| From | Abhishek Pratap <abhishek.vit@gmail.com> |
| Date | 2012-03-26 23:08 -0700 |
| Subject | Re: concurrent file reading/writing using python |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1030.1332828511.3037.python-list@python.org> (permalink) |
Thanks for the advice Dennis.

@Steve: I haven't actually written the code. I was thinking more on the generic side and wanted to check whether what I thought made sense; I now realize it can depend on the I/O. For starters, I was just thinking about counting lines in a file without doing any computation, so this can be strictly I/O bound.

I guess what I needed to ask was: can we improve on the existing disk I/O performance by reading different portions of the file using threads or processes? I am kind of pointing towards a MapReduce task on a file in a shared file system such as GPFS (from IBM). I realize this may be better suited to HDFS, but I wanted to know if people have implemented something similar on a normal Linux-based NFS.

-Abhi

On Mon, Mar 26, 2012 at 6:44 PM, Steve Howell <showell30@yahoo.com> wrote:
> On Mar 26, 3:56 pm, Abhishek Pratap <abhishek....@gmail.com> wrote:
>> Hi Guys
>>
>> I am fwding this question from the python tutor list in the hope of
>> reaching more people experienced in concurrent disk access in python.
>>
>> I am trying to see if there are ways in which I can read a big file
>> concurrently on a multi core server and process data and write the
>> output to a single file as the data is processed.
>>
>> For example if I have a 50Gb file, I would like to read it in parallel
>> with 10 processes/threads, each working on a 10Gb data and perform the
>> same data parallel computation on each chunk of file, collating the
>> output to a single file.
>>
>> I will appreciate your feedback. I did find some threads about this on
>> stackoverflow but it was not clear to me what would be a good way to
>> go about implementing this.
>>
>
> Have you written a single-core solution to your problem? If so, can
> you post the code here?
>
> If CPU isn't your primary bottleneck, then you need to be careful not
> to overly complicate your solution by getting multiple cores
> involved. All the coordination might make your program slower and
> more buggy.
>
> If CPU is the primary bottleneck, then you might want to consider an
> approach where you only have a single thread that's reading records
> from the file, 10 at a time, and then dispatching out the calculations
> to different threads, then writing results back to disk.
>
> My approach would be something like this:
>
> 1) Take a small sample of your dataset so that you can process it
> within 10 seconds or so using a simple, single-core program.
> 2) Figure out whether you're CPU bound. A simple way to do this is
> to comment out the actual computation or replace it with a trivial
> stub. If you're CPU bound, the program will run much faster. If
> you're IO-bound, the program won't run much faster (since all the work
> is actually just reading from disk).
> 3) Figure out how to read 10 records at a time and farm out the
> records to threads. Hopefully, your program will take significantly
> less time. At this point, don't obsess over collating data. It might
> not be 10 times as fast, but it should be somewhat faster to be worth
> your while.
> 4) If the threaded approach shows promise, make sure that you can
> still generate correct output with that approach (in other words,
> figure out synchronization and collating).
>
> At the end of that experiment, you should have a better feel for where
> to go next.
>
> What is the nature of your computation? Maybe it would be easier to
> tune the algorithm than figure out the multi-core optimization.
>
> --
> http://mail.python.org/mailman/listinfo/python-list
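
For what it's worth, the line-counting question above could be tried with something like the sketch below. It is only an illustration, not something from the thread: the function names `count_newlines_in_range` and `parallel_line_count` are made up, it uses threads from the standard library's `concurrent.futures` rather than separate processes, and it assumes that counting newline bytes per byte range is an acceptable definition of "counting lines". Whether it beats a single sequential read depends on the storage underneath (local disk, NFS, GPFS), so it needs to be timed against a plain one-pass count.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_newlines_in_range(path, start, end, bufsize=1 << 20):
    """Count newline bytes in the byte range [start, end) of path."""
    count = 0
    # Each worker opens its own handle so seeks do not interfere.
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(bufsize, remaining))
            if not block:  # file shorter than expected
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count


def parallel_line_count(path, workers=10):
    """Split the file into byte ranges and sum the per-range counts.

    Hypothetical helper: the number of workers and the even split
    are assumptions, not tuned values.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0
    chunk = -(-size // workers)  # ceiling division
    ranges = [(s, min(s + chunk, size)) for s in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda r: count_newlines_in_range(path, r[0], r[1]), ranges
        )
        return sum(counts)
```

Because a newline byte belongs to exactly one range, the per-range counts add up to the same total a single pass would give, so no collating logic is needed for this particular task. Swapping in `ProcessPoolExecutor` would need a module-level worker function instead of the lambda.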
concurrent file reading/writing using python Abhishek Pratap <abhishek.vit@gmail.com> - 2012-03-26 15:56 -0700
Re: concurrent file reading/writing using python Steve Howell <showell30@yahoo.com> - 2012-03-26 18:44 -0700
Re: concurrent file reading/writing using python Abhishek Pratap <abhishek.vit@gmail.com> - 2012-03-26 23:08 -0700
Re: concurrent file reading/writing using python Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-03-27 09:17 -0400