| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Subject | Re: concurrent file reading/writing using python |
| Date | 2012-03-27 09:17 -0400 |
| Organization | > Bestiaria Support Staff < |
| References | <mailman.1023.1332802612.3037.python-list@python.org> <67283b47-9d44-403e-a3df-73ade83a2c0e@z3g2000pbn.googlegroups.com> <CAJbA1KBztxU4LLX7HCQrOqP_5aTw9nP3uZtyV6zVt9JiERKBrQ@mail.gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1037.1332854257.3037.python-list@python.org> |
On Mon, 26 Mar 2012 23:08:08 -0700, Abhishek Pratap
<abhishek.vit@gmail.com> declaimed the following in
gmane.comp.python.general:
> I guess what I need to ask was can we improve on the existing disk I/O
> performance by reading different portions of the file using threads or
> processes. I am kind of pointing towards a MapReduce task on a file in
> a shared file system such as GPFS(from IBM). I realize this can be
> more suited to HDFS but wanted to know if people have implemented
> something similar on a normal linux based NFS
>
At the base, /anything/ that forces seeking on a disk is going to
have a negative impact. Pretending that the OS has nothing else
accessing the disk, a single reader thread generates something like:

    seek to track/block and read directory information to find which
        blocks contain the file
    seek to first data track/block
    read data until end of allocated blocks on this track
    step to next track/block locations
    repeat

If you spawn multiple threads (say, two threads) you end up with:

    1)   seek to track/block and read directory information
    1)   compute offset into file
         seek to [offset] track/data location
         read block
    2)   seek to track/block and read directory information
    2)   compute offset into file
         seek to [offset] track/data location
         read block
    LOOP
    1)   seek [back] to last read location for this thread
         if end of allocated blocks on this track, step to next
             track/block
         read block
    2)   seek [back] to last read location for this thread
         ...
    1/2) repeat until end of data

Much of your I/O time (roughly half, in this two-thread picture)
becomes waiting for the drive head to seek and settle.
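
To put a rough number on the seek argument, here is a toy model in Python. It is only a sketch under strong assumptions: the file occupies consecutive blocks, each thread owns one contiguous slice of it, the drive services the threads strictly round-robin, and directory reads, OS read-ahead, caching and request reordering are all ignored. The names count_seeks, sequential and interleaved are made up for the illustration.

```python
def count_seeks(access_order):
    """Count head movements: every access that does not directly
    follow the previously read block counts as one seek."""
    seeks = 0
    last = None
    for block in access_order:
        if last is None or block != last + 1:
            seeks += 1
        last = block
    return seeks


def sequential(n_blocks):
    """Block order seen by the drive with a single reader."""
    return list(range(n_blocks))


def interleaved(n_blocks, n_threads):
    """Block order seen by the drive when each thread reads its own
    contiguous slice and the requests alternate round-robin."""
    chunk = n_blocks // n_threads
    order = []
    for i in range(chunk):
        for t in range(n_threads):
            order.append(t * chunk + i)
    return order
```

In this model a single reader over 1000 blocks needs one seek, while two alternating readers need a seek before every one of the 1000 reads. A real drive and OS will do better than that worst case, but the direction of the effect is the same.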
As has been suggested, use one master thread that just reads from the
file -- sequentially, no jumping around -- and distributes the records
(via some sort of IPC queue) to the worker processes. Depending on the
architecture you might use the same master to collect results and write
them to the output file. A complication: is the output /order/ dependent
upon the order of the input? If it is, then the collector task would
have to block for each worker in sequence even if some have finished
ahead of others.
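
A minimal sketch of that layout, with several assumptions of mine: it uses threads and queue.Queue from the standard library so that it runs anywhere as-is, and the names reader, worker, run_pipeline and SENTINEL are invented for the example. For CPU-bound work in CPython you would most likely swap in multiprocessing.Process and multiprocessing.Queue, which offer the same start/join and put/get calls. It also handles ordering with a variation on the blocking approach: each record carries a sequence number, and the collector holds back results that arrive early until their turn comes. Error handling is left out; a worker that raises would stall the collector.

```python
import queue
import threading

SENTINEL = None  # stop marker; real items are always tuples


def reader(lines, work_q, n_workers):
    """Master: read sequentially, tag each record with its position."""
    for seq, line in enumerate(lines):
        work_q.put((seq, line))
    for _ in range(n_workers):
        work_q.put(SENTINEL)  # one stop marker per worker


def worker(work_q, result_q, func):
    """Worker: process records until the stop marker arrives."""
    while True:
        item = work_q.get()
        if item is SENTINEL:
            result_q.put(SENTINEL)  # tell the collector we are done
            return
        seq, line = item
        result_q.put((seq, func(line)))


def run_pipeline(lines, func, n_workers=2):
    """Return func(line) for every line, in input order."""
    work_q = queue.Queue(maxsize=100)  # bounded: reader cannot run far ahead
    result_q = queue.Queue()
    threads = [threading.Thread(target=reader,
                                args=(lines, work_q, n_workers))]
    threads += [threading.Thread(target=worker,
                                 args=(work_q, result_q, func))
                for _ in range(n_workers)]
    for t in threads:
        t.start()

    pending = {}   # results that arrived ahead of their turn
    next_seq = 0   # sequence number the output is waiting for
    finished = 0   # workers that have sent their stop marker
    ordered = []
    while finished < n_workers:
        item = result_q.get()
        if item is SENTINEL:
            finished += 1
            continue
        seq, result = item
        pending[seq] = result
        while next_seq in pending:  # flush everything now in order
            ordered.append(pending.pop(next_seq))
            next_seq += 1
    for t in threads:
        t.join()
    return ordered
```

Any iterable of records works as input, including an open file object, so something like run_pipeline(open("input.txt"), str.upper, n_workers=4) would process a (hypothetical) input.txt line by line. If output order does not matter, the pending dictionary can go and results can be written as they arrive.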
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/