Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!ecngs!feeder2.ecngs.de!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Subject: Re: concurrent file reading/writing using python
Date: Tue, 27 Mar 2012 09:17:38 -0400
Organization: > Bestiaria Support Staff <
References: <mailman.1023.1332802612.3037.python-list@python.org> <67283b47-9d44-403e-a3df-73ade83a2c0e@z3g2000pbn.googlegroups.com> <CAJbA1KBztxU4LLX7HCQrOqP_5aTw9nP3uZtyV6zVt9JiERKBrQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1037.1332854257.3037.python-list@python.org>
Lines: 66
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:22241

On Mon, 26 Mar 2012 23:08:08 -0700, Abhishek Pratap
<abhishek.vit@gmail.com> declaimed the following in
gmane.comp.python.general:

> I guess what I need to ask was can we improve on the existing disk I/O
> performance by reading different portions of the file using threads or
> processes. I am kind of pointing towards a MapReduce task on a file in
> a shared file system such as GPFS(from IBM). I realize this can be
> more suited to HDFS but wanted to know if people have implemented
> something similar on a normal linux based NFS
> 

	At the base, /anything/ that forces seeking on a disk is going to
have a negative impact. Pretending that the OS has nothing else
accessing the disk, a single reader thread generates something like:

seek to track/block and read directory information to find which blocks
contain the file

seek to first data track/block
	read data until end of allocated blocks on this track
	step to next track/block locations
	repeat

	If you spawn multiple threads (say, two threads) you end up with:

1) seek to track/block and read directory information

1) compute offset into file
	seek to [offset] track/data location
	read block

2) seek to track/block and read directory information

2) compute offset into file
	seek to [offset] track/data location
	read block

LOOP
1)	seek [back] to last read location for this thread
		if end of allocated blocks on this track, step to next track/block
		read block

2)	seek [back] to last read location for this thread
		...

1/2)	repeat until end of data


	Half your I/O time becomes waiting for the drive head to do seeks
and settle.
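
	To put a rough number on that, here is a toy model, not a
measurement. It assumes a spinning disk with the file laid out
contiguously, no OS read-ahead, no drive cache, and no request
reordering, and it counts head travel in "blocks" only. The function
names and the block count are made up for illustration.

```python
def head_travel(accesses):
    """Total distance (in blocks) the head moves for a sequence of reads."""
    travel = 0
    position = 0  # assume the head starts at block 0
    for block in accesses:
        travel += abs(block - position)
        position = block
    return travel


def sequential(n_blocks):
    """One reader walking the file front to back."""
    return list(range(n_blocks))


def interleaved(n_blocks, n_threads=2):
    """Each thread owns one contiguous slice; the threads take turns reading."""
    chunk = n_blocks // n_threads
    order = []
    for i in range(chunk):
        for t in range(n_threads):
            order.append(t * chunk + i)
    return order


N_BLOCKS = 1000  # hypothetical file size in blocks
seq_travel = head_travel(sequential(N_BLOCKS))
par_travel = head_travel(interleaved(N_BLOCKS))
print("sequential:", seq_travel, "blocks of head travel")
print("two threads:", par_travel, "blocks of head travel")
```

	Real systems soften this with caching and elevator scheduling, and
an SSD has no head to move at all, so treat the ratio as a worst case
for rotating media rather than a prediction.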

	As has been suggested, use one master thread to just read from the
file -- sequentially, no jumping around -- and distribute the records
(via some sort of IPC queue) to the worker processes. Depending on the
architecture, you might use the same master to collect results and write
them to the output file. A complication: is the output /order/ dependent
upon the order of the input? If it is, then the collector task would
have to block for each worker in sequence even if some have finished
ahead of others.
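
	A minimal sketch of that layout follows, under some assumptions of
mine. It uses threads and queue.Queue so that it runs anywhere as-is;
for real worker processes you would swap in multiprocessing.Process
and multiprocessing.Queue. Instead of blocking on each worker in
turn, the collector here tags every record with a sequence number and
holds early results until their predecessors arrive, which is one
variation on the same idea. The names reader, worker,
collect_in_order and run_pipeline are illustrative only.

```python
import queue
import threading

SENTINEL = None  # marks "no more data" on either queue


def reader(lines, work_q, n_workers):
    """Master: read sequentially and tag each record with its position."""
    for seq, line in enumerate(lines):
        work_q.put((seq, line))
    for _ in range(n_workers):
        work_q.put(SENTINEL)  # one stop marker per worker


def worker(work_q, result_q, process):
    """Worker: process records until the stop marker shows up."""
    while True:
        item = work_q.get()
        if item is SENTINEL:
            result_q.put(SENTINEL)  # tell the collector this worker is done
            break
        seq, record = item
        result_q.put((seq, process(record)))


def collect_in_order(result_q, n_workers):
    """Collector: buffer out-of-order results, release them in input order."""
    pending = {}
    next_seq = 0
    finished = 0
    ordered = []
    while finished < n_workers:
        item = result_q.get()
        if item is SENTINEL:
            finished += 1
            continue
        seq, result = item
        pending[seq] = result
        while next_seq in pending:
            ordered.append(pending.pop(next_seq))
            next_seq += 1
    return ordered


def run_pipeline(lines, process, n_workers=4):
    work_q = queue.Queue(maxsize=100)  # bounded, so the reader cannot run far ahead
    result_q = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(work_q, result_q, process))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    reader_thread = threading.Thread(target=reader, args=(lines, work_q, n_workers))
    reader_thread.start()
    ordered = collect_in_order(result_q, n_workers)
    reader_thread.join()
    for w in workers:
        w.join()
    return ordered
```

	In use, "lines" would be an open file object (iterating over it
reads sequentially) and "process" whatever per-record work you need,
e.g. run_pipeline(open(path), str.upper) with a path of your own. If
output order does not matter, the collector can simply write results
as they arrive and skip the buffering.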
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/