From: Dennis Lee Bieber
Subject: Re: concurrent file reading/writing using python
Date: Tue, 27 Mar 2012 09:17:38 -0400
Newsgroups: comp.lang.python

On Mon, 26 Mar 2012 23:08:08 -0700, Abhishek Pratap declaimed the
following in gmane.comp.python.general:

> I guess what I need to ask was can we improve on the existing disk I/O
> performance by reading different portions of the file using threads or
> processes. I am kind of pointing towards a MapReduce task on a file in
> a shared file system such as GPFS(from IBM). I realize this can be
> more suited to HDFS but wanted to know if people have implemented
> something similar on a normal linux based NFS
>

At the base, /anything/ that forces seeking on a disk is going to have a negative impact.
Pretending that the OS has nothing else accessing the disk, a single reader thread generates something like:

    seek to track/block and read directory information to find which
        blocks contain the file
    seek to first data track/block
    read data until end of allocated blocks on this track
    step to next track/block locations
    repeat

If you spawn multiple threads (say, two threads for example) you end up with:

    1) seek to track/block and read directory information
    1) compute offset into file
       seek to [offset] track/data location
       read block
    2) seek to track/block and read directory information
    2) compute offset into file
       seek to [offset] track/data location
       read block
    LOOP
    1) seek [back] to last read location for this thread
       if end of allocated blocks on this track, step to next track/block
       read block
    2) seek [back] to last read location for this thread
       ...
    1/2) repeat until end of data

Half your I/O time becomes waiting for the drive head to do seeks and settle.

As has been suggested, use one master thread to just read from the file -- sequentially, no jumping around -- and distribute the records (via some sort of IPC queue) to the worker processes. Depending on the architecture, you might use the same master to collect results and write them to the output file.

A complication: is the output /order/ dependent upon the order of the input? If it is, then the collector task would have to block for each worker in sequence, even if some have finished ahead of others.

-- 
Wulfraed                 Dennis Lee Bieber         AF6VN
wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
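
A rough sketch of the single-reader layout described above, under some assumptions: it uses the standard library's concurrent.futures pool in place of a hand-built IPC queue, and the names process_in_order, work and max_pending are made up for illustration. The master iterates over the input sequentially, hands each record to the pool, and collects results oldest-first, so it blocks on the oldest outstanding record even if later ones have already finished.

```python
import collections
import io
from concurrent.futures import ThreadPoolExecutor


def process_in_order(records, work, executor, max_pending=8):
    """Yield work(record) for every record, in input order.

    records     -- any iterable read front to back (e.g. an open file)
    work        -- hypothetical per-record function run by the workers
    executor    -- a concurrent.futures executor (thread or process pool)
    max_pending -- illustrative cap on records in flight, so the reader
                   does not run arbitrarily far ahead of the workers
    """
    pending = collections.deque()
    for record in records:                      # the only reader: no seeking
        pending.append(executor.submit(work, record))
        if len(pending) >= max_pending:
            # Block on the OLDEST record, even if newer ones are done.
            yield pending.popleft().result()
    while pending:                              # drain what is left
        yield pending.popleft().result()


if __name__ == "__main__":
    # In-memory stand-ins for the input and output files.
    source = io.StringIO("alpha\nbeta\ngamma\n")
    sink = io.StringIO()
    with ThreadPoolExecutor(max_workers=2) as pool:
        for result in process_in_order(source, str.upper, pool):
            sink.write(result)                  # master also writes output
    print(sink.getvalue(), end="")
```

For CPU-bound work, a ProcessPoolExecutor could presumably be passed in instead of the thread pool; in that case work has to be a picklable top-level function and the script needs the usual __main__ guard. If output order does not matter, concurrent.futures.as_completed would let the collector take results as they finish rather than waiting on the oldest.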