
concurrent file reading/writing using python

Started by	Abhishek Pratap <abhishek.vit@gmail.com>
First post	2012-03-26 15:56 -0700
Last post	2012-03-27 09:17 -0400
Articles	4 — 3 participants


  concurrent file reading/writing using python Abhishek Pratap <abhishek.vit@gmail.com> - 2012-03-26 15:56 -0700
    Re: concurrent file reading/writing using python Steve Howell <showell30@yahoo.com> - 2012-03-26 18:44 -0700
      Re: concurrent file reading/writing using python Abhishek Pratap <abhishek.vit@gmail.com> - 2012-03-26 23:08 -0700
      Re: concurrent file reading/writing using python Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-03-27 09:17 -0400

#22214 — concurrent file reading/writing using python

From	Abhishek Pratap <abhishek.vit@gmail.com>
Date	2012-03-26 15:56 -0700
Subject	concurrent file reading/writing using python
Message-ID	<mailman.1023.1332802612.3037.python-list@python.org>

Hi Guys

I am fwding this question from the python tutor list in the hope of
reaching more people experienced in concurrent disk access in python.

I am trying to see if there are ways in which I can read a big file
concurrently on a multi core server and process data and write the
output to a single file as the data is processed.

For example, if I have a 50 GB file, I would like to read it in parallel
with 10 processes/threads, each working on a 5 GB chunk, perform the
same data-parallel computation on each chunk of the file, and collate
the output into a single file.

I will appreciate your feedback. I did find some threads about this on
stackoverflow but it was not clear to me what would be a good  way to
go about implementing this.

Thanks!
-Abhi

---------- Forwarded message ----------
From: Steven D'Aprano <steve@pearwood.info>
Date: Mon, Mar 26, 2012 at 3:21 PM
Subject: Re: [Tutor] concurrent file reading using python
To: tutor@python.org

Abhishek Pratap wrote:
>
> Hi Guys
>
>
> I want to utilize the power of cores on my server and read big files
> (> 50Gb) simultaneously by seeking to N locations.

Yes, you have many cores on the server. But how many hard drives is
each file on? If all the files are on one disk, then you will *kill*
performance dead by forcing the drive to seek backwards and forwards:

seek to 12345678
read a block
seek to 9947500
read a block
seek to 5891124
read a block
seek back to 12345678 + 1 block
read another block
seek back to 9947500 + 1 block
read another block
...

The drive will spend most of its time seeking instead of reading.

Even if you have multiple hard drives in a RAID array, performance
will depend strongly on the details of how it is configured (RAID1,
RAID0, software RAID, hardware RAID, etc.) and how smart the
controller is.

Chances are, though, that the controller won't be smart enough.
Particularly if you have hardware RAID, which in my experience tends
to be more expensive and less useful than software RAID (at least for
Linux).

And what are you planning on doing with the files once you have read
them? I don't know how much memory your server has got, but I'd be
very surprised if you can fit the entire > 50 GB file in RAM at once.
So you're going to read the files and merge the output... by writing
them to the disk. Now you have the drive trying to read *and* write
simultaneously.

TL;DR:

Tasks which are limited by disk IO are not made faster by using a
faster CPU, since the bottleneck is disk access, not CPU speed.

Back in the Ancient Days when tape was the only storage medium, there
were a lot of programs optimised for slow IO. Unfortunately this is
pretty much a lost art -- although disk access is thousands or tens of
thousands of times slower than memory access, it is so much faster
than tape that people don't seem to care much about optimising disk
access.

> What I want to know is the best way to read a file concurrently. I
> have read about file-handle.seek(), os.lseek() but not sure if that's
> the way to go. Any use cases would be of help.

Optimising concurrent disk access is a specialist field. You may be
better off asking for help on the main Python list, comp.lang.python
or python-list@python.org, and hope somebody has some experience with
this. But chances are very high that you will need to search the web
for forums dedicated to concurrent disk access, and translate from
whatever language(s) they are using to Python.
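
For reference, here is a minimal sketch of what seek-based partitioning
of a line-oriented file can look like in Python. The helper names
line_aligned_ranges and read_range are hypothetical, and the sketch
assumes records are newline-terminated. It only shows how to compute
byte ranges that do not split a line; whether reading those ranges
concurrently is any faster depends on the storage, as discussed above.

```python
import os


def line_aligned_ranges(path, n_chunks):
    """Return up to n_chunks (start, end) byte ranges ending on line boundaries.

    Hypothetical helper: assumes newline-terminated records.
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            target = size * i // n_chunks
            if target <= cuts[-1]:
                continue  # a long line already carried us past this target
            f.seek(target)
            f.readline()  # move forward to the end of the current line
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def read_range(path, start, end):
    """Read the bytes in [start, end) using a private file handle."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

Each worker would open its own handle and call read_range (or iterate
over lines within its range), so no file position is shared between
workers.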

--
Steven



#22220

From	Steve Howell <showell30@yahoo.com>
Date	2012-03-26 18:44 -0700
Message-ID	<67283b47-9d44-403e-a3df-73ade83a2c0e@z3g2000pbn.googlegroups.com>
In reply to	#22214

On Mar 26, 3:56 pm, Abhishek Pratap <abhishek....@gmail.com> wrote:
> Hi Guys
>
> I am fwding this question from the python tutor list in the hope of
> reaching more people experienced in concurrent disk access in python.
>
> I am trying to see if there are ways in which I can read a big file
> concurrently on a multi core server and process data and write the
> output to a single file as the data is processed.
>
> For example if I have a 50Gb file, I would like to read it in parallel
> with 10 process/thread, each working on a 10Gb data and perform the
> same data parallel computation on each chunk of fine collating the
> output to a single file.
>
> I will appreciate your feedback. I did find some threads about this on
> stackoverflow but it was not clear to me what would be a good  way to
> go about implementing this.
>

Have you written a single-core solution to your problem?  If so, can
you post the code here?

If CPU isn't your primary bottleneck, then you need to be careful not
to overly complicate your solution by getting multiple cores
involved.  All the coordination might make your program slower and
more buggy.

If CPU is the primary bottleneck, then you might want to consider an
approach where you only have a single thread that's reading records
from the file, 10 at a time, and then dispatching out the calculations
to different threads, then writing results back to disk.

My approach would be something like this:

  1) Take a small sample of your dataset so that you can process it
within 10 seconds or so using a simple, single-core program.
  2) Figure out whether you're CPU bound.  A simple way to do this is
to comment out the actual computation or replace it with a trivial
stub.  If you're CPU bound, the program will run much faster.  If
you're IO-bound, the program won't run much faster (since all the work
is actually just reading from disk).
  3) Figure out how to read 10 records at a time and farm out the
records to threads.  Hopefully, your program will take significantly
less time.  At this point, don't obsess over collating data.  It might
not be 10 times as fast, but it should be somewhat faster to be worth
your while.
  4) If the threaded approach shows promise, make sure that you can
still generate correct output with that approach (in other words,
figure out synchronization and collating).

At the end of that experiment, you should have a better feel for where
to go next.
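
Step 2 above can be sketched roughly as follows. This is only an
illustration: real_work is a hypothetical stand-in for the actual
computation, and the sketch assumes one record per line. Note that the
OS may serve the second pass from its page cache, so timings taken
back to back on a small sample can understate the I/O cost.

```python
import time


def timed_pass(path, compute):
    """Read the file sequentially, apply compute to each line, return (lines, seconds)."""
    start = time.perf_counter()
    n = 0
    with open(path, "rb") as f:
        for line in f:
            compute(line)
            n += 1
    return n, time.perf_counter() - start


def stub(record):
    """Trivial replacement for the computation: does nothing."""
    return None


def real_work(record):
    """Hypothetical stand-in for the real per-record computation."""
    return sum(record) % 251


def compare(path):
    """Time the stubbed pass and the real pass over the same sample file."""
    _, io_only = timed_pass(path, stub)
    _, full = timed_pass(path, real_work)
    return io_only, full
```

If the stubbed pass takes nearly as long as the full pass, the program
is spending its time on I/O, and adding cores is unlikely to help much.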

What is the nature of your computation?  Maybe it would be easier to
tune the algorithm than to figure out the multi-core optimization.


#22233

From	Abhishek Pratap <abhishek.vit@gmail.com>
Date	2012-03-26 23:08 -0700
Message-ID	<mailman.1030.1332828511.3037.python-list@python.org>
In reply to	#22220

Thanks for the advice Dennis.

@Steve: I haven't actually written the code. I was thinking more on
the generic side and wanted to check whether what I thought made sense,
and I now realize it can depend on the I/O.  For starters, I was just
thinking about counting lines in a file without doing any computation,
so this can be strictly I/O bound.
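
As a baseline for that experiment, a single-threaded line counter that
reads the file sequentially in large blocks might look like the sketch
below. The function name count_lines and the 1 MiB default block size
are arbitrary choices for illustration, not tuned values.

```python
def count_lines(path, block_size=1 << 20):
    """Count newline characters by reading the file sequentially in blocks.

    A final line without a trailing newline is not counted.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count(b"\n")
    return count
```

Timing this against any concurrent variant on the same file system
would show whether the extra readers actually buy anything.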

I guess what I needed to ask was whether we can improve on the existing
disk I/O performance by reading different portions of the file using
threads or processes. I am kind of pointing towards a MapReduce task on
a file in a shared file system such as GPFS (from IBM). I realize this
may be better suited to HDFS, but I wanted to know if people have
implemented something similar on a normal Linux-based NFS.

-Abhi




#22241

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2012-03-27 09:17 -0400
Message-ID	<mailman.1037.1332854257.3037.python-list@python.org>
In reply to	#22220

On Mon, 26 Mar 2012 23:08:08 -0700, Abhishek Pratap
<abhishek.vit@gmail.com> declaimed the following in
gmane.comp.python.general:

> I guess what I need to ask was can we improve on the existing disk I/O
> performance by reading different portions of the file using threads or
> processes. I am kind of pointing towards a MapReduce task on a file in
> a shared file system such as GPFS(from IBM). I realize this can be
> more suited to HDFS but wanted to know if people have implemented
> something similar on a normal linux based NFS
> 

	At the base, /anything/ that forces seeking on a disk is going to
have a negative impact. Pretending that the OS has nothing else
accessing the disk, a single reader thread generates something like:

seek to track/block and read directory information to find which blocks
contain the file

seek to first data track/block
	read data until end of allocated blocks on this track
	step to next track/block locations
	repeat

	If you spawn multiple threads (say, two threads) you end up with:

1) seek to track/block and read directory information

1) compute offset into file
	seek to [offset] track/data location
	read block

2) seek to track/block and read directory information

2) compute offset into file
	seek to [offset] track/data location
	read block

LOOP
1)	seek [back] to last read location for this thread
		if end of allocated blocks on this track, step to next track/block
		read block

2)	seek [back] to last read location for this thread
		...

1/2)	repeat until end of data


	Half your I/O time becomes waiting for the drive head to do seeks
and settle.

	As has been suggested, use one master thread to just read from the
file -- sequentially, no jumping around -- and distribute the records
(via some sort of IPC queue) to the worker processes. Depending on the
architecture you might use the same master to collect results and write
them to the output file. A complication: is the output /order/ dependent
upon the order of the input? If it is, then the collector task would
have to block for each worker in sequence even if some have finished
ahead of others.
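
A minimal sketch of that single-reader layout, assuming one record per
line and output that must follow input order. The names batches and
run_pipeline are hypothetical, and a thread pool is swapped in for the
worker processes to keep the example short. In CPython, threads only
speed up computation that releases the GIL; for CPU-bound pure-Python
work, concurrent.futures.ProcessPoolExecutor offers the same map()
interface, provided process_record is a picklable top-level function.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of up to size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_pipeline(in_path, out_path, process_record, workers=4, batch_size=10):
    """One sequential reader, a pool of workers, one writer, input order kept."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for batch in batches(src, batch_size):
                # map() yields results in input order, so the writer waits
                # for each record in sequence even if later ones finish first.
                for result in pool.map(process_record, batch):
                    dst.write(result)
```

The reader waits for each batch to finish before reading the next one,
which keeps memory use bounded at the cost of some idle time. With
worker processes, a much larger batch_size than 10 would probably be
needed to amortize the IPC overhead.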
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

