Re: Threading model for reading 1,000 files quickly?

Date	2012-10-01 05:00 -0700
From	Patricia Shanahan <pats@acm.org>
Newsgroups	comp.lang.java.programmer
Subject	Re: Threading model for reading 1,000 files quickly?
References	<051fc3d6-d22c-438a-b4d3-84378e447733@googlegroups.com> <K-SdnU_ujNRnyvTNnZ2dnUVZ7q8AAAAA@bt.com>
Message-ID	<1YednWK6TvBYGPTNnZ2dnUVZ_uudnZ2d@earthlink.com> (permalink)

On 10/1/2012 1:43 AM, Chris Uppal wrote:
> leegee@gmail.com wrote:
>> I have a directory with many sub-directories, each with many thousands of
>> files.
>>
>> I wish to process each file, which takes one or two seconds.
>>
>> I wish to simultaneously process as many files as possible.
>
> Your problem here is not threading, but disk IO.  Specifically disk seeks.  If
> you are using a rotating disk (as opposed to an SSD), and all the files are on
> the same spindle, then using > 1 thread will just slow things down as the
> different threads "fight" to position the disk heads over "their" files.
>
> If you are using more than one spindle (say in a RAID array) then you
> may find benefits in using a similar number of threads.
>
> If the processing is CPU bound rather than IO bound when you are
> processing just one file (doesn't sound like it, but may be true)
> then you can perhaps get benefits by using roughly as many threads
> as you have real cores available to compute.

I agree with the idea that the objective, for a rotating disk, should
probably be to optimize use of the disk head's time. I disagree with
the conclusion that more than one thread will just slow things down.

There is no reason to expect the files to be laid out on disk in the
order of requests. It is entirely possible that files N+2, N+3, and N+4
are physically between files N and N+1 for some values of N. The
operating system, the disk drive, or both may be reordering requests
to reduce head movement. If the scheduling algorithm knows that all of
N through N+4 are needed, it can stop the head at each track that has
one of them and read it as the head is moving from N to N+1.

If you feed the requests to the operating system one at a time, and wait
for each to finish, the disk head will be forced to do the reads in
First-Come-First-Served order, regardless of disk placement. That will
probably not be the optimal order.

If you have too many requests outstanding there is a risk of overloading
the operating system's buffering.

I would suggest either using asynchronous I/O or a thread pool, so that
the number of outstanding requests can be tuned based on measurements. I
will be surprised if the optimal queue length is one.
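
As a rough illustration of the thread pool variant, here is a minimal
sketch. It is not from the original thread. The class name
BoundedFileReader, the method readAll, and the list of trial queue
lengths are all made up for the example, and reading each file whole
stands in for the real one-to-two-second processing step. The pool
size is the knob that bounds how many read requests are outstanding
at once.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical example class; all names here are illustrative.
class BoundedFileReader {

    /**
     * Reads every file with at most queueLength reads in flight,
     * and returns the total number of bytes read.
     */
    static long readAll(List<Path> files, int queueLength) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(queueLength);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path file : files) {
                // Placeholder for the real per-file processing.
                Callable<Long> task = () -> (long) Files.readAllBytes(file).length;
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // rethrows any read failure
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Path> files;
        // Collect all regular files under the directory given as args[0].
        try (Stream<Path> walk = Files.walk(Paths.get(args[0]))) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        // Assumed trial values; tune these from your own measurements.
        for (int queueLength : new int[] {1, 2, 4, 8, 16}) {
            long start = System.nanoTime();
            long bytes = readAll(files, queueLength);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(queueLength + " threads: " + bytes
                    + " bytes in " + millis + " ms");
        }
    }
}
```

Note that the operating system's file cache will probably make every
run after the first look faster than it should, so a fair comparison
needs a cold cache or a different set of files for each queue length.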

Patricia
