Groups > comp.lang.java.programmer > #19007 > unrolled thread

Threading model for reading 1,000 files quickly?

Started by	leegee@gmail.com
First post	2012-10-01 00:11 -0700
Last post	2012-10-01 20:11 -0700
Articles	8 — 7 participants

Back to article view | Back to comp.lang.java.programmer

  Threading model for reading 1,000 files quickly? leegee@gmail.com - 2012-10-01 00:11 -0700
    Re: Threading model for reading 1,000 files quickly? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-01 09:43 +0100
      Re: Threading model for reading 1,000 files quickly? Patricia Shanahan <pats@acm.org> - 2012-10-01 05:00 -0700
        Re: Threading model for reading 1,000 files quickly? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-03 08:24 +0100
          Re: Threading model for reading 1,000 files quickly? Robert Klemme <shortcutter@googlemail.com> - 2012-10-03 13:58 +0200
      Re: Threading model for reading 1,000 files quickly? markspace <-@.> - 2012-10-01 09:35 -0700
    Re: Threading model for reading 1,000 files quickly? Eric Sosman <esosman@ieee-dot-org.invalid> - 2012-10-01 09:32 -0400
    Re: Threading model for reading 1,000 files quickly? Kevin McMurtrie <mcmurtrie@pixelmemory.us> - 2012-10-01 20:11 -0700

#19007 — Threading model for reading 1,000 files quickly?

From	leegee@gmail.com
Date	2012-10-01 00:11 -0700
Subject	Threading model for reading 1,000 files quickly?
Message-ID	<051fc3d6-d22c-438a-b4d3-84378e447733@googlegroups.com>

I have directory with many sub-directories, each with many thousands of files.

I wish to process each file, which takes one or two seconds.

I wish to simultaneously process as many files as possible.

I'm not clear the best approach to this.

Should I use a thread pool? Is Java capable of maximising the hardware resources to determine how many threads it can simultaneously execute?

TIA
Lee

[toc] | [next] | [standalone]

#19009

From	"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date	2012-10-01 09:43 +0100
Message-ID	<K-SdnU_ujNRnyvTNnZ2dnUVZ7q8AAAAA@bt.com>
In reply to	#19007

leegee@gmail.com wrote:
> I have directory with many sub-directories, each with many thousands of
> files.
>
> I wish to process each file, which takes one or two seconds.
>
> I wish to simultaneously process as many files as possible.

Your problem here is not threading, but disk IO.  Specifically disk seeks.  If 
you are using a rotating disk (as opposed to a SSD), and all the files are on 
the same spindle, then using > 1 thread will just slow things down as the 
different thread "fight" to position the disk heads over "their" files.

If you are using more than one spindle (say in a RAID array) then you may find 
benefits in using a similar number of threads.

If the processing is CPU bound rather than IO bound when you are processing 
just one file (doesn't sound like it, but may be true) then you can perhaps get 
benefits by using roughly as many threads and you have real cores available to 
compute.

    -- chris

[toc] | [prev] | [next] | [standalone]

#19010

From	Patricia Shanahan <pats@acm.org>
Date	2012-10-01 05:00 -0700
Message-ID	<1YednWK6TvBYGPTNnZ2dnUVZ_uudnZ2d@earthlink.com>
In reply to	#19009

On 10/1/2012 1:43 AM, Chris Uppal wrote:
> leegee@gmail.com wrote:
>> I have directory with many sub-directories, each with many thousands of
>> files.
>>
>> I wish to process each file, which takes one or two seconds.
>>
>> I wish to simultaneously process as many files as possible.
>
> Your problem here is not threading, but disk IO.  Specifically disk seeks.  If
> you are using a rotating disk (as opposed to a SSD), and all the files are on
> the same spindle, then using > 1 thread will just slow things down as the
> different thread "fight" to position the disk heads over "their" files.
>
> If you are using more than one spindle (say in a RAID array) then you
> may find benefits in using a similar number of threads.
>
> If the processing is CPU bound rather than IO bound when you are
> processing just one file (doesn't sound like it, but may be true)
> then you can perhaps get benefits by using roughly as many threads
> and you have real cores available to compute.

I agree with the idea that the objective, for rotating disk, should
probably be to optimize use of the disk head's time. I disagree with
the conclusion.

There is no reason to expect the files to be laid out on disk in the
order of requests. It is entirely possible that files N+2, N+3, and N+4
are physically between files N and N+1 for some values of N. Either or
both of the operating system or the disk drive may be optimizing the
request order to reduce head movement. If the scheduling algorithm knows
that all of N through N+4 are needed, it can stop the head at each
track that has one of them and read it as the head is moving from N to N+1.

If you feed the requests to the operating system one at a time, and wait
for each to finish, the disk head will be forced to do the reads in
First-Come-First-Served order, regardless of disk placement. That will
probably not be the optimal order.

If you have too many requests outstanding there is a risk of overloading
the operating system's buffering.

I would suggest either using asynchronous I/O or a thread pool, so that
the number of outstanding requests can be tuned based on measurements. I
will be surprised of the optimal queue length is one.

Patricia

[toc] | [prev] | [next] | [standalone]

#19073

From	"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date	2012-10-03 08:24 +0100
Message-ID	<jJmdnZ-kl6IJdfbNnZ2dnUVZ7rCdnZ2d@bt.com>
In reply to	#19010

Patricia Shanahan wrote:

> > [me]
> > If you are using more than one spindle (say in a RAID array) then you
> > may find benefits in using a similar number of threads.
> >
> > If the processing is CPU bound rather than IO bound when you are
> > processing just one file (doesn't sound like it, but may be true)
> > then you can perhaps get benefits by using roughly as many threads
> > and you have real cores available to compute.
>
> I agree with the idea that the objective, for rotating disk, should
> probably be to optimize use of the disk head's time. I disagree with
> the conclusion.
>
> There is no reason to expect the files to be laid out on disk in the
> order of requests. It is entirely possible that files N+2, N+3, and N+4
> are physically between files N and N+1 for some values of N. Either or
> both of the operating system or the disk drive may be optimizing the
> request order to reduce head movement. If the scheduling algorithm knows
> that all of N through N+4 are needed, it can stop the head at each
> track that has one of them and read it as the head is moving from N to
> N+1.

I must admit that I had forgotten that aspect of the situation.  And my advice 
should certainly be amended to include the possibility that using some extra 
threads might help.  Thank you.

But -- at the risk of sounding merely defensive -- I doubt whether it would 
work in practice.

Consider: would you choose the time when you've got a big disk operation 
running (copying a huge number of files say) to kick off a virus scan on the 
same spindle ?   I most certainly would not, perhaps your experience has been 
different.

The problem is that the analysis in terms of scattered disk blocks is 
unrealistic.  If the blocks of each file are actually randomised across the 
disk, then the analysis works.  But in that case a simple defrag seems to make 
more sense to me.   If, on the other hand, the block/s/ in most files are 
mostly contiguous, and each thread is processing those blocks mostly 
sequentially, then running even two threads will turn the fast sequential 
access pattern

    B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m

into something more like:

    B+0, C+0, B+1, C+1, ... B+n, C+m

which is a disaster.  Of course, if you are running 3 or more threads then the 
impact of the jumps might well be reduced, but even if we have, say:

    B+0, C+0, D+0 B+1, D+1, C+1, ... B+n, C+m, D+p

with C+k, B+k, and D+k ordered optimally, we still aren't going to get anything 
like the pure sequential performance of the non-threaded version.

Of course, my analysis also depends on assumptions about the actual files and 
their layout, but I don't think the assumptions are unreasonable.  In fact, in 
the absence of more specific data, I'd call 'em good ;-)

    -- chris

[toc] | [prev] | [next] | [standalone]

#19075

From	Robert Klemme <shortcutter@googlemail.com>
Date	2012-10-03 13:58 +0200
Message-ID	<ad2njtFh63kU1@mid.individual.net>
In reply to	#19073

On 03.10.2012 09:24, Chris Uppal wrote:

> I must admit that I had forgotten that aspect of the situation.

To me it seems there are a lot more "forgotten aspects"...

> Consider: would you choose the time when you've got a big disk operation
> running (copying a huge number of files say) to kick off a virus scan on the
> same spindle ?   I most certainly would not, perhaps your experience has been
> different.

File copying is only IO with negligible CPU, virus scanning only looks 
at portions of files.  We do not know whether that scenario only 
remotely resembles the problem the OP is trying to tackle.

> The problem is that the analysis in terms of scattered disk blocks is
> unrealistic.If the blocks of each file are actually randomised across the
> disk, then the analysis works.  But in that case a simple defrag seems to make
> more sense to me.

Not all file systems support online or offline defragmentation and we do 
not even yet know the file system.  Heck, files may actually reside on 
some type of network share or on a RAID storage with it's own caching 
and read strategies.  Also, since defragmentation usually works on a 
whole file system the cost overhead might not pay off at all.  Btw. 
another fact we do not know yet (as far as I can see) is whether this is 
a one off thing or the processing should be done repeatedly (in case of 
one off the whole discussion is superfluous as it costs more time than 
the overhead of a sub optimal IO and threading strategy).  It may also 
make sense to know how files get there (maybe it's even more efficient 
to fetch files in Java with a HTTP client from where they are taken and 
process them while downloading, i.e. without ever writing them to disk).

>  If, on the other hand, the block/s/ in most files are
> mostly contiguous, and each thread is processing those blocks mostly
> sequentially, then running even two threads will turn the fast sequential
> access pattern
>
>      B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m
>
> into something more like:
>
>      B+0, C+0, B+1, C+1, ... B+n, C+m
>
> which is a disaster.

We cannot know.  First of all we do not know the size of files, do we? 
So files might actually take up just one block.  Then, the operating 
system might actually be prefetching blocks of individual files when it 
detects the access pattern (reading in one go from head to tail) to fill 
the cache even before blocks are accessed.

Oh, and btw., we do not even know the read pattern, do we?  Are files 
read from beginning to end?  Are they accessed more in a random access 
fashion?  And we do not know the nature of the processing either.  At 
the moment we just know that it takes one to two seconds (on what 
hardware and OS?) - but we do not know whether that is because of CPU 
load or IO load etc.

> Of course, my analysis also depends on assumptions about the actual files and
> their layout, but I don't think the assumptions are unreasonable.  In fact, in
> the absence of more specific data, I'd call 'em good ;-)

That's a bold statement.  You call an analysis "good" which just fills 
in unmentioned assumptions for missing facts - a lot of missing facts.

Cheers

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

[toc] | [prev] | [next] | [standalone]

#19014

From	markspace <-@.>
Date	2012-10-01 09:35 -0700
Message-ID	<k4cgoa$uik$1@dont-email.me>
In reply to	#19009

On 10/1/2012 1:43 AM, Chris Uppal wrote:

>
> Your problem here is not threading, but disk IO.  Specifically disk seeks.  If
> you are using a rotating disk (as opposed to a SSD), and all the files are on
> the same spindle, then using > 1 thread will just slow things down as the
> different thread "fight" to position the disk heads over "their" files.
>

I gotta disagree also.  It's actually well known trick to "flood" a disk 
controller with requests, so that the disk controller can organize and 
optimize the head read/write access.  Especially if the files are small, 
it's possible the disk controller might be able to read multiple files 
in multiple blocks in a single revolution of the disk platter. You can't 
do that if you only open one file;  the disk controller would only read 
the one file in that case.  Multiple threads are probably the easiest 
way to do this.

Yes, if performance is critical, you might investigate a single thread 
and non-blocking IO.  Less the overhead of multiple threads, that might 
actually be faster.  But since IO is still the limiting factor, I think 
the speed increase would be small.

It also might be worth it for the OP to test their own performance. 
There's lots of ways here that we could be wrong, given unexpected 
system features.  SingleThreadedExecutor vs. CachedThreadPool should 
make this easy to test.

[toc] | [prev] | [next] | [standalone]

#19011

From	Eric Sosman <esosman@ieee-dot-org.invalid>
Date	2012-10-01 09:32 -0400
Message-ID	<k4c62a$o26$1@dont-email.me>
In reply to	#19007

On 10/1/2012 3:11 AM, leegee@gmail.com wrote:
> I have directory with many sub-directories, each with many thousands of files.
>
> I wish to process each file, which takes one or two seconds.

     "Many" sub-directories (let's say a hundred) times "many
thousands" of files each (let's say ten thousand) times "one or
two" seconds per file (let's say 1.5) -- Okay, that's about two
and a half weeks if you process them one at a time.

     If two and a half weeks is an acceptable amount of time, you
should probably turn your attention to things like checkpointing
of partial results, so that a power failure or crash on Day Nine
doesn't force you to restart from the very beginning.

> I wish to simultaneously process as many files as possible.

     "As many as possible" -- With what kind of equipment?  What's
"possible" for an Amazon data center might be beyond the reach
of an Amazon Kindle ...

> I'm not clear the best approach to this.
>
> Should I use a thread pool? Is Java capable of maximising the hardware resources to determine how many threads it can simultaneously execute?

     You should use a thousand machines, each doing a thousandth
of the overall job.  They'll finish in about twenty minutes.

     (If you want more definite answers, you'll need to provide a
more definite problem statement.)

-- 
Eric Sosman
esosman@ieee-dot-org.invalid

[toc] | [prev] | [next] | [standalone]

#19034

From	Kevin McMurtrie <mcmurtrie@pixelmemory.us>
Date	2012-10-01 20:11 -0700
Message-ID	<506a5b68$0$71178$742ec2ed@news.sonic.net>
In reply to	#19007

In article <051fc3d6-d22c-438a-b4d3-84378e447733@googlegroups.com>,
 leegee@gmail.com wrote:

> I have directory with many sub-directories, each with many thousands of 
> files.
> 
> I wish to process each file, which takes one or two seconds.
> 
> I wish to simultaneously process as many files as possible.
> 
> I'm not clear the best approach to this.
> 
> Should I use a thread pool? Is Java capable of maximising the hardware 
> resources to determine how many threads it can simultaneously execute?
> 
> TIA
> Lee

Your main thread creates several worker threads, recurses the directory 
structure, creates a task to process each file, and puts each task in a 
shared BlockingQueue having a bounded size.  When all files have been 
queued, a special END marker is placed on the end of the BlockingQueue 
and join() is called for each worker thread.

Each worker thread pulls one task from the shared BlockingQueue, 
executes it, and loops to pull another task.  When the END marker is 
seen by a thread, it is placed back into the BlockingQueue and the 
thread exits.

An alternative is to use a customized ThreadPoolExecutor of a fixed 
size.  Construct it with a bounded BlockingQueue and a rejection policy 
of ThreadPoolExecutor.CallerRunsPolicy.  ThreadPoolExecutor is loaded 
with quirky features (bugs) that can be a hassle so it's not my first 
choice for very specific tasks.

Runtime.availableProcessors() can help you guess how much concurrent CPU 
time is available.  The actual concurrency is a complex factor so a 
simpler solution is creating as many threads as you can without running 
out of memory or thrashing the filesystem.
-- 
I will not see posts from Google because I must filter them as spam

[toc] | [prev] | [standalone]

csiph-web

Threading model for reading 1,000 files quickly?

Contents

#19007 — Threading model for reading 1,000 files quickly?

#19009

#19010

#19073

#19075

#19014

#19011

#19034