Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #19007 > unrolled thread
| Started by | leegee@gmail.com |
|---|---|
| First post | 2012-10-01 00:11 -0700 |
| Last post | 2012-10-01 20:11 -0700 |
| Articles | 8 — 7 participants |
Back to article view | Back to comp.lang.java.programmer
Threading model for reading 1,000 files quickly? leegee@gmail.com - 2012-10-01 00:11 -0700
Re: Threading model for reading 1,000 files quickly? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-01 09:43 +0100
Re: Threading model for reading 1,000 files quickly? Patricia Shanahan <pats@acm.org> - 2012-10-01 05:00 -0700
Re: Threading model for reading 1,000 files quickly? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-03 08:24 +0100
Re: Threading model for reading 1,000 files quickly? Robert Klemme <shortcutter@googlemail.com> - 2012-10-03 13:58 +0200
Re: Threading model for reading 1,000 files quickly? markspace <-@.> - 2012-10-01 09:35 -0700
Re: Threading model for reading 1,000 files quickly? Eric Sosman <esosman@ieee-dot-org.invalid> - 2012-10-01 09:32 -0400
Re: Threading model for reading 1,000 files quickly? Kevin McMurtrie <mcmurtrie@pixelmemory.us> - 2012-10-01 20:11 -0700
| From | leegee@gmail.com |
|---|---|
| Date | 2012-10-01 00:11 -0700 |
| Subject | Threading model for reading 1,000 files quickly? |
| Message-ID | <051fc3d6-d22c-438a-b4d3-84378e447733@googlegroups.com> |
I have directory with many sub-directories, each with many thousands of files. I wish to process each file, which takes one or two seconds. I wish to simultaneously process as many files as possible. I'm not clear the best approach to this. Should I use a thread pool? Is Java capable of maximising the hardware resources to determine how many threads it can simultaneously execute? TIA Lee
[toc] | [next] | [standalone]
| From | "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> |
|---|---|
| Date | 2012-10-01 09:43 +0100 |
| Message-ID | <K-SdnU_ujNRnyvTNnZ2dnUVZ7q8AAAAA@bt.com> |
| In reply to | #19007 |
leegee@gmail.com wrote:
> I have directory with many sub-directories, each with many thousands of
> files.
>
> I wish to process each file, which takes one or two seconds.
>
> I wish to simultaneously process as many files as possible.
Your problem here is not threading, but disk IO. Specifically disk seeks. If
you are using a rotating disk (as opposed to a SSD), and all the files are on
the same spindle, then using > 1 thread will just slow things down as the
different thread "fight" to position the disk heads over "their" files.
If you are using more than one spindle (say in a RAID array) then you may find
benefits in using a similar number of threads.
If the processing is CPU bound rather than IO bound when you are processing
just one file (doesn't sound like it, but may be true) then you can perhaps get
benefits by using roughly as many threads and you have real cores available to
compute.
-- chris
[toc] | [prev] | [next] | [standalone]
| From | Patricia Shanahan <pats@acm.org> |
|---|---|
| Date | 2012-10-01 05:00 -0700 |
| Message-ID | <1YednWK6TvBYGPTNnZ2dnUVZ_uudnZ2d@earthlink.com> |
| In reply to | #19009 |
On 10/1/2012 1:43 AM, Chris Uppal wrote: > leegee@gmail.com wrote: >> I have directory with many sub-directories, each with many thousands of >> files. >> >> I wish to process each file, which takes one or two seconds. >> >> I wish to simultaneously process as many files as possible. > > Your problem here is not threading, but disk IO. Specifically disk seeks. If > you are using a rotating disk (as opposed to a SSD), and all the files are on > the same spindle, then using > 1 thread will just slow things down as the > different thread "fight" to position the disk heads over "their" files. > > If you are using more than one spindle (say in a RAID array) then you > may find benefits in using a similar number of threads. > > If the processing is CPU bound rather than IO bound when you are > processing just one file (doesn't sound like it, but may be true) > then you can perhaps get benefits by using roughly as many threads > and you have real cores available to compute. I agree with the idea that the objective, for rotating disk, should probably be to optimize use of the disk head's time. I disagree with the conclusion. There is no reason to expect the files to be laid out on disk in the order of requests. It is entirely possible that files N+2, N+3, and N+4 are physically between files N and N+1 for some values of N. Either or both of the operating system or the disk drive may be optimizing the request order to reduce head movement. If the scheduling algorithm knows that all of N through N+4 are needed, it can stop the head at each track that has one of them and read it as the head is moving from N to N+1. If you feed the requests to the operating system one at a time, and wait for each to finish, the disk head will be forced to do the reads in First-Come-First-Served order, regardless of disk placement. That will probably not be the optimal order. If you have too many requests outstanding there is a risk of overloading the operating system's buffering. I would suggest either using asynchronous I/O or a thread pool, so that the number of outstanding requests can be tuned based on measurements. I will be surprised of the optimal queue length is one. Patricia
[toc] | [prev] | [next] | [standalone]
| From | "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> |
|---|---|
| Date | 2012-10-03 08:24 +0100 |
| Message-ID | <jJmdnZ-kl6IJdfbNnZ2dnUVZ7rCdnZ2d@bt.com> |
| In reply to | #19010 |
Patricia Shanahan wrote:
> > [me]
> > If you are using more than one spindle (say in a RAID array) then you
> > may find benefits in using a similar number of threads.
> >
> > If the processing is CPU bound rather than IO bound when you are
> > processing just one file (doesn't sound like it, but may be true)
> > then you can perhaps get benefits by using roughly as many threads
> > and you have real cores available to compute.
>
> I agree with the idea that the objective, for rotating disk, should
> probably be to optimize use of the disk head's time. I disagree with
> the conclusion.
>
> There is no reason to expect the files to be laid out on disk in the
> order of requests. It is entirely possible that files N+2, N+3, and N+4
> are physically between files N and N+1 for some values of N. Either or
> both of the operating system or the disk drive may be optimizing the
> request order to reduce head movement. If the scheduling algorithm knows
> that all of N through N+4 are needed, it can stop the head at each
> track that has one of them and read it as the head is moving from N to
> N+1.
I must admit that I had forgotten that aspect of the situation. And my advice
should certainly be amended to include the possibility that using some extra
threads might help. Thank you.
But -- at the risk of sounding merely defensive -- I doubt whether it would
work in practice.
Consider: would you choose the time when you've got a big disk operation
running (copying a huge number of files say) to kick off a virus scan on the
same spindle ? I most certainly would not, perhaps your experience has been
different.
The problem is that the analysis in terms of scattered disk blocks is
unrealistic. If the blocks of each file are actually randomised across the
disk, then the analysis works. But in that case a simple defrag seems to make
more sense to me. If, on the other hand, the block/s/ in most files are
mostly contiguous, and each thread is processing those blocks mostly
sequentially, then running even two threads will turn the fast sequential
access pattern
B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m
into something more like:
B+0, C+0, B+1, C+1, ... B+n, C+m
which is a disaster. Of course, if you are running 3 or more threads then the
impact of the jumps might well be reduced, but even if we have, say:
B+0, C+0, D+0 B+1, D+1, C+1, ... B+n, C+m, D+p
with C+k, B+k, and D+k ordered optimally, we still aren't going to get anything
like the pure sequential performance of the non-threaded version.
Of course, my analysis also depends on assumptions about the actual files and
their layout, but I don't think the assumptions are unreasonable. In fact, in
the absence of more specific data, I'd call 'em good ;-)
-- chris
[toc] | [prev] | [next] | [standalone]
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2012-10-03 13:58 +0200 |
| Message-ID | <ad2njtFh63kU1@mid.individual.net> |
| In reply to | #19073 |
On 03.10.2012 09:24, Chris Uppal wrote: > I must admit that I had forgotten that aspect of the situation. To me it seems there are a lot more "forgotten aspects"... > Consider: would you choose the time when you've got a big disk operation > running (copying a huge number of files say) to kick off a virus scan on the > same spindle ? I most certainly would not, perhaps your experience has been > different. File copying is only IO with negligible CPU, virus scanning only looks at portions of files. We do not know whether that scenario only remotely resembles the problem the OP is trying to tackle. > The problem is that the analysis in terms of scattered disk blocks is > unrealistic.If the blocks of each file are actually randomised across the > disk, then the analysis works. But in that case a simple defrag seems to make > more sense to me. Not all file systems support online or offline defragmentation and we do not even yet know the file system. Heck, files may actually reside on some type of network share or on a RAID storage with it's own caching and read strategies. Also, since defragmentation usually works on a whole file system the cost overhead might not pay off at all. Btw. another fact we do not know yet (as far as I can see) is whether this is a one off thing or the processing should be done repeatedly (in case of one off the whole discussion is superfluous as it costs more time than the overhead of a sub optimal IO and threading strategy). It may also make sense to know how files get there (maybe it's even more efficient to fetch files in Java with a HTTP client from where they are taken and process them while downloading, i.e. without ever writing them to disk). > If, on the other hand, the block/s/ in most files are > mostly contiguous, and each thread is processing those blocks mostly > sequentially, then running even two threads will turn the fast sequential > access pattern > > B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m > > into something more like: > > B+0, C+0, B+1, C+1, ... B+n, C+m > > which is a disaster. We cannot know. First of all we do not know the size of files, do we? So files might actually take up just one block. Then, the operating system might actually be prefetching blocks of individual files when it detects the access pattern (reading in one go from head to tail) to fill the cache even before blocks are accessed. Oh, and btw., we do not even know the read pattern, do we? Are files read from beginning to end? Are they accessed more in a random access fashion? And we do not know the nature of the processing either. At the moment we just know that it takes one to two seconds (on what hardware and OS?) - but we do not know whether that is because of CPU load or IO load etc. > Of course, my analysis also depends on assumptions about the actual files and > their layout, but I don't think the assumptions are unreasonable. In fact, in > the absence of more specific data, I'd call 'em good ;-) That's a bold statement. You call an analysis "good" which just fills in unmentioned assumptions for missing facts - a lot of missing facts. Cheers robert -- remember.guy do |as, often| as.you_can - without end http://blog.rubybestpractices.com/
[toc] | [prev] | [next] | [standalone]
| From | markspace <-@.> |
|---|---|
| Date | 2012-10-01 09:35 -0700 |
| Message-ID | <k4cgoa$uik$1@dont-email.me> |
| In reply to | #19009 |
On 10/1/2012 1:43 AM, Chris Uppal wrote: > > Your problem here is not threading, but disk IO. Specifically disk seeks. If > you are using a rotating disk (as opposed to a SSD), and all the files are on > the same spindle, then using > 1 thread will just slow things down as the > different thread "fight" to position the disk heads over "their" files. > I gotta disagree also. It's actually well known trick to "flood" a disk controller with requests, so that the disk controller can organize and optimize the head read/write access. Especially if the files are small, it's possible the disk controller might be able to read multiple files in multiple blocks in a single revolution of the disk platter. You can't do that if you only open one file; the disk controller would only read the one file in that case. Multiple threads are probably the easiest way to do this. Yes, if performance is critical, you might investigate a single thread and non-blocking IO. Less the overhead of multiple threads, that might actually be faster. But since IO is still the limiting factor, I think the speed increase would be small. It also might be worth it for the OP to test their own performance. There's lots of ways here that we could be wrong, given unexpected system features. SingleThreadedExecutor vs. CachedThreadPool should make this easy to test.
[toc] | [prev] | [next] | [standalone]
| From | Eric Sosman <esosman@ieee-dot-org.invalid> |
|---|---|
| Date | 2012-10-01 09:32 -0400 |
| Message-ID | <k4c62a$o26$1@dont-email.me> |
| In reply to | #19007 |
On 10/1/2012 3:11 AM, leegee@gmail.com wrote:
> I have directory with many sub-directories, each with many thousands of files.
>
> I wish to process each file, which takes one or two seconds.
"Many" sub-directories (let's say a hundred) times "many
thousands" of files each (let's say ten thousand) times "one or
two" seconds per file (let's say 1.5) -- Okay, that's about two
and a half weeks if you process them one at a time.
If two and a half weeks is an acceptable amount of time, you
should probably turn your attention to things like checkpointing
of partial results, so that a power failure or crash on Day Nine
doesn't force you to restart from the very beginning.
> I wish to simultaneously process as many files as possible.
"As many as possible" -- With what kind of equipment? What's
"possible" for an Amazon data center might be beyond the reach
of an Amazon Kindle ...
> I'm not clear the best approach to this.
>
> Should I use a thread pool? Is Java capable of maximising the hardware resources to determine how many threads it can simultaneously execute?
You should use a thousand machines, each doing a thousandth
of the overall job. They'll finish in about twenty minutes.
(If you want more definite answers, you'll need to provide a
more definite problem statement.)
--
Eric Sosman
esosman@ieee-dot-org.invalid
[toc] | [prev] | [next] | [standalone]
| From | Kevin McMurtrie <mcmurtrie@pixelmemory.us> |
|---|---|
| Date | 2012-10-01 20:11 -0700 |
| Message-ID | <506a5b68$0$71178$742ec2ed@news.sonic.net> |
| In reply to | #19007 |
In article <051fc3d6-d22c-438a-b4d3-84378e447733@googlegroups.com>, leegee@gmail.com wrote: > I have directory with many sub-directories, each with many thousands of > files. > > I wish to process each file, which takes one or two seconds. > > I wish to simultaneously process as many files as possible. > > I'm not clear the best approach to this. > > Should I use a thread pool? Is Java capable of maximising the hardware > resources to determine how many threads it can simultaneously execute? > > TIA > Lee Your main thread creates several worker threads, recurses the directory structure, creates a task to process each file, and puts each task in a shared BlockingQueue having a bounded size. When all files have been queued, a special END marker is placed on the end of the BlockingQueue and join() is called for each worker thread. Each worker thread pulls one task from the shared BlockingQueue, executes it, and loops to pull another task. When the END marker is seen by a thread, it is placed back into the BlockingQueue and the thread exits. An alternative is to use a customized ThreadPoolExecutor of a fixed size. Construct it with a bounded BlockingQueue and a rejection policy of ThreadPoolExecutor.CallerRunsPolicy. ThreadPoolExecutor is loaded with quirky features (bugs) that can be a hassle so it's not my first choice for very specific tasks. Runtime.availableProcessors() can help you guess how much concurrent CPU time is available. The actual concurrency is a complex factor so a simpler solution is creating as many threads as you can without running out of memory or thrashing the filesystem. -- I will not see posts from Google because I must filter them as spam
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.java.programmer
csiph-web