| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: Threading model for reading 1,000 files quickly? |
| Date | 2012-10-03 13:58 +0200 |
| Message-ID | <ad2njtFh63kU1@mid.individual.net> |
| References | <051fc3d6-d22c-438a-b4d3-84378e447733@googlegroups.com> <K-SdnU_ujNRnyvTNnZ2dnUVZ7q8AAAAA@bt.com> <1YednWK6TvBYGPTNnZ2dnUVZ_uudnZ2d@earthlink.com> <jJmdnZ-kl6IJdfbNnZ2dnUVZ7rCdnZ2d@bt.com> |
On 03.10.2012 09:24, Chris Uppal wrote:

> I must admit that I had forgotten that aspect of the situation.

To me it seems there are a lot more "forgotten aspects"...

> Consider: would you choose the time when you've got a big disk operation
> running (copying a huge number of files say) to kick off a virus scan on the
> same spindle ? I most certainly would not, perhaps your experience has been
> different.

File copying is pure IO with negligible CPU, and virus scanning only looks at portions of files. We do not know whether that scenario even remotely resembles the problem the OP is trying to tackle.

> The problem is that the analysis in terms of scattered disk blocks is
> unrealistic.  If the blocks of each file are actually randomised across the
> disk, then the analysis works. But in that case a simple defrag seems to make
> more sense to me.

Not all file systems support online or offline defragmentation, and we do not even know the file system yet. Heck, the files may actually reside on some type of network share or on RAID storage with its own caching and read strategies. Also, since defragmentation usually works on a whole file system, the overhead might not pay off at all.

Btw., another fact we do not know yet (as far as I can see) is whether this is a one-off thing or whether the processing should be done repeatedly. In case of a one-off the whole discussion is superfluous, as it costs more time than the overhead of a suboptimal IO and threading strategy. It may also make sense to know how the files get there. Maybe it is even more efficient to fetch the files in Java with an HTTP client from where they are taken and process them while downloading, i.e. without ever writing them to disk.

> If, on the other hand, the block/s/ in most files are
> mostly contiguous, and each thread is processing those blocks mostly
> sequentially, then running even two threads will turn the fast sequential
> access pattern
>
> B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m
>
> into something more like:
>
> B+0, C+0, B+1, C+1, ... B+n, C+m
>
> which is a disaster.

We cannot know. First of all, we do not know the size of the files, do we? So files might actually take up just one block. Then, the operating system might actually be prefetching blocks of individual files when it detects the access pattern (reading in one go from head to tail) to fill the cache even before the blocks are accessed. Oh, and btw., we do not even know the read pattern, do we? Are files read from beginning to end? Are they accessed more in a random access fashion?

And we do not know the nature of the processing either. At the moment we just know that it takes one to two seconds (on what hardware and OS?), but we do not know whether that is because of CPU load or IO load etc.

> Of course, my analysis also depends on assumptions about the actual files and
> their layout, but I don't think the assumptions are unreasonable. In fact, in
> the absence of more specific data, I'd call 'em good ;-)

That's a bold statement. You call an analysis "good" which just fills in unmentioned assumptions for missing facts - a lot of missing facts.

Cheers

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
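Since the post's objection is that the discussion runs on assumptions rather than facts, one way to get some facts is to time the reads with different thread counts on the actual machine and file system. The following is only a minimal sketch and was not posted in the thread. The class name ReadTiming, the 64 KiB buffer, the thread counts 1, 2 and 4, and the choice to read each file from head to tail are all assumptions made for illustration. It also measures IO only, not the OP's per-file processing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of the original thread.
class ReadTiming {

    /** Reads one file sequentially from head to tail; returns the byte count. */
    static long readFully(Path file) throws IOException {
        byte[] buffer = new byte[64 * 1024]; // assumed buffer size
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    /** Reads all files with a fixed pool of the given size; returns total bytes. */
    static long readAll(List<Path> files, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path f : files) {
                Callable<Long> task = () -> readFully(f);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> r : results) {
                total += r.get(); // rethrows any read failure as ExecutionException
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the directory holding the files to read.
        List<Path> files;
        try (Stream<Path> s = Files.list(Paths.get(args[0]))) {
            files = s.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (int threads : new int[] {1, 2, 4}) { // assumed thread counts
            long start = System.nanoTime();
            long bytes = readAll(files, threads);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + bytes
                    + " bytes in " + millis + " ms");
        }
    }
}
```

The numbers from such a run need careful reading: after the first pass the operating system may serve the files from its cache, so later passes can look faster for reasons that have nothing to do with the thread count. Varying the order of the runs, or using a fresh set of files per run, would help separate the two effects.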
Threading model for reading 1,000 files quickly? leegee@gmail.com - 2012-10-01 00:11 -0700
Re: Threading model for reading 1,000 files quickly? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-01 09:43 +0100
Re: Threading model for reading 1,000 files quickly? Patricia Shanahan <pats@acm.org> - 2012-10-01 05:00 -0700
Re: Threading model for reading 1,000 files quickly? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-03 08:24 +0100
Re: Threading model for reading 1,000 files quickly? Robert Klemme <shortcutter@googlemail.com> - 2012-10-03 13:58 +0200
Re: Threading model for reading 1,000 files quickly? markspace <-@.> - 2012-10-01 09:35 -0700
Re: Threading model for reading 1,000 files quickly? Eric Sosman <esosman@ieee-dot-org.invalid> - 2012-10-01 09:32 -0400
Re: Threading model for reading 1,000 files quickly? Kevin McMurtrie <mcmurtrie@pixelmemory.us> - 2012-10-01 20:11 -0700