
Re: Threading model for reading 1,000 files quickly?

From	Robert Klemme <shortcutter@googlemail.com>
Newsgroups	comp.lang.java.programmer
Subject	Re: Threading model for reading 1,000 files quickly?
Date	2012-10-03 13:58 +0200
Message-ID	<ad2njtFh63kU1@mid.individual.net>
References	<051fc3d6-d22c-438a-b4d3-84378e447733@googlegroups.com> <K-SdnU_ujNRnyvTNnZ2dnUVZ7q8AAAAA@bt.com> <1YednWK6TvBYGPTNnZ2dnUVZ_uudnZ2d@earthlink.com> <jJmdnZ-kl6IJdfbNnZ2dnUVZ7rCdnZ2d@bt.com>


On 03.10.2012 09:24, Chris Uppal wrote:

> I must admit that I had forgotten that aspect of the situation.

To me it seems there are a lot more "forgotten aspects"...

> Consider: would you choose the time when you've got a big disk operation
> running (copying a huge number of files say) to kick off a virus scan on the
> same spindle?  I most certainly would not, perhaps your experience has been
> different.

File copying is pure IO with negligible CPU load, while virus scanning
only looks at portions of files.  We do not know whether that scenario
even remotely resembles the problem the OP is trying to tackle.

> The problem is that the analysis in terms of scattered disk blocks is
> unrealistic.  If the blocks of each file are actually randomised across the
> disk, then the analysis works.  But in that case a simple defrag seems to make
> more sense to me.

Not all file systems support online or offline defragmentation, and we
do not even know the file system yet.  Heck, the files may actually
reside on some type of network share or on RAID storage with its own
caching and read strategies.  Also, since defragmentation usually works
on a whole file system, the cost might not pay off at all.  Btw,
another fact we do not know yet (as far as I can see) is whether this is
a one-off thing or whether the processing should be done repeatedly (if
it is a one-off, the whole discussion is superfluous, as it costs more
time than the overhead of a suboptimal IO and threading strategy).  It
may also help to know how the files get there (maybe it's even more
efficient to fetch the files in Java with an HTTP client from wherever
they originate and process them while downloading, i.e. without ever
writing them to disk).
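
To illustrate the "process while downloading" idea: a minimal sketch,
assuming the per-file work can be done line by line on a stream.  The
class name and the line counting are hypothetical stand-ins, since we do
not know what the OP's processing actually does.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class StreamingProcessor {

    // Hypothetical processing step: counts non-blank lines as the data
    // arrives. It stands in for the real (unknown) per-file work.
    static long countNonEmptyLines(InputStream in) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // In-memory data keeps the demo self-contained; no network needed.
        byte[] sample = "first\n\nsecond\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countNonEmptyLines(new ByteArrayInputStream(sample)));
    }
}
```

Because the method only sees an InputStream, it should work the same
with a stream from java.net.URL.openStream() or from a FileInputStream,
so the data never has to hit the local disk first.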

>  If, on the other hand, the block/s/ in most files are
> mostly contiguous, and each thread is processing those blocks mostly
> sequentially, then running even two threads will turn the fast sequential
> access pattern
>
>      B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m
>
> into something more like:
>
>      B+0, C+0, B+1, C+1, ... B+n, C+m
>
> which is a disaster.

We cannot know.  First of all we do not know the size of files, do we? 
So files might actually take up just one block.  Then, the operating 
system might actually be prefetching blocks of individual files when it 
detects the access pattern (reading in one go from head to tail) to fill 
the cache even before blocks are accessed.
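
Since we cannot know, measuring beats guessing.  A rough sketch,
assuming the files are small enough to read whole; the class and method
names are made up for illustration.  It reads the same file list with a
configurable number of threads so the timings can be compared.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ReadBenchmark {

    // Reads every file using a fixed pool of the given size and
    // returns the total number of bytes read.
    static long readAll(List<Path> files, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (final Path file : files) {
                results.add(pool.submit(new Callable<Long>() {
                    @Override
                    public Long call() throws IOException {
                        return (long) Files.readAllBytes(file).length;
                    }
                }));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // Wall-clock milliseconds for one full pass with the given thread count.
    static long timeMillis(List<Path> files, int threads) throws Exception {
        long start = System.nanoTime();
        readAll(files, threads);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // File names come from the command line.
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        for (int threads : new int[] {1, 2, 4, 8}) {
            System.out.println(threads + " thread(s): "
                    + timeMillis(files, threads) + " ms");
        }
    }
}
```

Caveat: after the first pass the OS file cache will likely serve most
of the data, so later passes say little about the disk.  Each thread
count needs a cold cache (or a different set of files) for the numbers
to mean anything.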

Oh, and btw, we do not even know the read pattern, do we?  Are files
read from beginning to end?  Or are they accessed in a more random
fashion?  And we do not know the nature of the processing either.  At
the moment we just know that it takes one to two seconds (on what 
hardware and OS?) - but we do not know whether that is because of CPU 
load or IO load etc.
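
One way to find out where the one to two seconds go is to time the read
and the processing separately.  A minimal sketch, assuming the file can
be read fully before processing starts; process() is a hypothetical
placeholder (a byte sum) for the OP's real work.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PhaseTimer {

    // Hypothetical stand-in for the real processing: sums all bytes
    // as unsigned values.
    static long process(byte[] data) {
        long sum = 0;
        for (byte b : data) {
            sum += b & 0xFF;
        }
        return sum;
    }

    // Returns { read time in ns, processing time in ns, processing result }.
    static long[] measure(Path file) throws IOException {
        long t0 = System.nanoTime();
        byte[] data = Files.readAllBytes(file);
        long t1 = System.nanoTime();
        long result = process(data);
        long t2 = System.nanoTime();
        return new long[] {t1 - t0, t2 - t1, result};
    }

    public static void main(String[] args) throws IOException {
        // File names come from the command line.
        for (String arg : args) {
            long[] r = measure(Paths.get(arg));
            System.out.println(arg + ": read " + r[0] + " ns, process "
                    + r[1] + " ns");
        }
    }
}
```

If the read phase dominates, more threads on a single spindle are
unlikely to help much; if the processing dominates, a pool sized to the
number of cores is the more promising direction.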

> Of course, my analysis also depends on assumptions about the actual files and
> their layout, but I don't think the assumptions are unreasonable.  In fact, in
> the absence of more specific data, I'd call 'em good ;-)

That's a bold statement.  You call an analysis "good" that merely
substitutes unstated assumptions for missing facts - a lot of missing
facts.

Cheers

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Thread

Threading model for reading 1,000 files quickly? leegee@gmail.com - 2012-10-01 00:11 -0700
  Re: Threading model for reading 1,000 files quickly? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-01 09:43 +0100
    Re: Threading model for reading 1,000 files quickly? Patricia Shanahan <pats@acm.org> - 2012-10-01 05:00 -0700
      Re: Threading model for reading 1,000 files quickly? "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-03 08:24 +0100
        Re: Threading model for reading 1,000 files quickly? Robert Klemme <shortcutter@googlemail.com> - 2012-10-03 13:58 +0200
    Re: Threading model for reading 1,000 files quickly? markspace <-@.> - 2012-10-01 09:35 -0700
  Re: Threading model for reading 1,000 files quickly? Eric Sosman <esosman@ieee-dot-org.invalid> - 2012-10-01 09:32 -0400
  Re: Threading model for reading 1,000 files quickly? Kevin McMurtrie <mcmurtrie@pixelmemory.us> - 2012-10-01 20:11 -0700
