Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!eternal-september.org!feeder.eternal-september.org!mx04.eternal-september.org!.POSTED!not-for-mail From: Eric Sosman Newsgroups: comp.lang.java.programmer Subject: Re: The first 10 files Date: Sat, 26 Jan 2013 20:42:16 -0500 Organization: A noiseless patient Spider Lines: 91 Message-ID: References: <51041ff8$0$284$14726298@news.sunsite.dk> <1iop8bl8ysrfg$.rdxcxhgxuj1r$.dlg@40tude.net> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Injection-Date: Sun, 27 Jan 2013 01:42:16 +0000 (UTC) Injection-Info: mx04.eternal-september.org; posting-host="ffb8f7085759b339c1002252b48331a4"; logging-data="29217"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/koNMY6c0UF5x4Oy1BWpEz" User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 In-Reply-To: <1iop8bl8ysrfg$.rdxcxhgxuj1r$.dlg@40tude.net> Cancel-Lock: sha1:2Ryh8qP6E1wElN+zgbQSrHgSD/g= Xref: csiph.com comp.lang.java.programmer:21764 On 1/26/2013 6:21 PM, Peter Duniho wrote: > On Sat, 26 Jan 2013 17:06:07 -0500, Eric Sosman wrote: > >> On 1/26/2013 4:15 PM, Robert Klemme wrote: >>> On 26.01.2013 19:26, Arne Vajhøj wrote: >>> >>>> But I am a bit skeptical about whether a String[] with 30K elements >>>> is really the bottleneck. >>>> >>>> If the real bottleneck is the OS calls to get next file, then >>>> a filter like this will not help. >>> >>> Why? >> >> Because the listFiles() method will fetch the information >> for all 30K files from the O/S, will construct 30K File objects >> to represent them, and will submit all 30K File objects to the >> FileFilter, one by one. The FileFilter will (very quickly) >> reject 29.99K of the 30K Files, but ... > > Will it? Necessarily. As far as listFiles() knows, the FileFilter might accept the very last File object given to it. Therefore, listFiles() cannot fail to present that very last File -- and every other File -- for inspection. > It is plausible that the implementation of listFiles() uses an OS API that > enumerates files one at a time. On Windows, getting the first file of the > enumeration is faster than asking for all the files at once. Meh. > Indeed, I suppose one could throw an exception from the FileFilter accept() > method to interrupt enumeration, if that's how listFiles() is implemented. > That would avoid the need to enumerate more than the needed number of > actual files. It would also avoid the burden of returning anything from listFiles() -- like, say, the array of accepted files ... A seriously hackish approach might be to do the processing of the files within the FileFilter itself, treating it as a "visit this File" callback instead of as a predicate. Then if the FileFilter threw an exception after processing the first N files -- well, they'd already have been processed, and you were going to ignore the listFiles() return value anyhow, so ... But, as I said, that's pretty seriously hackish. > Of course, this is all implementation-dependent and since it's not > explicitly documented, could change at any time anyway. The performance implications of retrieving information on 30K files from the O/S are undocumented, true. But the necessity of retrieving that information is deducible from what *is* documented. > But unless you've > actually examined the implementation details for listFiles(), it's not a > foregone conclusion that the technique of using a FileFilter offers no way > to improve latency. Maybe this is the disconnect: I understood the O.P.'s concern as "It's doing three thousand times too much work," not as "It takes three thousand times as long as it should just to get to the first File instance." Either way, though, I think a FileFilter (used in a non-hackish way) cannot reduce either the total work or the latency. Observe that listFiles() cannot return anything at all until it has built the entire array of accepted files; Java's arrays have no way to say "I hold five elements now, but might grow." > All that said, I think John Matthews' comment about the question of what > 30K files are doing in a single directory in the first place is perhaps one > of the more useful points in this topic. One doesn't always have control > over that, of course...but if one does, it's certainly worth rethinking > that aspect of the design. There are reasons other than code latency to > avoid so many files in a single directory. Yeah. The O.P. said something about external processes dumping files into the directory, possibly dumping many between (widely- spaced?) executions of his program. That seems odd to me, though, because if there's a backlog of thirty thousand it seems odd to want to reduce it by only ten ... If he's stuck with this overall design, though, I think the walkFileTree() method of java.nio.file.Files would be a cleaner way to proceed. His FileVisitor could return FileVisitResult.TERMINATE after it had seen ten files, and that would be that. No hacks. -- Eric Sosman esosman@comcast-dot-net.invalid