Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!eternal-september.org!feeder.eternal-september.org!mx04.eternal-september.org!.POSTED!not-for-mail From: Joshua Cranmer Newsgroups: comp.lang.java.programmer Subject: Re: SortedMap question? Date: Mon, 26 Nov 2012 14:26:26 -0600 Organization: A noiseless patient Spider Lines: 43 Message-ID: References: <50b26f67$0$295$14726298@news.sunsite.dk> <50b28b1a$0$289$14726298@news.sunsite.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Injection-Date: Mon, 26 Nov 2012 20:26:43 +0000 (UTC) Injection-Info: mx04.eternal-september.org; posting-host="5a9707252ba5efb9bece56d1f4656a90"; logging-data="11776"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1//4bRqPSU63lKwcOhkEJOe6ZsG+cLHWEY=" User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Thunderbird/17.0 In-Reply-To: Cancel-Lock: sha1:+HBfbE0+WLW8ItzqlkakzfAsj9Y= Xref: csiph.com comp.lang.java.programmer:19984 On 11/26/2012 1:45 PM, Jim Janney wrote: > Eric Sosman writes: > >> Or look at String#indexOf(String). Some fast algorithms >> like Boyer-Moore have already been mentioned in this thread, yet >> Java's own implementation uses a naive method that is linear at >> best, super-linear if unlucky (I think it can be O(M*N), where M >> and N are the lengths of the two Strings). If the naive method >> were replaced by Boyer-Moore or Knuth-Morris-Pratt or some other >> better-O() algorithm, would you expect performance to improve? > > Boyer-Moore and its descendants invest more time up front building the > jump table, in return for faster searching once the table is built. > That pays off if you're searching for the same pattern across a large > volume of text. But for the common case where strings are relatively > short, I'd expect the brute-force (not naive) approach to be the better > choice. Well, it really pays off when the following conditions hold: 1. The string to search for is long. 2. The string to search for repeats a largeish prefix in itself later (for some of the later descendants) 3. You are likely to find *a lot* of partial matches of the string (so full backtracking measurably hurts). I suspect English words in English text don't satisfy these conditions very well, and I presume that (or similar scenarios) would explain a lot of String.indexOf searches. Where this would pay off would be something like trying to find DNA subsequences: you have a 1-in-4 chance of partially matching a given letter, your target string is likely to be lengthy and have lots of similarities within it, and the text being searched is extremely long. There are other penalties to advanced substring finding algorithms than just the build-the-lookup table penalty. The most important is that using the jump table has a minor microarchitectural penalty. Other penalties include increased amount of memory you need in the cache and increased surface for bugs. -- Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth