Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!eternal-september.org!feeder.eternal-september.org!mx04.eternal-september.org!.POSTED!not-for-mail
From: Joshua Cranmer <Pidgeot18@verizon.invalid>
Newsgroups: comp.lang.java.programmer
Subject: Re: SortedMap question?
Date: Mon, 26 Nov 2012 14:26:26 -0600
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <k90ja3$bg0$1@dont-email.me>
References: <k8rkud$i3o$1@dont-email.me> <q4qoyde15lpk$.13amx8nbfrj28.dlg@40tude.net> <k8t8dp$tv5$1@dont-email.me> <50b26f67$0$295$14726298@news.sunsite.dk> <k8tsfj$nn5$1@dont-email.me> <50b28b1a$0$289$14726298@news.sunsite.dk> <k8u6fq$i0q$1@dont-email.me> <ydnvccs87wo.fsf@shell.xmission.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 26 Nov 2012 20:26:43 +0000 (UTC)
Injection-Info: mx04.eternal-september.org; posting-host="5a9707252ba5efb9bece56d1f4656a90"; logging-data="11776"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1//4bRqPSU63lKwcOhkEJOe6ZsG+cLHWEY="
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Thunderbird/17.0
In-Reply-To: <ydnvccs87wo.fsf@shell.xmission.com>
Cancel-Lock: sha1:+HBfbE0+WLW8ItzqlkakzfAsj9Y=
Xref: csiph.com comp.lang.java.programmer:19984

On 11/26/2012 1:45 PM, Jim Janney wrote:
> Eric Sosman <esosman@comcast-dot-net.invalid> writes:
>
>>      Or look at String#indexOf(String).  Some fast algorithms
>> like Boyer-Moore have already been mentioned in this thread, yet
>> Java's own implementation uses a naive method that is linear at
>> best, super-linear if unlucky (I think it can be O(M*N), where M
>> and N are the lengths of the two Strings).  If the naive method
>> were replaced by Boyer-Moore or Knuth-Morris-Pratt or some other
>> better-O() algorithm, would you expect performance to improve?
>
> Boyer-Moore and its descendants invest more time up front building the
> jump table, in return for faster searching once the table is built.
> That pays off if you're searching for the same pattern across a large
> volume of text.  But for the common case where strings are relatively
> short, I'd expect the brute-force (not naive) approach to be the better
> choice.

Well, it really pays off when the following conditions hold:
1. The string to search for is long.
2. The string to search for repeats a largeish prefix in itself later 
(for some of the later descendants)
3. You are likely to find *a lot* of partial matches of the string (so 
full backtracking measurably hurts).

I suspect English words in English text don't satisfy these conditions 
very well, and I presume that (or similar scenarios) would explain a lot 
of String.indexOf searches.

Where this would pay off would be something like trying to find DNA 
subsequences: you have a 1-in-4 chance of partially matching a given 
letter, your target string is likely to be lengthy and have lots of 
similarities within it, and the text being searched is extremely long.

There are other penalties to advanced substring finding algorithms than 
just the build-the-lookup table penalty. The most important is that 
using the jump table has a minor microarchitectural penalty. Other 
penalties include increased amount of memory you need in the cache and 
increased surface for bugs.

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth