Path: csiph.com!usenet.pasdenom.info!gegeweb.org!eternal-september.org!feeder.eternal-september.org!mx04.eternal-september.org!.POSTED!not-for-mail From: markspace <-@.> Newsgroups: comp.lang.java.programmer Subject: Re: Patricia trie vs binary search. Date: Thu, 24 May 2012 17:56:14 -0700 Organization: A noiseless patient Spider Lines: 38 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Injection-Date: Fri, 25 May 2012 00:56:15 +0000 (UTC) Injection-Info: mx04.eternal-september.org; posting-host="2kn9RzOWSe/v/hLnHgGT4Q"; logging-data="10360"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18kzh+PT6Snlt1WsXnbo+lXPP8LpuSlQLw=" User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20120428 Thunderbird/12.0.1 In-Reply-To: Cancel-Lock: sha1:5zZpQeNNLLLI11n66KN1hd+vL4M= Xref: csiph.com comp.lang.java.programmer:14787 On 5/24/2012 4:39 PM, glen herrmannsfeldt wrote: > markspace<-@.> wrote: >> For some reason I was thinking about sub-string searches today. I >> looked up Patricia tries (a kind of radix tree) to see if they would >> help. While interesting, the radix tree seems to have a lot of overhead >> for large numbers of entries. > > I am not so sure which substring problem you are interested in, > but you might look at the Aho-Korasick algorithm. > > If you want to allow for insertions and/or deletions, look > at dynamic programming. Thanks for those pointers (pun not intended). The "sub-string search" was referring to was picking words out of a longer, undelimited string. The example I posted looked for bad passwords within a larger password. "passxword" contains at least two bad passwords, pass and word, according to the dictionary file I download from here (that's the bpw.txt file in the sample code): The code I posted find pass, ass, word, and or in the string "passxword", which are all bad passwords according to that link above. (I deliberately avoid reporting single letters as bad). Practically speaking, hashing appears to be just as fast for a lookup, but hashing doesn't tell you when to stop, so it's always N^2 for this sort of problem. My algorithm should average less, hopefully much less if you have very long input strings. I really don't know what set me off looking at this; it's just a little random experiment I suppose.