Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!storethat.news.telefonica.de!telefonica.de!news-1.dfn.de!news.dfn.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail From: Robert Klemme Newsgroups: comp.lang.java.programmer Subject: Re: optimsed HashMap Date: Sat, 24 Nov 2012 16:24:27 +0100 Lines: 44 Message-ID: References: <8i70b8d0pm6ibk03ti4t2pv60jd0bctlcs@4ax.com> <5Yidnbg3DrTK2S3NnZ2dnUVZ_uudnZ2d@earthlink.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Trace: individual.net pELMv6kCdUPj9HVuR6/zOgO7ylJKwZoRRt9xnpNHfCAi/c5Kkk7pd4B8pQzeTw5RY= Cancel-Lock: sha1:m/Azrw5HFwJd/UpOhlcH6rWnGDA= User-Agent: Mozilla/5.0 (Windows NT 6.0; WOW64; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 In-Reply-To: Xref: csiph.com comp.lang.java.programmer:19895 On 24.11.2012 12:39, Roedy Green wrote: > On Sat, 24 Nov 2012 10:21:14 -0000, "Chris Uppal" > wrote, quoted or indirectly > quoted someone who said : > >> Look into the literature on fast text searching (for instance bit-parallel >> matching). It's not entirely clear to me what Roedy is trying to do, but it >> sounds as if "bulk" matching/searching might be relevant. > > Yes a Boyer-Moore to simultaneously search for the whole list of > words, then when it has a hit see if it has word in isolation rather > than a word fragment. Here's another approach: 1. fill a HashMap with the translations. 2. Create a tree or trie from the keys. 3. Convert the trie to a regular expression optimized for NFA automata (such as is used in Java std. library). 4. Surround the regexp with additional regexp to ensure word matches and probably exclude matching inside HTML tags 5. Scan the document with Matcher.find() The idea of item 3 is to create a regexp with as little backtracking as possible. For example, from foo foot fuss you make (?:f(?:oot?)|uss) Not sure though whether it is dramatically faster or slower than a standard string search like Boyer-Moore - probably not. Kind regards robert -- remember.guy do |as, often| as.you_can - without end http://blog.rubybestpractices.com/