Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!border2.nntp.ams2.giganews.com!border4.nntp.ams.giganews.com!border2.nntp.ams.giganews.com!border3.nntp.ams.giganews.com!Xl.tags.giganews.com!border1.nntp.ams.giganews.com!nntp.giganews.com!local2.nntp.ams.giganews.com!nntp.bt.com!news.bt.com.POSTED!not-for-mail NNTP-Posting-Date: Sun, 25 Nov 2012 07:50:43 -0600 From: "Chris Uppal" Newsgroups: comp.lang.java.programmer References: <8i70b8d0pm6ibk03ti4t2pv60jd0bctlcs@4ax.com> <5Yidnbg3DrTK2S3NnZ2dnUVZ_uudnZ2d@earthlink.com> Subject: Re: optimsed HashMap Date: Sun, 25 Nov 2012 13:50:20 -0000 X-Priority: 3 X-MSMail-Priority: Normal X-Newsreader: Microsoft Outlook Express 6.00.2900.5512 X-RFC2646: Format=Flowed; Response X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.5512 Message-ID: Lines: 40 X-Usenet-Provider: http://www.giganews.com X-AuthenticatedUsername: NoAuthUser X-Trace: sv3-4jpYqdmJybJ5HVrRHtd3lTMEcn9Wazp+3amyeC58VUvWHQaFRjubQemlt4jvwaNYpq1RIutRKyhQD17!RskPiDYKhm3jdt8AGGIQz6eYJSxBqRn3klpaYw5GjLj5m01Q1/M9uplo/FruIZqXpG+MAoOJ5o0= X-Complaints-To: abuse@btinternet.com X-DMCA-Complaints-To: abuse@btinternet.com X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly X-Postfilter: 1.3.40 X-Original-Bytes: 2587 Xref: csiph.com comp.lang.java.programmer:19934 Robert Klemme wrote: > 1. fill a HashMap with the translations. > 2. Create a tree or trie from the keys. > 3. Convert the trie to a regular expression optimized for NFA automata > (such as is used in Java std. library). > 4. Surround the regexp with additional regexp to ensure word matches and > probably exclude matching inside HTML tags > 5. Scan the document with Matcher.find() > > The idea of item 3 is to create a regexp with as little backtracking as > possible. For example, from > > foo > foot > fuss > > you make > > (?:f(?:oot?)|uss) Alternatively, use a real DFA matcher which will do all of that work for you as it creates a minimal (or maybe only nearly mininal) DFA. > Not sure though whether it is dramatically faster or slower than a standard > string search like Boyer-Moore - probably not. As I mentioned elsewhere BM is no longer considered the best (there's a nice, albeit somewhat specialised, book by Navarro and Raffinot "Flexible Pattern Matching in Strings"). But my gut feeling (another way of saying that I haven't bothered to test it) is that IO would probably dominate the exection time whatever algorithm was used (assuming it wasn't implemented inefficiently). -- chris -- chris