Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!border2.nntp.ams2.giganews.com!border4.nntp.ams.giganews.com!border2.nntp.ams.giganews.com!border3.nntp.ams.giganews.com!Xl.tags.giganews.com!border1.nntp.ams.giganews.com!nntp.giganews.com!local2.nntp.ams.giganews.com!nntp.bt.com!news.bt.com.POSTED!not-for-mail
NNTP-Posting-Date: Sun, 25 Nov 2012 07:50:43 -0600
From: "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Newsgroups: comp.lang.java.programmer
References: <8i70b8d0pm6ibk03ti4t2pv60jd0bctlcs@4ax.com> <5Yidnbg3DrTK2S3NnZ2dnUVZ_uudnZ2d@earthlink.com> <AOKdncX-1L4NPC3NnZ2dnUVZ8k2dnZ2d@bt.com> <p7c1b8dk6uebqjvil7iuo56cq40j0haj94@4ax.com> <ahc75eFn0gqU1@mid.individual.net>
Subject: Re: optimsed HashMap
Date: Sun, 25 Nov 2012 13:50:20 -0000
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5512
X-RFC2646: Format=Flowed; Response
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.5512
Message-ID: <fNCdnchke6quvy_NnZ2dnUVZ8hCdnZ2d@bt.com>
Lines: 40
X-Usenet-Provider: http://www.giganews.com
X-AuthenticatedUsername: NoAuthUser
X-Trace: sv3-4jpYqdmJybJ5HVrRHtd3lTMEcn9Wazp+3amyeC58VUvWHQaFRjubQemlt4jvwaNYpq1RIutRKyhQD17!RskPiDYKhm3jdt8AGGIQz6eYJSxBqRn3klpaYw5GjLj5m01Q1/M9uplo/FruIZqXpG+MAoOJ5o0=
X-Complaints-To: abuse@btinternet.com
X-DMCA-Complaints-To: abuse@btinternet.com
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 2587
Xref: csiph.com comp.lang.java.programmer:19934

Robert Klemme wrote:

> 1. fill a HashMap with the translations.
> 2. Create a tree or trie from the keys.
> 3. Convert the trie to a regular expression optimized for NFA automata
> (such as is used in Java std. library).
> 4. Surround the regexp with additional regexp to ensure word matches and
> probably exclude matching inside HTML tags
> 5. Scan the document with Matcher.find()
>
> The idea of item 3 is to create a regexp with as little backtracking as
> possible.  For example, from
>
> foo
> foot
> fuss
>
> you make
>
> (?:f(?:oot?)|uss)

Alternatively, use a real DFA matcher which will do all of that work for you as 
it creates a minimal (or maybe only nearly mininal) DFA.

> Not sure though whether it is dramatically faster or slower than a standard 
> string search like Boyer-Moore - probably not.

As I mentioned elsewhere BM is no longer considered the best (there's a nice, 
albeit somewhat specialised, book by Navarro and Raffinot "Flexible Pattern 
Matching in Strings").  But my gut feeling (another way of saying that I 
haven't bothered to test it) is that IO would probably dominate the exection 
time whatever algorithm was used (assuming it wasn't implemented 
inefficiently).

    -- chris


    -- chris