Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Error correction -- was State-of-the-art algorithms for lexical analysis?
Date: Tue, 7 Jun 2022 09:22:13 +0300
Organization: Compilers Central
Lines: 98
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-016@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="28573"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, errors
Posted-Date: 07 Jun 2022 10:52:44 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3054

I wouldn't do it as part of lexical analysis per se, but approximate
matching (Levershtein distances and related measures) can be
optimized.

Some of the seminal work in that area was done by Wu-Mamber.  They
effectively shown how to make NFAs that are insensitive to errors like
insertions and deletions.  Like many excellent advances, the idea is
wonderfully simple once you have seen it.

I would start with this wiki page if interested:

https://en.wikipedia.org/wiki/Agrep

Notably Villa Laurkkari had done a variant of his TRE tool which
handles PCRE extensions to include approximate matches.

You might start with his Master'sThesis:

https://laurikari.net/ville/regex-submatch.pdf

The FREJ (Fuzzy regular expressions for Java) library mentioned has
its own idiosyncratic notation, but is essentially a subset of PCRE
and cannot do anything that PCRE notation cannot.

-----------

All of this gets more use in the computational biology area.  They
divide regular expressions into two portions.  The fixed portion (i.e.
parts that can match only finite strings (no closures, no + or *, but
. and ?) are called Network Expressions) and the portions that can
match indefinite strings (parts with closures + or * operators) are
called separators.  One can argue about {m,n} constructs (counted
repetitons), if they are long enough calling them spacers is more
efficient.

Michella Becchi did lots of work in this area, especially some related
to efficient hardware processing of spacers including counted spacers.
You might start here:

https://regex.wustl.edu/indexd80ad80a.html?title=Regular_Expression_Processor

At Intel, she was interning with me there while at WashU (St Louis),
we did some other work on "Bushy" v. "Stringy" regular expressions
based upon insights from studying the Aho-Corasick algorithm.  After a
spacer (actually forming the tail of the spacer to be precise), there
tends to be a "bushy" part where the regular expression has high
fan-out (and the fail edges of the AC algorithm tend to return to that
section).  It tends to be short and wide.  Once the regular expression
has been trimmed to one (or a few) candidates, the fan-out essentially
becomes linear and it is long and narrow.  It can be desirable to
optimize these sections separately possibly even using different
representations.  I don't think we ever published a paper about that,
although you can infer it from some of the patents Intel holds in that
area, where we optimized hardware based upon that distinction.

You can also see it in LL/LR parsing algorithms.  Wherever you have an
embedded non-terminal, you get a bushy state as you process the
"first" items of that non-terminal.  If the grammar is LL(1), that set
of bushy states is 1 long, if LL(2) 2 bushy states, etc.

A similar effect applies to captures and back-references.  In fact,
that's my next "project" (should I ever find I have spare time).
Captures (and back-references) work like non-terminals.  The regular
expression in the capture is the definition of the non-terminal.  The
main distinction is that the capture and its back-reference need to
"match" the same string.  You can do that either by comparing them or
if you are really ambitious, building a "stringy" finite machine on
the fly, maybe using Boyer-Moore techniques to optimize it.  Suffix
tries might be useful there.

--------

There is very interesting work in this area.  However, I wouldn't use
much of it for lexical analysis.  the reason being simple.  Tokens in
lexical analysis all touch (abut) each other.  The closest you get to
this area (aside from error correction) is dealing with whitespace and
comments which you can treat like spacers, but it is really overkill
for that.  I suppose, if you have an indentation sensitive language,
you might look at the counted spacer work Michella did for
inspiration, but again it is likely overkill.

This work made sense when optimizing Snort (or maybe something ClamAV
like, but that got all morphed into comparing hashes ("signatures")).
Snort really abuses regular expressions to do long matches (and
grammar things) but without tokenizing.  So, this applies.  However,
anyone writing a "compiler" that way should have their head examined,
because it makes a slow expensive mess out of it.

Kind regards,
Chris

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------