Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Error correction -- was State-of-the-art algorithms for lexical analysis? Date: Tue, 7 Jun 2022 09:22:13 +0300 Organization: Compilers Central Lines: 98 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-06-016@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="28573"; mail-complaints-to="abuse@iecc.com" Keywords: lex, errors Posted-Date: 07 Jun 2022 10:52:44 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3054 I wouldn't do it as part of lexical analysis per se, but approximate matching (Levershtein distances and related measures) can be optimized. Some of the seminal work in that area was done by Wu-Mamber. They effectively shown how to make NFAs that are insensitive to errors like insertions and deletions. Like many excellent advances, the idea is wonderfully simple once you have seen it. I would start with this wiki page if interested: https://en.wikipedia.org/wiki/Agrep Notably Villa Laurkkari had done a variant of his TRE tool which handles PCRE extensions to include approximate matches. You might start with his Master'sThesis: https://laurikari.net/ville/regex-submatch.pdf The FREJ (Fuzzy regular expressions for Java) library mentioned has its own idiosyncratic notation, but is essentially a subset of PCRE and cannot do anything that PCRE notation cannot. ----------- All of this gets more use in the computational biology area. They divide regular expressions into two portions. The fixed portion (i.e. parts that can match only finite strings (no closures, no + or *, but . and ?) are called Network Expressions) and the portions that can match indefinite strings (parts with closures + or * operators) are called separators. One can argue about {m,n} constructs (counted repetitons), if they are long enough calling them spacers is more efficient. Michella Becchi did lots of work in this area, especially some related to efficient hardware processing of spacers including counted spacers. You might start here: https://regex.wustl.edu/indexd80ad80a.html?title=Regular_Expression_Processor At Intel, she was interning with me there while at WashU (St Louis), we did some other work on "Bushy" v. "Stringy" regular expressions based upon insights from studying the Aho-Corasick algorithm. After a spacer (actually forming the tail of the spacer to be precise), there tends to be a "bushy" part where the regular expression has high fan-out (and the fail edges of the AC algorithm tend to return to that section). It tends to be short and wide. Once the regular expression has been trimmed to one (or a few) candidates, the fan-out essentially becomes linear and it is long and narrow. It can be desirable to optimize these sections separately possibly even using different representations. I don't think we ever published a paper about that, although you can infer it from some of the patents Intel holds in that area, where we optimized hardware based upon that distinction. You can also see it in LL/LR parsing algorithms. Wherever you have an embedded non-terminal, you get a bushy state as you process the "first" items of that non-terminal. If the grammar is LL(1), that set of bushy states is 1 long, if LL(2) 2 bushy states, etc. A similar effect applies to captures and back-references. In fact, that's my next "project" (should I ever find I have spare time). Captures (and back-references) work like non-terminals. The regular expression in the capture is the definition of the non-terminal. The main distinction is that the capture and its back-reference need to "match" the same string. You can do that either by comparing them or if you are really ambitious, building a "stringy" finite machine on the fly, maybe using Boyer-Moore techniques to optimize it. Suffix tries might be useful there. -------- There is very interesting work in this area. However, I wouldn't use much of it for lexical analysis. the reason being simple. Tokens in lexical analysis all touch (abut) each other. The closest you get to this area (aside from error correction) is dealing with whitespace and comments which you can treat like spacers, but it is really overkill for that. I suppose, if you have an indentation sensitive language, you might look at the counted spacer work Michella did for inspiration, but again it is likely overkill. This work made sense when optimizing Snort (or maybe something ClamAV like, but that got all morphed into comparing hashes ("signatures")). Snort really abuses regular expressions to do long matches (and grammar things) but without tokenizing. So, this applies. However, anyone writing a "compiler" that way should have their head examined, because it makes a slow expensive mess out of it. Kind regards, Chris -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------