Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: State-of-the-art algorithms for lexical analysis?
Date: Mon, 6 Jun 2022 21:16:26 +0300
Organization: Compilers Central
Lines: 62
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-013@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="78500"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 06 Jun 2022 16:01:45 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3051

As our moderator wisely states:

> Regular expressions have the advantage that once you've paid the one-time cost
> of making a DFA, the matching is extremely fast. Yhe lexer is usually
> one of the slowest parts of a compiler since it is the only part that has to
> look at each character of the source program, so this is a place where speed
> matters.

And, for most cases they really are sufficient, and it really behooves
one to stay within those limits. Why? Because when you get a syntax
error at the lexical level, which is surprisingly frequent unless you
never mistype closing quotes, you get whole sections of your code
misparsed and rarely does the compiler error correction help much.
Other single character errors , not . missing or extra ( { [ or ] } )
or ; have similar disastrous effects on program meaning, often not
detected until much later.

And, as I mentioned before, having the lexer be simply a scanner and
putting any extra semantics into a separate screener (per Frank
Deremer's recommendation) makes it all much simpler. You end up with
small state machines with very few states that easily fit in even
small machine caches or can be turned into circuitry, FPGAs or ASICs
that use minimal numbers of gates. Those things can often run as fast
as you can read the text in. And the screener being much less frequent
can do more complex things without imposing a significant penalty. The
screener is essentially running at parser speed and only looking at
"long" tokens not single (or double) character ones.

And sadly, you cannot go very much faster. Too often the transitions
occur at single character boundaries. One is lucky when it is a
two-character sequence and longer sequences terminating a token are
rare enough to be in the measurement noise. I know because I tried to
adapt the Boyer-Moore ideas once (skip and reverse) and found that
they were essentially ineffective for tokenization. They might apply
occasionally in parsing, but that's not as much of a performance hog.

Unless you are interested in dealing with nested comments or something similar,
you don't need a stack in your lexer and so no reason to do LL or LR parsing.
(Yes, we extended our Yacc++ lexer to do LR parsing but with special casing so
that the stack cost was only there if you had recursive productions and only
tracked the start of the recursive production so that you were staying in DFA
mode essentially all the time.  And, while that helped us in a few cases, it
isn't something I would say was important nor recommend.)  The only place
I might have found it interesting is if we made it recognize tokens inside of
strings or comments for use in error correction to help with the missing close
character cases.  That might have made it worthwhile.  But that would probably
have needed to be done only in the presence of syntax errors with a string or
comment in the recent context.

In fact, there is only thing that I have not seen a DFA lexer do that I think is
worth doing at the lexical level (and not via a screener).  That is recognizing
tokens the start with a length prefix, e.g. 10Habcdefhij.  Such tokens are
common in things like network protocols and they would be relatively easy
to implement, but I've not seen it done.

Beyond that it is my relatively firm belief that one should almost always
have only simple regular expressions, e.g. that the one for floating point
numbers should be one of the most complex ones.  Otherwise you are trying
to do too much in the scanner.  And you are asking for trouble when you do.

Kind regards,
Chris