State-of-the-art algorithms for lexical analysis?

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups	comp.compilers
Subject	State-of-the-art algorithms for lexical analysis?
Date	2022-06-06 21:16 +0300
Organization	Compilers Central
Message-ID	<22-06-013@comp.compilers> (permalink)
References	<22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers>

Show all headers | View raw

As our moderator wisely states:

> Regular expressions have the advantage that once you've paid the one-time cost
> of making a DFA, the matching is extremely fast. Yhe lexer is usually
> one of the slowest parts of a compiler since it is the only part that has to
> look at each character of the source program, so this is a place where speed
> matters.

And, for most cases they really are sufficient, and it really behooves
one to stay within those limits. Why? Because when you get a syntax
error at the lexical level, which is surprisingly frequent unless you
never mistype closing quotes, you get whole sections of your code
misparsed and rarely does the compiler error correction help much.
Other single character errors , not . missing or extra ( { [ or ] } )
or ; have similar disastrous effects on program meaning, often not
detected until much later.

And, as I mentioned before, having the lexer be simply a scanner and
putting any extra semantics into a separate screener (per Frank
Deremer's recommendation) makes it all much simpler. You end up with
small state machines with very few states that easily fit in even
small machine caches or can be turned into circuitry, FPGAs or ASICs
that use minimal numbers of gates. Those things can often run as fast
as you can read the text in. And the screener being much less frequent
can do more complex things without imposing a significant penalty. The
screener is essentially running at parser speed and only looking at
"long" tokens not single (or double) character ones.

And sadly, you cannot go very much faster. Too often the transitions
occur at single character boundaries. One is lucky when it is a
two-character sequence and longer sequences terminating a token are
rare enough to be in the measurement noise. I know because I tried to
adapt the Boyer-Moore ideas once (skip and reverse) and found that
they were essentially ineffective for tokenization. They might apply
occasionally in parsing, but that's not as much of a performance hog.

Unless you are interested in dealing with nested comments or something similar,
you don't need a stack in your lexer and so no reason to do LL or LR parsing.
(Yes, we extended our Yacc++ lexer to do LR parsing but with special casing so
that the stack cost was only there if you had recursive productions and only
tracked the start of the recursive production so that you were staying in DFA
mode essentially all the time.  And, while that helped us in a few cases, it
isn't something I would say was important nor recommend.)  The only place
I might have found it interesting is if we made it recognize tokens inside of
strings or comments for use in error correction to help with the missing close
character cases.  That might have made it worthwhile.  But that would probably
have needed to be done only in the presence of syntax errors with a string or
comment in the recent context.

In fact, there is only thing that I have not seen a DFA lexer do that I think is
worth doing at the lexical level (and not via a screener).  That is recognizing
tokens the start with a length prefix, e.g. 10Habcdefhij.  Such tokens are
common in things like network protocols and they would be relatively easy
to implement, but I've not seen it done.

Beyond that it is my relatively firm belief that one should almost always
have only simple regular expressions, e.g. that the one for floating point
numbers should be one of the most complex ones.  Otherwise you are trying
to do too much in the scanner.  And you are asking for trouble when you do.

Kind regards,
Chris

Thread

State-of-the-art algorithms for lexical analysis? Roger L Costello <costello@mitre.org> - 2022-06-05 20:53 +0000
  Re: State-of-the-art algorithms for lexical analysis? gah4 <gah4@u.washington.edu> - 2022-06-05 16:05 -0700
    Re: State-of-the-art algorithms for lexical analysis? Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2022-06-06 08:59 +0200
      State-of-the-art algorithms for lexical analysis? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-06-06 21:16 +0300
        Re: State-of-the-art algorithms for lexical analysis? Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2022-06-07 06:52 +0200
          Re: State-of-the-art algorithms for lexical analysis? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-06-07 19:40 +0300
            Re: State-of-the-art algorithms for lexical analysis? Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2022-06-08 05:32 +0200
              Re: counted strings, was State-of-the-art algorithms for lexical analysis? gah4 <gah4@u.washington.edu> - 2022-06-09 11:54 -0700
                Re: counted characters in strings "Robin Vowels" <robin51@dodo.com.au> - 2022-06-10 12:21 +1000
                Re: counted characters in strings Martin Ward <martin@gkc.org.uk> - 2022-06-11 10:52 +0100
                Re: counted characters in strings drb@msu.edu (Dennis Boone) - 2022-06-11 11:09 -0500
    Re: State-of-the-art algorithms for lexical analysis? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-06-06 16:00 +0000
    References for PSL ? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-06-06 20:11 +0300

csiph-web