Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Re: Please provide a learning path for mastering lexical analysis languages
Date: Sun, 22 May 2022 12:12:32 +0300
Organization: Compilers Central
Lines: 169
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-05-043@comp.compilers>
References: <22-05-010@comp.compilers> <22-05-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="48522"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, design
Posted-Date: 22 May 2022 13:00:55 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3016

My apologies for this delayed reply.  The startup I am working for is
nearing the alpha release and I had things on my plate that I needed
to give priority.

Let me start my answer with two simple rules of thumb, in order of importance.

1) Keep your lexer simple.  Follow Einstein's famous suggestion (on a
slightly different topic).  Make it as simple as possible, but no
simpler.

2) Respect the lexer/parser boundary.  Try to avoid mixing the two
phases.  That will help with item 1.

And to that regard, don't look for a more powerful lexing mechanism
when the job should be done in the parser.  The exception is when it
makes the combination simpler (i.e. you make your lexer just a little
more complicated, but it makes the parser much simpler (or vice
versa).

So, I wouldn't go looking for lots of ways to make the lexer more
powerful.  Don't go looking for a flame thrower to light your barbecue
grill.

If you are seriously interested in regular expressions, I would
recommend Friedl's book: Mastering Regular Expressions.  That talks
about nearly all the variations at a detailed level.  But realize that
the PCRE extensions (e.g. backreferences and zero-width assertions)
are not generally available in lexical analysis tools.

Those mechanisms are often overkill.  In Snort, someone used them for
matching quotes around expressions.  Simpler would have been to write
two rules, one for matched 's and one for matched "s.  It probably
would have run faster too.  But someone saw that Snort rules allowed
PCRE expressions and thought they would be clever.

That goes to point 1.  Write simple rules for each token.  And, it is
often better to have more rules that all generate the same token than
to try and make one complicated regular expression that somehow
combines them all.  And, if you find yourself writing a complicated
regular expression or a list of more than 5 different rules for the
same token.  Think twice about the problem you are solving.  You might
be doing the wrong thing at the wrong level.  (Worth noting here that
this point is true for language design.  If your language design
starts forcing you to invent new technologies or use complicated
existing ones, ask if you are actually making the language simple to
use, often it's an indication of a mistake in your language design.
Now, if you are implementing an existing language, you may have no
choice, but realize that the original designer may have made a mistake
and you are just dealing with it.)

By the way, don't be afraid to borrow or steal good ideas and reuse
them.  I have essentially written one lexer in Yacc++ and use it for
all languages with only minor tweaks.  Now, I'm not trying to parse
Fortran 66 with it, but I have used it for Yacc++ itself, C/C++/C#
dialects, Verilog, Pascal, Javascript, BASIC, and lots of tiny tools.
I will likely use it for Rust with only the smallest of tweaks.  I
reuse that lexer for nearly everything.

The structure of that lexer is quite simple.

There are only a small set of rules that are actual regular
expressions: identifiers, numbers (integers and floats distinguished),
strings, whitespace, and comments.

Punctuation is done by simple rules listing the legal "operators".
Note, if I had operators like "&amp;" I would make a regular
expression for that in the first set.

Note that keywords I handle separately.  This is based upon advice
Frank Deremer once gave.  The lexer should be a scanner (regular
expression part) and a screener (looking for keywords and the likes).
Those are two separate tasks and splitting them often makes both
simpler.  (In fact, it's one of my few gripes about ANTLR4, as it
doesn't make doing that easy.  We built that model, split
scanner/screener, into Yacc++ and it was one of our better choices.
You can make a more powerful screener without making the scanner more
complex and vice-versa.  That goes back to the idea of making things
simple by doing only one thing at a time.)

--------

Now, since I know you are working on HTML or something similar, I will
give one piece of advice where you want to use something a little more
powerful.  In markdown type languages, you often have two distinct
portions, the markdown which is a little language that you want to
tokenize and the rest of the text which you don't want to tokenize (or
at least not the same way).  That does suggest using lexer states (or
some similar mechanism).  You are really writing two lexers.  One for
inside markdown and another for the rest of the text.  Don't try to
mix them.  The markdown is usually clearly marked e.g. between < and >
or on lines that start with a special character (#) or something
similer.  So, markdown is one lexer state (or complete lexer) with one
set of rules, while the rest of text is another lexer with a different
set of rules, and the only tricky part is knowing when to switch
between them (and in a good language, it is actually obvious).

But, notice how that is a different implementation of separation of
concerns.  You have two lexers (or two lexer states).  Each one doing
one job and doing it simply.  The hard part is the switching between
them and that's not usually a complex regular expression, but the
requirement to do something that is more "infrastructure" related.
And, see how it answers your question about how to not lex "&amp;" in
text outside of markdown.    You are simply using a different lexer
[state].  It doesn't know about "&amp;" and doesn't attempt to
tokenize it.

------

Since, I mentioned screeners above, I will talk a bit about them.  The
simplest model is a function you can call upon matching an identifer
(or another token you want to screen, e.g. maybe you have comments
that can be "Javadocs" or decorations and you want to process them).
This function can look the token up (or lex and parse it) and decide
to change the token type (or even insert multiple tokens) before
handing the token from the lexer to the parser.  Note, if the function
is complicated, you might be essentially implementing a preprocessor.
(So, "#.*\n" could be recognized as a token, but the screener for it
is the "C preprocessor").   And, the point here is that it is
sometimes useful to insert something between the lexer and the parser
(and maybe only for specific tokens.)  Another use of a screener is
the C "typedef hack" where you look your identifier up in the symbol
table and see if it is a typedef and return a different token for it
in that case.  That allows your screen to take feedback from the
parser without exposing it down to the scanner.  Still, notice the
separation of concerns to keep each part simpler.

-----

Now, once you have a base simple lexer you are often done (or at least
mostly done).  Then, you can worry about the "mistakes".  Comment
processing is often a more complicated expression than one would like.
It's worse if you allow nested comments that need balanced delimiters,
e.g. "/*" and "*/".  Balanced delimiters are a Dyck language and take
you out of the regular expression world.  You can do them with lexer
states (if you can stack them), but it is easier if you simply have a
CFG language (LL, LR, or PEG) to do so.  Captures and backreferences
can sometimes do it, provided they have a stack like quality.

We had a similar issue in Yacc++, we handle nested braces as a single
token "code".  That makes the parser simpler.  That token has matched
brackets, strings, and comments inside.   It is much more complicated
than one really wants.  It is essentially a mini-parser inside our
lexers.  And, it means we can't really use braces as tokens
themselves.  It's a sign that as language designers we make a mistake.
  It probably would have been better to use a lexical state model, see
a brace "{" and go into a separate lexer to solve it.   Sometimes
having a flame thrower is not the right solution.

------

A slightly different mistake was made in the original yacc design.
They make semicolons optional at the end of the rule.  This means you
[sometimes] need "identifier:" to recognize the start of a rule.  That
makes the lexer more complex.  Other languages have played with
treating newline as a semicolon which has its own issue, but these are
often at the parser level not the lexer.

I'm sure I have more to say on this topic, but also I have said more
than enough....

Kind regards,
Chris

******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------