Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: RE: What stage should entities be resolved?
Date: Thu, 10 Mar 2022 12:54:01 +0200
Organization: Compilers Central
Lines: 84
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-026@comp.compilers>
References: <22-03-019@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="38032"; mail-complaints-to="abuse@iecc.com"
Keywords: design
Posted-Date: 11 Mar 2022 14:49:25 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2930

There is a simple rule of thumb that answers your question.

The lexer deals with all issues related to characters.
The parser deals with all issues related to tokens. It should rarely
look at the spelling (characters) of a token.
The semantic analysis phase should deal with all issues related to the
AST.  It should rarely look at individual tokens.

At each phase, the preceding phase should have resolved the issues at
the lower level and at worse converted them into a unique number
associated with the output units.

So, for example from lexer to parser:
That is a character string can be distinguished from an integer from
an identifier from a keyword after the lexer and each character string
and integer and identifier should have a unique id number, but the
value of the integer should rarely be looked at by the parser (except
in some very odd and rare cases, e.g. only an integer zero might be
allowed in certain contexts, so zero can be distinguished from 42 but
the parser doesn't need to know the value of 42, just that it is a
different value than 43.  So, 42 and 43 should be the same kind of
token, just with different unique ids.  And if 42.0 and 42.000 are the
same, then they should be the same kind of tokens (and may or may not
need to have the same unique id).  You may or may not care about that
distinction until much later, e.g. in the optimizer or code generator.

Similarly, from parser to semantic analysis.
The trees built should be identified such that the semantic analysis
doesn't generally look at the specific tokens, just the trees
themselves. Is this an expression or a statement?  Does the "type" of
the expression map the "type" of variable that is being assigned to,
or do we need to insert a conversion or signal a semantic error?

Each phase has its own complexities.  You don't want to be carrying
the complexities of one phase over to the next.  Sometimes you have no
choice but to do so, but you want to make that rare.

That said, you also want to keep each phase as simple as possible.
Notice how in my previous posting I mentioned splitting the scanner
and the screener into separate parts.  So, it might be that resolving
special character constructs (e.g. &amp;) might be done outside the
scanner per se, but still part of lexing).  Figure out what makes the
code easiest to read.  And, with that implemented and having a working
system, then do the measurements that tell you whether you have a
bottleneck somewhere that you need to fix (possibly by putting code
together that was originally distinct).  But having a working and
understandable first cut should be your first priority.  Then, you
have something you can use to validate your refactorings against and
make certain you haven't taken an invalid shortcut that breaks
something.

Moreover, most of the time, the simplest code is also the fastest.
You will likely spend more time in the scanner than nearly any other
phase, because it has to touch each character.  So, the less is has to
do with the majority of tokens the better.  Simple code that gets the
text divided into tokens and roughly categorizes them will be fast at
doing so.  Then if you have tokens like character strings that need
special processing, recognizing "escape sequences", you can fix them
up in the code that deals only with character strings, and before you
hand those strings off to the parser.  Just as you can map identifiers
into keywords after figuring out what the identifier is.  This is the
point of having a "screener" after the scanner, but part of the lexer.

-------

One final comment, related to a different post but relevant here.
There are times, you might push the result later.  The "go" "to"
versus "goto"  example might be such a case.  You can easily make the
parser reconize both a two keyword sequence "go" "to" and a one
keyword sequence "goto" are synonymous and trying to patch the tokens
back in the lexer can be more complex, e.g. some bizarre dialect of
BASIC:

if i < 5 go to 100;
versus
for i = go to end { a[i] =2*i }

-
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------