From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Keywords and Reserved Words
Date: Tue, 8 Mar 2022 21:46:44 +0200
Organization: Compilers Central
Message-ID: <22-03-017@comp.compilers>
References: <22-03-004@comp.compilers> <22-03-009@comp.compilers> <22-03-015@comp.compilers> <22-03-016@comp.compilers>
Keywords: syntax, parse, comment

> It is also interesting to see how languages without reserved
> words, keep track of which words have the keyword meaning,
> and which are ordinary names.

This is a topic close to my heart.  There are actually a variety of
methods that work relatively well in this regard.

-----------------------------------------------

First, recognizing keywords.  If you are using a language where spaces
are significant (unlike the original FORTRAN, where they were not), it
is useful to follow a model proposed by Frank DeRemer, with lexing
divided into a scanner and a screener.  The scanner has only the
responsibility of dividing the input into discrete tokens and labelling
those which have fixed types.  Keyword recognition is not part of the
scanner; to the scanner, keywords are just identifiers.  The screener
then checks each identifier to see if it is a keyword.
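
The scanner/screener split can be sketched in a few lines of Python.
This is only an illustration under assumptions of my own: the keyword
set, the token kinds, and the function names scan and screen are made
up for the example and are not taken from DeRemer's work or Yacc++.

```python
import re

# Hypothetical keyword set, for illustration only.
KEYWORDS = {"if", "then", "else", "while", "do"}

# One token per match: an identifier, a number, or a single punctuation character.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<IDENTIFIER>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<PUNCT>[^\sA-Za-z_\d]))"
)

def scan(text):
    """Scanner: split text into (kind, lexeme) pairs. It knows nothing about keywords."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"cannot tokenize at offset {pos}")
        yield m.lastgroup, m.group(m.lastgroup)
        pos = m.end()

def screen(tokens):
    """Screener: relabel identifiers that turn out to be keywords."""
    for kind, lexeme in tokens:
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            yield lexeme.upper(), lexeme   # e.g. ("IF", "if")
        else:
            yield kind, lexeme

tokens = list(screen(scan("if x1 then while_2 = 42")))
```

Note that scan alone labels "if" as an IDENTIFIER; only screen turns it
into an IF token, so the keyword list can change without touching the
scanner.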

In most languages, that lookup can easily be done by preloading the
symbol table with the identifiers that are keywords.  This is not
inefficient, because even if the identifier is not a keyword, you need
a unique symbol table entry for each identifier in any case, so you
are doing that lookup anyway.

The only wrinkle here is the case (which we have in Yacc++) where you
want your keywords to be case insensitive and your identifiers to be
case sensitive.  You do get an extra lookup in that case, but only the
first time an identifier is seen.  Once it is seen, you have marked
the symbol table entry with whether it is a keyword or not.
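
As a rough sketch of that arrangement (one possible layout, not how
Yacc++ actually implements it), here is a Python symbol table that is
preloaded with keywords and does the case-folded lookup only on the
first sighting of a spelling.  The names Symbol, SymbolTable, intern,
and the folded_lookups counter are hypothetical and exist only for the
example.

```python
class Symbol:
    """One entry per distinct (case-sensitive) spelling."""
    def __init__(self, name, keyword=None):
        self.name = name        # exact spelling as written
        self.keyword = keyword  # keyword token kind, or None for a plain identifier

class SymbolTable:
    def __init__(self, keywords):
        # Preload the keywords, keyed by their case-folded spelling.
        self._keywords = {k.lower(): k.upper() for k in keywords}
        self._symbols = {}
        self.folded_lookups = 0  # counts the extra lookups, for demonstration

    def intern(self, spelling):
        """Return the unique Symbol for this spelling, creating it if needed."""
        sym = self._symbols.get(spelling)   # the lookup every identifier needs anyway
        if sym is None:
            # First sighting: pay for one extra case-insensitive lookup and
            # record the answer in the entry.
            self.folded_lookups += 1
            sym = Symbol(spelling, self._keywords.get(spelling.lower()))
            self._symbols[spelling] = sym
        return sym

table = SymbolTable(["token", "keyword"])
first = table.intern("Token")
again = table.intern("Token")
```

Here "Token", "TOKEN", and "token" all resolve to the same keyword kind,
while "Total" and "total" remain two distinct identifiers.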

-----------------------------------------------

Now, the inverse problem.  For keywords that are reserved words,
there is nothing more to do.  However, if you only want your keywords
to be recognized as keywords in specific contexts, then you need the
second part of the solution.  This is a [set of] parser rule[s] that
turns keywords back into identifiers if they aren't being used in a
context where they have a special meaning.

This rule looks something like this:

identifier: IDENTIFIER | IF | THEN | ELSE | WHILE | DO | ... ;

That is, you define an identifier non-terminal that matches either an
IDENTIFIER token or any of your KEYWORD[ token]s.  And, when you want
an identifier, you use the non-terminal rather than the token in your
rule.

Now, this may cause "conflicts" or "ambiguities" and you resolve that
by figuring out which keywords are special in that context and making
an alternate identifier_in_this_context rule, which omits the
conflicting keywords from its list.  (And, by the way, this is one
reason to use a parser generator, because it will report those
conflicts to you and tell you when you have some keyword that is still
an issue in some context.  The checking that a parser generator does
helps you not make mistakes.)  And, in most cases the parser's
lookahead can figure out whether you mean the keyword or the
identifier, so that you rarely have to make those extra rules that
exclude specific keywords from the list.
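
To make the identifier non-terminal concrete, here is a small
hand-written Python sketch.  The assignment rule, the token
representation as (kind, lexeme) pairs, and all the function names are
assumptions made for the example.  Being hand-written, it gets none of
the conflict checking described above; it only shows the shape of the
rule.

```python
# Keyword token kinds that may be turned back into identifiers.
KEYWORD_KINDS = {"IF", "THEN", "ELSE", "WHILE", "DO"}

def parse_identifier(tokens, pos):
    """identifier: IDENTIFIER | IF | THEN | ELSE | WHILE | DO ;"""
    kind, lexeme = tokens[pos]
    if kind == "IDENTIFIER" or kind in KEYWORD_KINDS:
        return lexeme, pos + 1
    raise SyntaxError(f"expected an identifier, got {lexeme!r}")

def expect_punct(tokens, pos, text):
    """Consume one punctuation token with the given spelling."""
    kind, lexeme = tokens[pos]
    if kind == "PUNCT" and lexeme == text:
        return pos + 1
    raise SyntaxError(f"expected {text!r}, got {lexeme!r}")

def parse_assignment(tokens, pos=0):
    """Hypothetical rule: assignment: identifier "=" identifier ";" ;"""
    target, pos = parse_identifier(tokens, pos)
    pos = expect_punct(tokens, pos, "=")
    source, pos = parse_identifier(tokens, pos)
    pos = expect_punct(tokens, pos, ";")
    return ("assign", target, source), pos

# The screener has already labelled "if" and "then" as keywords, but in
# this context the parser accepts them as ordinary names.
tokens = [("IF", "if"), ("PUNCT", "="), ("THEN", "then"), ("PUNCT", ";"), ("EOF", "")]
tree, end = parse_assignment(tokens)
```

An identifier_in_this_context variant would simply test against a
smaller set than KEYWORD_KINDS.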

-----------------------------------------------

The main place one gets in trouble is when you have a [set of]
keyword[s] that is/are optional before a list of identifiers. In that
case, your identifier list cannot start with one of those keywords.
However, a better solution (if you are designing your own language) is
to put the optional keywords before a mandatory one.

For example, in Yacc++ we have a KEYWORD definition, and a variation
on it for SUBSTRING keywords.  But we don't use SUBSTRING as a
declaration by itself, so the rule for the declaration of a list of
keywords looks like:

keyword_declaration:
    SUBSTRING? KEYWORD identifier ("=" number)?
    (","? identifier ("=" number)?)* ";" ;

If we did it in the other order, we couldn't allow substring as the
first identifier in the list.
That is:

keyword_declaration:
    KEYWORD SUBSTRING? identifier_but_not_substring ("=" number)?
    (","? identifier ("=" number)?)* ";" ;

And, note, if we had hand-written a recursive descent parser, we
wouldn't have been warned of that edge case.  Either we would have
mis-parsed it or we would have that weird exception in our language.
Neither is ideal from my point of view.
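
For illustration, here is a hand-written Python sketch of the first
ordering (SUBSTRING? KEYWORD identifier ...).  It is not the Yacc++
implementation; the tokenizer, the function names, and the result shape
are assumptions made for the example.  Because the mandatory KEYWORD
comes after the optional SUBSTRING, a leading "substring" can only be
the modifier, and every later "substring" or "keyword" is free to be an
ordinary name.

```python
import re

# Token kinds that the identifier non-terminal accepts.
IDENT_KINDS = {"IDENTIFIER", "SUBSTRING", "KEYWORD"}

def tokenize(text):
    """Scan and screen in one pass; keywords are matched case-insensitively."""
    kinds = {"substring": "SUBSTRING", "keyword": "KEYWORD"}
    toks = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=,;]", text):
        if lexeme.isdigit():
            kind = "NUMBER"
        elif lexeme in ("=", ",", ";"):
            kind = lexeme
        else:
            kind = kinds.get(lexeme.lower(), "IDENTIFIER")
        toks.append((kind, lexeme))
    toks.append(("EOF", ""))
    return toks

def parse_keyword_declaration(text):
    """Return (is_substring, [(name, value_or_None), ...])."""
    toks = tokenize(text)
    pos = 0

    def take(kind):
        nonlocal pos
        if toks[pos][0] != kind:
            raise SyntaxError(f"expected {kind}, got {toks[pos][1]!r}")
        pos += 1
        return toks[pos - 1][1]

    def entry():
        # identifier ("=" number)?
        nonlocal pos
        if toks[pos][0] not in IDENT_KINDS:
            raise SyntaxError(f"expected an identifier, got {toks[pos][1]!r}")
        name = toks[pos][1]
        pos += 1
        value = None
        if toks[pos][0] == "=":
            pos += 1
            value = int(take("NUMBER"))
        return name, value

    is_substring = toks[pos][0] == "SUBSTRING"   # SUBSTRING?
    if is_substring:
        pos += 1
    take("KEYWORD")                              # mandatory KEYWORD
    entries = [entry()]
    while toks[pos][0] != ";":                   # (","? identifier ("=" number)?)*
        if toks[pos][0] == ",":
            pos += 1
        entries.append(entry())
    take(";")
    return is_substring, entries

plain = parse_keyword_declaration("keyword substring, foo = 3;")
modified = parse_keyword_declaration("substring keyword substring keyword = 2;")
```

With the other ordering, the same one-token decision after KEYWORD
would have to guess whether a following "substring" is the modifier or
the first name, which is exactly the edge case that goes unreported in
a hand-written parser.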

And, if you go through it carefully, you can see that you can make a
much more complex set of rules that covers the second case, so that if
you saw the SUBSTRING keyword or your "substring" identifier was
followed by an equals or a comma or a semi-colon it was still legal.
That is a recipe for subtle bugs in users' programs, where a
seemingly innocuous change causes something to go from legal (and
meaning one thing) to illegal or meaning something else.

Or as our wise moderator might put it, you start creating a language
where the user is no longer certain what their code means.

-----------------------------------------------

By the way, we used both parts in Yacc++.  We do have a small number
of reserved words, like yy_eof, which we use to communicate between
the lexer and the parser to indicate the end of the input.  You cannot
use that identifier for any other purpose in our grammars.  However, we
allow keywords like TOKEN to be used also as names of tokens or
non-terminals, and we can tell when you are using it to declare a token
versus using it as the name of a token or non-terminal.  So, most of
our keywords are just that: context-sensitive keywords that only have
reserved meanings in certain contexts.  Otherwise, they are just
identifiers.  I guess one can tell that before starting Compiler
Resources, my partner and I had spent a fair amount of time writing
and using compilers in PL/I dialects.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------
[If anyone wants to know how to find the tokens in old space-insensitive Fortran
I can tell you, but it's as ugly as you might imagine. -John]