Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Keywords and Reserved Words Date: Tue, 8 Mar 2022 21:46:44 +0200 Organization: Compilers Central Lines: 124 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-03-017@comp.compilers> References: <22-03-004@comp.compilers> <22-03-009@comp.compilers> <22-03-015@comp.compilers> <22-03-016@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="80364"; mail-complaints-to="abuse@iecc.com" Keywords: syntax, parse, comment Posted-Date: 08 Mar 2022 15:23:51 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2922 > It is also interesting to see how languages without reserved > words, keep track of which words have the keyword meaning, > and which are ordinary names. This is a topic close to my heart. There are actually a variety of methods that work relatively well in this regard. ----------------------------------------------- First, recognizing keywords, if you are using a language (unlike the original FORTRAN where spaces were not significant), it is useful to follow a model proposed by Frank Deremer, with lexing divided into a scanner and a screener. The scanner has only the responsibility of dividing tokens into discrete entities and labelling those which have fixed types, so keyword recognition is not part of the scanner to the scanner they are just identifiers. The screener then checks the identifier to see if it is a keyword. In most languages, that lookup can easily be done by preloading the symbol table with the identifiers that are keywords. This is not inefficient, because even if the identifier is not a keyword, you need a unique symbol table entry for each identifier in any case, so you are doing that lookup anyway. The only wrinkle here is the case (which we have in Yacc++) where you want your keywords to be case insensitive and your identifier to be case sensitive. You do get an extra lookup in that case, but only the first time an identifier is seen. Once, it is seen, you have marked the symbol table entry with whether it is a keyword or not. ----------------------------------------------- Now, the inverse problem. For keywords that are reserved words. There is nothing more to do. However, if you only want your keywords to be recognized as keywords in specific contexts, then you need the second part of the solution. This is a [set of] parser rule[s] that turns keywords back into identifiers if they aren't being used in a context where they have a special meaning. This rule looks something like this: identifier: IDENTIFIER | IF | THEN | ELSE | WHILE | DO | ... ; That is, you define an identifier non-terminal that matches either an IDENTIFIER token or any of your KEYWORD[ token]s. And, when you want an identifier you use the non-terminal rather than token in your rule. Now, this may cause "conflicts" or "ambiguities" and you resolve that by figuring out which keywords are special in that context and making an alternate idenitiier_in_this_context rule, which omits the conflicting keywords from its list. (And, by the way, this is one reason to use a parser generator, because it will report those conflicts to you and tell you when you have some keyword that is still an issue in some context. The checking that a parser generator does helps you not make mistakes.) And, in most cases the parsers lookahead can figure out whether you mean the keyword or the identifier so that you rarely have to make those extra rules that exclude specific keywords from the list. ----------------------------------------------- The main place one gets in trouble is when you have a [set of] keyword[s] that is/are optional before a list of identifiers. In that case, your identifier list cannot start with one of those keywords. However, a better solution (if you are designing your own language) is to put the optional keywords before a mandatory one. For example, in Yacc++ we have a KEYWORD definition, and a variation on it for SUBSTRING keywords. But we don't use SUBSTRING as a declaration by itself, so the rule for the declaration of a list of keywords looks like: keyword_declaration: SUBSTRING? KEYWORD identifier ("=" number)? (","? identifier ("=" number)?)* ";" ; If we did it in the other order, we couldn't allow substring as the first identifier in the list. That is: keyword_declaration: KEYWORD SUBSTRING? identifier_but_not_substring ("=" number)? (","? identifier ("=" number)?)* ";" ; And, note, if we had hand-written a recursive descent parser, we wouldn't have been warned of that edge case. Either we would have mis-parsed it or we would have that weird exception in our language. Neither is ideal in my point of view. And, if you go through it carefully, you can see that you can make a much more complex set of rules that covers the second case, so that if you saw the SUBSTRING keyword or your "substring" identifier was followed by an equals or a comma or a semi-colon it was still legal. Something to introduce subtle bugs in users' programs where a seemingly innocuous change causes something to go from legal (and meaning one thing) to illegal or meaning something else. Or as our wise moderator might put it, you start creating a language where the user is no longer certain what their code means. ----------------------------------------------- By the way, we used both parts in Yacc++. We do have a small number of reserved words, like yy_eof which we use to communicate between lexer and the parser to indicate the end of the input. You cannot use that identifier for any other purpose in our grammars. However, we allow keywords like TOKEN to be used also as names of tokens or non-terminals and we can tell when you are using it to declare a token versus using it as the name of a token or non-terminal. So, most of our keywords are just that context sensitive keywords that only have reserved meanings in certain contexts. Otherwise, they are just identifiers. I guess one can tell that before starting compiler resources, my partner and I had spent a fair amount of time writing and using compilers in PL/I dialects. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------ [If anyone wants to know how to find the tokens in old space-insensitive Fortran I can tell you, but it's as ugly as you might imagine. -John]