From: Christopher F Clark
Newsgroups: comp.compilers
Subject: Re: Languages with optional spaces
Date: Sat, 29 Feb 2020 11:48:41 +0200
Organization: Compilers Central
Message-ID: <20-02-033@comp.compilers>
References: <20-02-015@comp.compilers> <20-02-017@comp.compilers>
Keywords: lex, history
Posted-Date: 29 Feb 2020 12:33:24 EST

"Ev. Drikos" posted an interesting, albeit partial, solution to the
problem of keywords being part of identifiers in languages with
optional spaces. I won't include it here.

The problem is that some keywords can appear at places other than the
beginning of an identifier. In fact, in the worst-case scenario, the
language can be ambiguous. Consider the following "BASIC" program, in a
dialect extended so that variable names can be more than one letter
long and spaces are optional.

10 LET ITOJ = 1
20 LET I = 2
30 LET JTOK = 3
40 LET K = 4
50 FOR N = ITOJTOK
60 REM AMBIGUOUS FOR N = I TO JTOK
70 REM OR FOR N = ITOJ TO K
80 PRINT N;
90 NEXT N
100 END

The problem with such solutions is that one is tempted to "fix" the
cases one by one as they are encountered. Maury Markowitz mentioned
this in his post where ATO was considered. It could be A TO or AT O
(presuming that TO and AT are both keywords). Note that this is an
issue even with one-letter variable names if the language has both
keywords.

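To make the ambiguity in line 50 concrete, here is a small brute-force
sketch in Python that lists every way a run of letters can be split
into keywords and names. It is only an illustration, not anyone's
actual lexer: the function names tokenizations and for_ranges are made
up for this example, and it assumes that a name may contain a keyword
but may not be exactly a keyword, and that two names never sit side by
side without a keyword between them.

```python
from typing import Collection, Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text); kind is "KW" or "ID"


def tokenizations(text: str, keywords: Collection[str]) -> Iterator[List[Token]]:
    """Yield every split of `text` into keywords (KW) and names (ID).

    Assumptions of this sketch: a name may contain a keyword but may not
    equal one, and two names are never adjacent.
    """
    def walk(pos: int, prev_was_id: bool) -> Iterator[List[Token]]:
        if pos == len(text):
            yield []
            return
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            if piece in keywords:
                # Read the piece as a keyword.
                for rest in walk(end, False):
                    yield [("KW", piece)] + rest
            elif not prev_was_id:
                # Read the piece as a name (never directly after a name).
                for rest in walk(end, True):
                    yield [("ID", piece)] + rest

    yield from walk(0, False)


def for_ranges(text: str, keywords: Collection[str] = frozenset({"TO"})) -> List[Tuple[str, str]]:
    """Return (start, limit) pairs for readings shaped like `name TO name`."""
    return [
        (toks[0][1], toks[2][1])
        for toks in tokenizations(text, keywords)
        if [kind for kind, _ in toks] == ["ID", "KW", "ID"] and toks[1][1] == "TO"
    ]


if __name__ == "__main__":
    print(for_ranges("ITOJTOK"))  # [('I', 'JTOK'), ('ITOJ', 'K')]
    for toks in tokenizations("ATO", {"TO", "AT"}):
        print(toks)
```

Both readings of ITOJTOK come back as well-formed FOR ranges, so
nothing at the lexical level can choose between them. The enumeration
is exponential in the worst case, so it is suitable only for showing
the problem on short inputs.
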
As one starts patching up these cases, the "grammar" (or its recursive
descent implementation, most likely) begins to become what I call "ad
hack".

With a GLR parser (or something equivalent in power, e.g. an Earley
parser or CYK) and a lexer that returns all possible sets of
tokenizations, one can find all the relevant parse trees and then see
if only one makes semantic sense. In the above example, that won't help,
as both interpretations are legal programs. One prints 2 3, the other
1 2 3 4.

I cannot imagine a programmer being happy with the error message:

LINE 50 AMBIGUOUS STATEMENT.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------
[I get the impression that more often than not, whoever wrote the
interpreter didn't give it much thought so the grammar is whatever the
6502 code did thirty years ago. Fortran was ugly but at least it wasn't
ambiguous and at each point the lexer knew what tokens were valid. -John]