Path: csiph.com!xmission!news.alt.net!feeder.usenetexpress.com!tr1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Re: Languages with optional spaces Date: Sun, 1 Mar 2020 10:07:52 +0200 Organization: Compilers Central Lines: 108 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <20-03-002@comp.compilers> References: <20-02-015@comp.compilers> <20-02-017@comp.compilers> <20-02-033@comp.compilers> <20-02-034@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="54277"; mail-complaints-to="abuse@iecc.com" Keywords: lex, design, comment Posted-Date: 01 Mar 2020 12:52:31 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2473 I'm sorry if my reply sounded like a criticism of Ev's post. I didn't intend it to be. It was more pointing out that when one does things by hand, one often makes things that are challenging to formalize. I noticed this particular aspect as I was trying to imagine what would need to add to the lexer generator in Yacc++ (you can use lex (or flex) with it, but it also includes its own model which has extensions so that you can write LR rules in your lexer too and a few other things to minimize conflicts, integrate with a symbol table, etc). One of our goals with writing it was to capture in a formalizable way some of the important "hacks" one often used to solve problems (e.g. how to do PL/I style keywords). And this problem looked like one where if the hack could be distilled, it would make sense to do so. Not that I am a big fan of languages with optional spaces, but that doesn't mean that they might not represent something people need to handle. But John, our moderator, and Dodi are right in a particular sense. Certain things were done informally by hand "in the old days". And this actually points at another possible direction to a solution, PEGs (parsing expression grammars). It's a different way of attacking the ambiguous language problem. In PEGs, "or" constructs are not un-ordered, but instead you take them in a specific order the grammar writer chooses and the first one that applies is the correct one. You get somewhat the same effect with lex, where if two rules match you use the one specified lexically first. The only difference is that in lex, the longest match rule overrides that, where in a PEG, the ordering rule overrides (and you control the ordering at the caller level, by how you write the "or" not by the order of the rules that the "or" invokes). In my ambiguous example, a PEG like construct where the first "TO" is the rule that matches is probably not that hard to write. And, PEG rules can actually be added to LL and LR grammars. They came from syntactic predicates that Terence Parr invented (and we added those to Yacc++ by using the idea of "Tunnel Automata" pioneered by Josef Gro"lsch). If one goes down that path, it might be possible to actually capture what people have done by hand. I don't recall whether we added them to the lexer portion of the generator. So, now it's on my to-do list. On Sat, Feb 29, 2020 at 11:48 AM Christopher F Clark wrote: > > "Ev. Drikos" posted an interesting albeit partial solution > to the problem of keywords being part of identifiers in languages with > optional spaces. > I won't include it here. > > The problem is that some keywords can appear at places other than the > beginning of an identifier. > In fact, in the worst case scenario, the language can be ambiguous. > Consider the following "BASIC" program extended with variables that > are more than one letter long > and spaces being optional. > > 10 LET ITO = 1 > 20 LET I = 2 > 30 LET JTOK = 3 > 40 LET K = 4 > 50 FOR N = ITOJTOK > 60 REM AMBIGUOUS FOR N = I TO JTOK > 70 REM OR FOR N = ITOJ TO K > 80 PRINT N; > 90 NEXT N > 100 END > > The problem with such solutions is one is tempted to "fix" them one by > one as they are encountered. > > Maury Markowitz mentioned this in his post > where ATO was considered. > It could be A TO or AT O (presuming that TO and AT are both keywords) > Note that this is even an issue with 1 letter variable names if one > has both keywords. > > As one starts patching up these cases, the "grammar" > (or its recursive descent implementation most likely) > begins to become what I call "ad hack". > > With a GLR parser (or something equivalent in power, e.g. an Earley > parser or CYK) and > a lexer that returns all possible sets of tokenizations one can find > all the relevant parse > trees and then see if only 1 makes semantic sense. > > In the above example, that won't help as both interpretations are > legal programs. > One prints 2 3, the other 1 2 3 4. > > I cannot imagine a programmer being happy with the error message: > LINE 50 AMBIGUOUS STATEMENT. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------ [So how do you lex PL/I? It's not ambiguous, and I think you never need more than one token lookahead to decide whether a string of letters that looks like a keyword is instead a variable name as in IF IF = THEN; THEN ELSE = IF; ELSE THEN = IF; -John]