Path: csiph.com!xmission!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Re: Languages with optional spaces
Date: Sat, 29 Feb 2020 11:48:41 +0200
Organization: Compilers Central
Lines: 60
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <20-02-033@comp.compilers>
References: <20-02-015@comp.compilers> <20-02-017@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="16923"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, history
Posted-Date: 29 Feb 2020 12:33:24 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2470

"Ev. Drikos" <drikosev@gmail.com> posted an interesting albeit partial solution
to the problem of keywords being part of identifiers in languages with
optional spaces.
I won't include it here.

The problem is that some keywords can appear at places other than the
beginning of an identifier.
In fact, in the worst case, the language can be ambiguous.
Consider the following "BASIC" program, in a dialect extended so that
variable names can be more than one letter long and spaces are
optional.

10 LET ITOJ = 1
20 LET I = 2
30 LET JTOK = 3
40 LET K = 4
50 FOR N = ITOJTOK
60 REM AMBIGUOUS FOR N = I TO JTOK
70 REM OR FOR N = ITOJ TO K
80 PRINT N;
90 NEXT N
100 END

The problem with such partial solutions is that one is tempted to
"fix" the ambiguous cases one by one as they are encountered.

Maury Markowitz <maury.markowitz@gmail.com> mentioned this in his post
where ATO was considered.
It could be A TO or AT O (presuming that TO and AT are both keywords).
Note that this is an issue even with one-letter variable names if the
language has both keywords.

As one starts patching up these cases, the "grammar"
(or its recursive descent implementation most likely)
begins to become what I call "ad hack".

With a GLR parser (or something equivalent in power, e.g. an Earley
parser or CYK) and a lexer that returns all possible tokenizations,
one can find all the relevant parse trees and then see whether only
one of them makes semantic sense.

In the above example, that won't help as both interpretations are
legal programs.
One prints 2 3, the other 1 2 3 4.
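
To make the "all possible tokenizations" idea concrete, here is a
minimal Python sketch. It is not taken from any real BASIC or GLR
implementation; the names tokenizations and for_ranges are made up for
illustration, and it assumes that identifiers consist only of letters,
that a keyword can never itself be an identifier, and that a FOR range
is simply "identifier TO identifier".

```python
from typing import Iterator, List, Set, Tuple

Token = Tuple[str, str]  # (kind, text), kind is "KW" or "ID"


def tokenizations(text: str, keywords: Set[str]) -> Iterator[List[Token]]:
    """Yield every split of text into keywords and identifiers."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        kind = "KW" if head in keywords else "ID"
        for rest in tokenizations(text[end:], keywords):
            # Two adjacent identifiers are really one longer identifier,
            # so that split is not a distinct tokenization.
            if kind == "ID" and rest and rest[0][0] == "ID":
                continue
            yield [(kind, head)] + rest


def for_ranges(text: str) -> List[Tuple[str, str]]:
    """Keep only the tokenizations shaped like: identifier TO identifier."""
    return [
        (toks[0][1], toks[2][1])
        for toks in tokenizations(text, {"TO"})
        if [kind for kind, _ in toks] == ["ID", "KW", "ID"]
    ]


if __name__ == "__main__":
    for lo, hi in for_ranges("ITOJTOK"):
        print("FOR N =", lo, "TO", hi)
```

For ITOJTOK the filter leaves two candidates, I TO JTOK and ITOJ TO K,
which is exactly the ambiguity in line 50. Calling tokenizations("ATO",
{"AT", "TO"}) likewise yields A TO, AT O, and the plain identifier ATO.
The enumeration is exponential in the worst case, so a real lexer would
want a token lattice or memoization rather than this brute force.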

I cannot imagine a programmer being happy with the error message:
LINE 50 AMBIGUOUS STATEMENT.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------
[I get the impression that more often than not, whoever wrote the interpreter
didn't give it much thought so the grammar is whatever the 6502 code did thirty
years ago. Fortran was ugly but at least it wasn't ambiguous and at each
point the lexer knew what tokens were valid. -John]