Path: csiph.com!xmission!news.alt.net!feeder.usenetexpress.com!tr1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Re: Languages with optional spaces
Date: Sun, 1 Mar 2020 10:07:52 +0200
Organization: Compilers Central
Lines: 108
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <20-03-002@comp.compilers>
References: <20-02-015@comp.compilers> <20-02-017@comp.compilers> <20-02-033@comp.compilers> <20-02-034@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="54277"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, design, comment
Posted-Date: 01 Mar 2020 12:52:31 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2473

I'm sorry if my reply sounded like a criticism of Ev's post.  I didn't
intend it to be.  It was more pointing out that when one does things
by hand, one often makes things that are challenging to formalize.

I noticed this particular aspect as I was trying to imagine what would
need to add to the lexer generator in Yacc++ (you can use lex (or
flex) with it, but it also includes its own model which has extensions
so that you can write LR rules in your lexer too and a few other
things to minimize conflicts, integrate with a symbol table, etc).
One of our goals with writing it was to capture in a formalizable way
some of the important "hacks" one often used to solve problems (e.g.
how to do PL/I style keywords).

And this problem looked like one where if the hack could be distilled,
it would make sense to do so.  Not that I am a big fan of languages
with optional spaces, but that doesn't mean that they might not
represent something people need to handle.

But John, our moderator, and Dodi are right in a particular sense.
Certain things were done informally by hand "in the old days".  And
this actually points at another possible direction to a solution, PEGs
(parsing expression grammars).  It's a different way of attacking the
ambiguous language problem.  In PEGs, "or" constructs are not
un-ordered, but instead you take them in a specific order the grammar
writer chooses and the first one that applies is the correct one.  You
get somewhat the same effect with lex, where if two rules match you
use the one specified lexically first.  The only difference is that in
lex, the longest match rule overrides that, where in a PEG, the
ordering rule overrides (and you control the ordering at the caller
level, by how you write the "or" not by the order of the rules that
the "or" invokes).

In my ambiguous example, a PEG like construct where the first "TO" is
the rule that matches is probably not that hard to write.

And, PEG rules can actually be added to LL and LR grammars.  They came
from syntactic predicates that Terence Parr invented (and we added
those to Yacc++ by using the idea of "Tunnel Automata" pioneered by
Josef Gro"lsch).  If one goes down that path, it might be possible to
actually capture what people have done by hand.  I don't recall
whether we added them to the lexer portion of the generator.    So,
now it's on my to-do list.

On Sat, Feb 29, 2020 at 11:48 AM Christopher F Clark
<christopher.f.clark@compiler-resources.com> wrote:
>
> "Ev. Drikos" <drikosev@gmail.com> posted an interesting albeit partial solution
> to the problem of keywords being part of identifiers in languages with
> optional spaces.
> I won't include it here.
>
> The problem is that some keywords can appear at places other than the
> beginning of an identifier.
> In fact, in the worst case scenario, the language can be ambiguous.
> Consider the following "BASIC" program extended with variables that
> are more than one letter long
> and spaces being optional.
>
> 10 LET ITO = 1
> 20 LET I = 2
> 30 LET JTOK = 3
> 40 LET K = 4
> 50 FOR N = ITOJTOK
> 60 REM AMBIGUOUS FOR N = I TO JTOK
> 70 REM OR FOR N = ITOJ TO K
> 80 PRINT N;
> 90 NEXT N
> 100 END
>
> The problem with such solutions is one is tempted to "fix" them one by
> one as they are encountered.
>
> Maury Markowitz <maury.markowitz@gmail.com> mentioned this in his post
> where ATO was considered.
> It could be A TO or AT O (presuming that TO and AT are both keywords)
> Note that this is even an issue with 1 letter variable names if one
> has both keywords.
>
> As one starts patching up these cases, the "grammar"
> (or its recursive descent implementation most likely)
> begins to become what I call "ad hack".
>
> With a GLR parser (or something equivalent in power, e.g. an Earley
> parser or CYK) and
> a lexer that returns all possible sets of tokenizations one can find
> all the relevant parse
> trees and then see if only 1 makes semantic sense.
>
> In the above example, that won't help as both interpretations are
> legal programs.
> One prints 2 3, the other 1 2 3 4.
>
> I cannot imagine a programmer being happy with the error message:
> LINE 50 AMBIGUOUS STATEMENT.
--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------
[So how do you lex PL/I?  It's not ambiguous, and I think you never need more than
one token lookahead to decide whether a string of letters that looks like a
keyword is instead a variable name as in

IF IF = THEN; THEN ELSE = IF; ELSE THEN = IF;

-John]