Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: RE: Why no shift-shift conflicts.
Date: Sun, 30 Jan 2022 03:14:05 +0200
Organization: Compilers Central
Lines: 130
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-01-120@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="17138"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, LR(1)
Posted-Date: 29 Jan 2022 20:41:01 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2877

Our esteemed moderator was right in saying that this is part of how
the LR family of algorithms work. In state machines, you can build a
state that represents two (or more) rules (productions). So,
shift-shift conflicts don't occur in the LR machine. You only care
about conflicts when you need to reduce. More on that a bit later in
this reply.

In LL(k) algorithms, you don't get to do that (past the k-th token of
lookahead at the beginning of the rule. So, when your LL(k) parser
generator tells you that you need to left-factor your two rules. And,
it usually does so by declaring your grammar as "ambiguous" which it
might not be, just not LL(k). It has actually declared a shift-shift
conflict, just not labelling it as that, as the LL people don't think
of advancing the token index through the rule as shifting (although in
an LR parser, that's what it is called because of how the stack is
handled in LR parsing).

Next, I want to talk about this example:

X: '1' { a = a + 1; } '2' 'x';
Y: '1' { a = a - 1; } '2' 'y';
Z: '1' '2' 'z' { a = 0 };

[The mid-rule action is a cheat, a shorthand for this:

X: '1' x1 '2' 'x' ;
x1: { a = a + 1; } ;

Y: '1' y1 '2' 'y' ;
y1: { a = a - 1; } ;

That's why those actions create conflicts where there were none before.

-John]

The cheat as our moderator calls it, is *somewhat* an artifact of the
traditional yacc/bison implementation of it. You can build a state
machine model of the rules (we do in "Yacc++", the one by compiler
resources as opposed to the C++ port of yacc, which is often refered
to by the same name although not usually in initial cap spelling. I
only mention the distinction lest someone use the latter and say, wait
it doesn't do that) where you haven't introduced those artificial
non-terminals (x1 and y1) and thus don't have to reduce them. If so,
you can make the conflict "go away" at least in terms of what the
parser needs to do.

However, that in itself is somewhat misleading. In that, while you can
make that work at the grammar level, you will have changed when the
action code runs. You will postpone running it until after the lexer
has processed and returned to the parser both the "2" token and the
"x" or "y" token following it. If one is depending upon that action to
influence how the lexer works (e.g. changing the lexer state or
something similar), the action will occur too late.

Thus, although Yacc++ can move the action code later in the processing
of the parser's state machine, it still wants to report a
warning/error to tell the user that the action had to be delayed as
the two actions were in conflict when it shifted the "1" token. Now,
we don't call this a shift-shift conflict, but rather an "action code
conflict" because it is the two actions that are in conflict, not the
shifting. And, worth noting, if the two actions are the same
(textually), we don't even move it, because both transitions want to
do the same thing and there is no "action code conflict".

Now, getting back to LL(k) parsers, they mask this problem. Because
before they start running a rule, they get to look-ahead (i.e. read
from the lexer) up to k tokens, so they will get the "1" "2" and "x"
or "y" tokens (k=3) before running the rule at all. Thus, they have
the same issue in interacting with the lexer, but haven't told the
grammar writer about it.

Now, about conflicts at reduce time. At reduce time, an LR(k) parser
gets to read up to k tokens of lookahead before deciding whether to
reduce or not. If the generator is LALR(k), there are some additional
restrictions on that, but not that important to what I am going to
say. So, in this case, a LR parser generator can look k tokens past
the end of the rule (or past the action code, with the above mentioned
caveat) before declaring a conflict.

And, particularly worth noting are the GLR variants of the LR
algorithm. In that case, the parser generator can simply generate a
parse forest, effectively postponing the decision (potentially until
the entire file has been processed) and only then noting that one
still has a forest and not a tree that the grammar is ambiguous. Now,
this has the positive effect of eliminating spurious conflict
messages, because the parser can handle the grammar by generating a
parse forest if needed. But, it also has the disadvantage of removing
the check that at parser generation time the grammar has been detected
as being potentially ambiguous. Now, for some grammars, this can be
resolved and the grammar be shown to be ambiguous or not. However, in
general, such detection is equivalent to solving the halting problem.
(The proof of that being true is beyond what I can explain.)

Finally, I want to mention something about Syntactic (and Semantic)
predicates, as they were a part of this discussion earlier. They are
somewhat an alternate solution to this problem. In fact, with the
right model, they can be used to turn (even an LL) parser generator
into a "universal turing machine", because they allow one to write
backtracking parsers (with unlimited backtracking) and the allow one
to effectively give oneself unlimited look-ahead before deciding which
rule to apply and at the same time fine grained control over the order
the rules are examined in. This is perhaps Terence Parr's greatest
invention. In fact, it is so important that a whole new branch of
parsing technologies "Parsing Expression Grammars" (PEGs) have evolved
from it. Now, without the proper checks, it still has the problem of
allowing ambiguous grammars, but with the twist that it guarantees an
unambiguous parse tree (not a parse forest).

Now, at some point, all of these advances in parsing technology will
be merged (it is on my to do list, to do so, but I keep having other
things to finish first). If done right, one potentially has a parser
generator that will warn you if your grammar is ambiguous and yet
allow you to process it in an unambiguous manner or get out a parse
forest of all the possible interpretations (with fine grained control
to get exactly the level of ambiguity you want handled in each
manner). And, if really done right, it will give you something that
looks like hand- written recursive descent code in most cases, and
will allow you to edit that code and get the grammar tweaks back with
the appropriate sanity checks. We have the theory to do so, but it
isn't a one month project, so it remains just a dream. Of course, I
won't be surprised if there are still people hacking on hand-written
recursive descent parsers even if the dream comes to fruition.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------