Path: csiph.com!xmission!news.snarked.org!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: FIRST_k, FOLLOW_k, k&gt;1
Date: Fri, 7 Feb 2020 16:12:43 +0200
Organization: Compilers Central
Lines: 135
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <20-02-006@comp.compilers>
References: <20-02-004@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="78562"; mail-complaints-to="abuse@iecc.com"
Keywords: parse
Posted-Date: 07 Feb 2020 22:17:31 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2444

> If LL(l) and LR(k) need sets FIRST_k, FOLLOW_k, k>1, for example LR(3) need this sets of degree 3?
> How make it? How is the best structure for these sets? I think about pyramid:
...
> For example, not bit set but set of trees or DFA's ?

I've been meaning to write a paper on this for awhile, so I'll give
you a very rough draft here.
We implemented a version of it in Yacc++, although my understanding of
it was much more
crude at that time.

Everything you need to know is in the LR(0) machine, which is
essentially a DFA (with a stack).
The problem is you need to disentangle it.

Let's take a reduce-reduce conflict.  You are in some state. I diagram
states by showing their
dotted item, using Yacc++ notation for the rules.
That state looks something like this:

state r-r:
      x: a b c .;  // the dotted item says we will reduce x in this state
      y: z c.;  // we also reduce a y in this state.

So, now we need to know where the goto-function will take us if we
reduce an x and where it
will take us if we reduce a y.
Now, in Yacc++ we know where we started the productions for x and y,
we call them dot-states
(as they are the states with the dotted item before an x or a y).
There is often more than one dot state for a given non-terminal.  But,
we will assume just one
for ease of exposition.

state dot-x:  // There will often be a set of these, one for every
context in which x was used.
      x1:  a .x b;  // if see an x, goto state goto-x;

state goto-x:
      x1: a x .b c;  // b is the lookahead if you see x.

state dot-y:
      y1: a b .y b; // if see a y, goto state goto-y;

state goto-y:
      y1: a b y .b d; // b is the lookahead for y.

If you look carefully, you see that the goto states have dots before
(transitions on) exactly the
tokens that are lookaheads for the reductions that conflict.
And we have a conflict because the token b is in both states (sets).

If the two states are disjoint, have no tokens in common, then there
will be no conflict the tokens
in goto-x will be the lookaheads for reducing x and the tokens in
goto-y will be the lookaheads
for reducing y.

Now, if both states goto-x and goto-y are the same, it is an ambiguity
not just a conflict.
There will be two ways to parse some sentence (with an x or a y),
because once you get to the
shared goto state, the parse will be the same for both.

However, if there is a conflict (some tokens appear in both goto-x and
goto-y and they are not
the same state), then you need to look for another token of lookahead.
You do that by taking the transition(s) on the common tokens and
seeing what that state that
takes you to.

state goto-x-b:
      x1: a x b .c;  // bc is the lookahead if you see x.

state goto-y-b:
      y1: a b y b .d; // bd is the lookahead if you see y

You continue doing this until you have disambiguated the lookahead or
you have proven the
grammar is truly ambiguous for some conflict.

There are a few more considerations.

Now, if one of those goto states is also a reduce state, you have to
apply the rules to backtrack
through calling rule.

e.g.

state dot-y;
    y1: a b .y;

state dot-y1;:
    y2: c .y1 b;

state goto-y1:
    y2: c y1 .b;

If you merged states (ala LALR), you may need to un-merge (split) them
to resolve conflicts,
that are LALR only and not-LR.

You also have to deal with loops in the states.
If there is a finite k lookahead that solves the conflict, this is trivial.
You simply are doing an induction on k as you step through.
However, there are grammars that would be GLR parsable (and not
ambiguous) that include
loops.
That takes more work to recognize, although I am convinced that you
can compute a closure
that resolves it (and when I have proven that I will do the formal
paper on this).
What you are building could be called a lookahead automaton,
especially if you build something
that accepts regular expressions (or even more powerfully LR-type grammars).

But, this is the basic outline.

You can even use this to construct an LL (or recursive descent)
parser.  You simply need to remember to do "calls" when you get to
states where you have computed a closure and you are processing one of
the non-terminals you have expanded in doing so.  We do this in Yacc++
so we can handle variable length productions in ELR grammars without
backtracking, counting, or lookbacks.

And, you need one more trick if you have nullable productions (that's
another paper I need to write).

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------