Groups > comp.compilers > #2442 > unrolled thread

FIRST_k, FOLLOW_k, k>1

Started by	Andy <borucki.andrzej@gmail.com>
First post	2020-02-06 10:43 -0800
Last post	2020-02-08 18:18 -0800
Articles	7 — 4 participants

Back to article view | Back to comp.compilers

  FIRST_k, FOLLOW_k, k>1 Andy <borucki.andrzej@gmail.com> - 2020-02-06 10:43 -0800
    Re: FIRST_k, FOLLOW_k, k>1 Andy <borucki.andrzej@gmail.com> - 2020-02-06 14:16 -0800
      Re: FIRST_k, FOLLOW_k, k>1 Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-02-08 11:00 +0100
        Re: FIRST_k, FOLLOW_k, k>1 Andy <borucki.andrzej@gmail.com> - 2020-02-08 11:54 -0800
          Re: FIRST_k, FOLLOW_k, k>1 Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-02-09 02:46 +0100
    FIRST_k, FOLLOW_k, k&gt;1 Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-02-07 16:12 +0200
    Re: FIRST_k, FOLLOW_k, k>1 honey crisis <gaztoast@gmail.com> - 2020-02-08 18:18 -0800

#2442 — FIRST_k, FOLLOW_k, k>1

From	Andy <borucki.andrzej@gmail.com>
Date	2020-02-06 10:43 -0800
Subject	FIRST_k, FOLLOW_k, k>1
Message-ID	<20-02-004@comp.compilers>

If LL(l) and LR(k) need sets FIRST_k, FOLLOW_k, k>1, for example LR(3) need this sets of degree 3?
How make it? How is the best structure for these sets? I think about pyramid:
- one bit - is epsilon
- bit table size n
- bit table size n^2
...
- bit table size n^k

because sets degree of k also have shorter strings.

This exponentially grows, in real grammars, number of tokens can be
quite big, ~80, It need then 0(80^k) bits. This sets will sparse and
is better organization of substrings in sets? For example, not bit set
but set of trees or DFA's ?

[toc] | [next] | [standalone]

#2443

From	Andy <borucki.andrzej@gmail.com>
Date	2020-02-06 14:16 -0800
Message-ID	<20-02-005@comp.compilers>
In reply to	#2442

I search examples,
E->aWbXYcdZ
W->w
X->x
Y->y
Z->z

if for FIRST(k=4) wil be: E={awbx} W={wbxy} X={xycd} Y={ycdz} Z={z}
what is convention?
E is obviously {awbx} but W is (difficult){wbxy} or only {w}?
FIRST(k=4)(W) we compute for alone W or W in context whole E?

similar:
E->aWbXYcdZ
W->w
X->x
Y->y
Y->
Z->z
;y->epsilon
will?
FIRST3: E={awb} W={wbx} X={xyc xcd} Y={ycd cdz} Z={z} ;dotyczy też X
FIRST4: E={awbx} W={wbxy wbxc} X={xycd xcdz} Y={ycdz cdz} Z={z}

E->aWbXYcdZ
>E->X
W->w
X->x
Y->y
Y->
Z->z
will?
FIRST1: E={a x} W={w} X={x} Y={y c} Z={z}
FIRST2: E={aw x} W={wb} X={xy xc x} Y={yc cd} Z={z}
FIRST3: E={awb x} W={wbx} X={xyc xcd x} Y={ycd cdz} Z={z}
FIRST4: E={awbx x} W={wbxy wbxc} X={xycd xcdz x} Y={ycdz cdz} Z={z}
FOLLOW1: E={$} W={b} X={y c $} Y={c} Z={$}
FOLLOW2: E={$} W={bx} X={yc cd $} Y={cd} Z={$}
FOLLOW3: E={$} W={bxy bxc} X={ycd cdz $} Y={cdz} Z={$}
FOLLOW4: E={$} W={bxyc bxcd} X={ycdz cdz $} Y={cdz} Z={$}

etc

[toc] | [prev] | [next] | [standalone]

#2445

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2020-02-08 11:00 +0100
Message-ID	<20-02-007@comp.compilers>
In reply to	#2443

Am 06.02.2020 um 23:16 schrieb Andy:
> I search examples,
> E->aWbXYcdZ
> W->w
> X->x
> Y->y
> Z->z
>
> if for FIRST(k=4) wil be: E={awbx} W={wbxy} X={xycd} Y={ycdz} Z={z}
> what is convention?

Your grammar definitely is LL(1). You should provide a grammar that
requires longer lookahead.

IMO the FIRST set covers all *different* sequences, in your case
FIRST(E)={a} hence LL(1).
Only if there exist multiple alternatives starting with 'a' they have to
be listed as e.g. FIRST(E)={ax, ay}.

Dunno about the FOLLOW set, perhaps it can stay LL(1)?

DoDi

[toc] | [prev] | [next] | [standalone]

#2446

From	Andy <borucki.andrzej@gmail.com>
Date	2020-02-08 11:54 -0800
Message-ID	<20-02-008@comp.compilers>
In reply to	#2445

W dniu sobota, 8 lutego 2020 15:36:39 UTC+1 użytkownik Hans-Peter Diettrich
napisał:
> IMO the FIRST set covers all *different* sequences, in your case
> FIRST(E)={a} hence LL(1).
> Only if there exist multiple alternatives starting with 'a' they have to
> be listed as e.g. FIRST(E)={ax, ay}.

Suppose that A is non-starting nonterminal.
FIRST_k(A) is set of token strings lengths 1..k derived from A or derived from
start symbol and begins at A begin?
for example, if we have E->AB, First_4(A) are also strings from B? (if in
First_4(A) exists shorter strings than 4)
It seems that in both situation Follow_4(A) will complicated and contains all
strings from B,C,D... for E->ABCD (if exists in A,B,C shorter stings)

[toc] | [prev] | [next] | [standalone]

#2447

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2020-02-09 02:46 +0100
Message-ID	<20-02-009@comp.compilers>
In reply to	#2446

Am 08.02.2020 um 20:54 schrieb Andy:
> W dniu sobota, 8 lutego 2020 15:36:39 UTC+1 użytkownik Hans-Peter
Diettrich
> napisał:
>> IMO the FIRST set covers all *different* sequences, in your case
>> FIRST(E)={a} hence LL(1).
>> Only if there exist multiple alternatives starting with 'a' they have to
>> be listed as e.g. FIRST(E)={ax, ay}.
>
> Suppose that A is non-starting nonterminal.
> FIRST_k(A) is set of token strings lengths 1..k derived from A or derived
from
> start symbol and begins at A begin?

What's the purpose of exploding the FIRST and FOLLOW sets of a LL(1)
grammar into something more complicated?

IMO you misunderstand the classification of LL grammars and languages :-(

DoDi

[toc] | [prev] | [next] | [standalone]

#2444 — FIRST_k, FOLLOW_k, k>1

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2020-02-07 16:12 +0200
Subject	FIRST_k, FOLLOW_k, k>1
Message-ID	<20-02-006@comp.compilers>
In reply to	#2442

> If LL(l) and LR(k) need sets FIRST_k, FOLLOW_k, k>1, for example LR(3) need this sets of degree 3?
> How make it? How is the best structure for these sets? I think about pyramid:
...
> For example, not bit set but set of trees or DFA's ?

I've been meaning to write a paper on this for awhile, so I'll give
you a very rough draft here.
We implemented a version of it in Yacc++, although my understanding of
it was much more
crude at that time.

Everything you need to know is in the LR(0) machine, which is
essentially a DFA (with a stack).
The problem is you need to disentangle it.

Let's take a reduce-reduce conflict. You are in some state. I diagram
states by showing their
dotted item, using Yacc++ notation for the rules.
That state looks something like this:

state r-r:
x: a b c .; // the dotted item says we will reduce x in this state
y: z c.; // we also reduce a y in this state.

So, now we need to know where the goto-function will take us if we
reduce an x and where it
will take us if we reduce a y.
Now, in Yacc++ we know where we started the productions for x and y,
we call them dot-states
(as they are the states with the dotted item before an x or a y).
There is often more than one dot state for a given non-terminal. But,
we will assume just one
for ease of exposition.

state dot-x: // There will often be a set of these, one for every
context in which x was used.
x1: a .x b; // if see an x, goto state goto-x;

state goto-x:
x1: a x .b c; // b is the lookahead if you see x.

state dot-y:
y1: a b .y b; // if see a y, goto state goto-y;

state goto-y:
y1: a b y .b d; // b is the lookahead for y.

If you look carefully, you see that the goto states have dots before
(transitions on) exactly the
tokens that are lookaheads for the reductions that conflict.
And we have a conflict because the token b is in both states (sets).

If the two states are disjoint, have no tokens in common, then there
will be no conflict the tokens
in goto-x will be the lookaheads for reducing x and the tokens in
goto-y will be the lookaheads
for reducing y.

Now, if both states goto-x and goto-y are the same, it is an ambiguity
not just a conflict.
There will be two ways to parse some sentence (with an x or a y),
because once you get to the
shared goto state, the parse will be the same for both.

However, if there is a conflict (some tokens appear in both goto-x and
goto-y and they are not
the same state), then you need to look for another token of lookahead.
You do that by taking the transition(s) on the common tokens and
seeing what that state that
takes you to.

state goto-x-b:
x1: a x b .c; // bc is the lookahead if you see x.

state goto-y-b:
y1: a b y b .d; // bd is the lookahead if you see y

You continue doing this until you have disambiguated the lookahead or
you have proven the
grammar is truly ambiguous for some conflict.

There are a few more considerations.

Now, if one of those goto states is also a reduce state, you have to
apply the rules to backtrack
through calling rule.

e.g.

state dot-y;
y1: a b .y;

state dot-y1;:
y2: c .y1 b;

state goto-y1:
y2: c y1 .b;

If you merged states (ala LALR), you may need to un-merge (split) them
to resolve conflicts,
that are LALR only and not-LR.

You also have to deal with loops in the states.
If there is a finite k lookahead that solves the conflict, this is trivial.
You simply are doing an induction on k as you step through.
However, there are grammars that would be GLR parsable (and not
ambiguous) that include
loops.
That takes more work to recognize, although I am convinced that you
can compute a closure
that resolves it (and when I have proven that I will do the formal
paper on this).
What you are building could be called a lookahead automaton,
especially if you build something
that accepts regular expressions (or even more powerfully LR-type grammars).

But, this is the basic outline.

You can even use this to construct an LL (or recursive descent)
parser. You simply need to remember to do "calls" when you get to
states where you have computed a closure and you are processing one of
the non-terminals you have expanded in doing so. We do this in Yacc++
so we can handle variable length productions in ELR grammars without
backtracking, counting, or lookbacks.

And, you need one more trick if you have nullable productions (that's
another paper I need to write).

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#2448

From	honey crisis <gaztoast@gmail.com>
Date	2020-02-08 18:18 -0800
Message-ID	<20-02-010@comp.compilers>
In reply to	#2442

Antlr used to use trees (don't know if it still does) to represent its lookahead with LL(k)

This paper gets into it:
https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1311&context=ecetr

[toc] | [prev] | [standalone]

csiph-web

FIRST_k, FOLLOW_k, k>1

Contents

#2442 — FIRST_k, FOLLOW_k, k>1

#2443

#2445

#2446

#2447

#2444 — FIRST_k, FOLLOW_k, k&gt;1

#2448

#2444 — FIRST_k, FOLLOW_k, k>1