Groups > comp.compilers > #103 > unrolled thread

Maintaining scope while parsing C with a YACC grammar

Started by	eliben <eliben@gmail.com>
First post	2011-04-25 05:14 -0700
Last post	2011-05-06 10:43 -0700
Articles	9 — 6 participants

Back to article view | Back to comp.compilers

  Maintaining scope while parsing C with a YACC grammar eliben <eliben@gmail.com> - 2011-04-25 05:14 -0700
    Re: Maintaining scope while parsing C with a YACC grammar Robert A Duff <bobduff@shell01.TheWorld.com> - 2011-04-26 12:22 -0400
      Re: Maintaining scope while parsing C with a YACC grammar Robert A Duff <bobduff@shell01.TheWorld.com> - 2011-04-26 14:08 -0400
      Re: Maintaining scope while parsing C with a YACC grammar eliben <eliben@gmail.com> - 2011-04-28 23:20 -0700
        Re: Maintaining scope while parsing C with a YACC grammar Robert A Duff <bobduff@shell01.TheWorld.com> - 2011-05-02 20:19 -0400
          Re: Maintaining scope while parsing C with a YACC grammar "Ira Baxter" <idbaxter@semdesigns.com> - 2011-05-13 17:46 -0500
          Maintaining scope while parsing C with a Yacc grammar Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-12 20:43 -0400
        Re: Maintaining scope while parsing C with a YACC grammar torbenm@diku.dk (Torben Ægidius Mogensen) - 2011-05-03 09:51 +0200
    Re: Maintaining scope while parsing C with a YACC grammar Paul B Mann <paul@paulbmann.com> - 2011-05-06 10:43 -0700

#103 — Maintaining scope while parsing C with a YACC grammar

From	eliben <eliben@gmail.com>
Date	2011-04-25 05:14 -0700
Subject	Maintaining scope while parsing C with a YACC grammar
Message-ID	<11-04-036@comp.compilers>

Hello,

Suppose I'm parsing C with a YACC-like grammar. While parsing this
code:

int main()
{
  int a = 1;
  {  /* internal scope */
    int b = 2;
  }
}

I want to know that the declaration "int b = 2" happens inside an
internal scope. My problem is that the way YACC grammars (and bottom-
up parsing in general) are structured, "int b = 2" is analyzed
*before* the whole block { int b = 2;} so I've seen "int b = 2" before
I've seen the complete block. What is the usual solution to this
problem?

Thanks in advance

[There's a couple of possibilities.  One is to add separate rules for
the open and close braces with action code that increments and
decrements the nesting level, so you know when you reduce the
declaration what scope it is in.  Another is to parse the whole thing
into an AST without trying to interpret the symbols other than making
pointers to a generic symbol table entry per name, then walk the AST
and add the scope and type info.  Perhaps people will have other
suggestions. -John]

[toc] | [next] | [standalone]

#105

From	Robert A Duff <bobduff@shell01.TheWorld.com>
Date	2011-04-26 12:22 -0400
Message-ID	<11-04-038@comp.compilers>
In reply to	#103

eliben <eliben@gmail.com> writes:

> [There's a couple of possibilities.  One is to add separate rules for
> the open and close braces with action code that increments and
> decrements the nesting level, so you know when you reduce the
> declaration what scope it is in.  Another is to parse the whole thing
> into an AST without trying to interpret the symbols other than making
> pointers to a generic symbol table entry per name, then walk the AST
> and add the scope and type info.  Perhaps people will have other
> suggestions. -John]

I strongly recommend the "build a tree" solution.  It might seem like
a lot of trouble at first, but it will simplify things in the long
run.

Do all the interesting work during a subsequent walk of the tree.  Or
multiple walks.

I'd do the "making pointers to a generic symbol table..." part during
the tree walk, too, except that C requires some sort of kludgery to
deal with typedefs.

- Bob
[The point of the generic symbol table stuff is that you have to remember the
names somehow, and that seems less awful than doing a strdup() for each
name and hanging the strings off the AST. -John]

[toc] | [prev] | [next] | [standalone]

#106

From	Robert A Duff <bobduff@shell01.TheWorld.com>
Date	2011-04-26 14:08 -0400
Message-ID	<11-04-039@comp.compilers>
In reply to	#105

Robert A Duff <bobduff@shell01.TheWorld.com> writes:

> [The point of the generic symbol table stuff is that you have to
 remember the names somehow, and that seems less awful than doing a
> strdup() for each name and hanging the strings off the AST. -John]

Oh, yeah, sure -- I didn't understand what you meant.

You want a table of identifiers, so multiple occurrences are just an
index into that table or whatever.  But no particular semantic
information attached to those identifiers.

I once wrote a compiler-like tool that represented identifiers
as an index into the source code.  But I later decided that was
a dumb idea.

- Bob

[toc] | [prev] | [next] | [standalone]

#112

From	eliben <eliben@gmail.com>
Date	2011-04-28 23:20 -0700
Message-ID	<11-05-003@comp.compilers>
In reply to	#105

On Apr 26, 7:22 pm, Robert A Duff <bobd...@shell01.TheWorld.com>
wrote:
> eliben <eli...@gmail.com> writes:
> > [There's a couple of possibilities.  One is to add separate rules for
> > the open and close braces with action code that increments and
> > decrements the nesting level, so you know when you reduce the
> > declaration what scope it is in.  Another is to parse the whole thing
> > into an AST without trying to interpret the symbols other than making
> > pointers to a generic symbol table entry per name, then walk the AST
> > and add the scope and type info.  Perhaps people will have other
> > suggestions. -John]
>
> I strongly recommend the "build a tree" solution.  It might seem like
> a lot of trouble at first, but it will simplify things in the long
> run.
>
> Do all the interesting work during a subsequent walk of the tree.  Or
> multiple walks.

Since it's parsing of C I'm talking about, this approach will have to
somehow handle ambiguity of this kind:

T * x;

This can be either a declaration or a multiplication, depending on
earlier symbol table information (whether T is a type or not).

So are you proposing to build an ambiguous AST that contains *both*
parses and resolve between them in later passes? Or just pick one and
maybe modify it later? Do you have references (papers, books, etc.)
explaining this technique?

Thanks in advance

[toc] | [prev] | [next] | [standalone]

#116

From	Robert A Duff <bobduff@shell01.TheWorld.com>
Date	2011-05-02 20:19 -0400
Message-ID	<11-05-007@comp.compilers>
In reply to	#112

eliben <eliben@gmail.com> writes:

> On Apr 26, 7:22 pm, Robert A Duff <bobd...@shell01.TheWorld.com>
> wrote:
>> I strongly recommend the "build a tree" solution.  It might seem like
>> a lot of trouble at first, but it will simplify things in the long
>> run.
>>
>> Do all the interesting work during a subsequent walk of the tree.  Or
>> multiple walks.
>
> Since it's parsing of C I'm talking about, this approach will have to
> somehow handle ambiguity of this kind:
>
> T * x;

Right, that's what I meant by:

    ...except that C requires some sort of kludgery to
    deal with typedefs.

in my earlier post.

> This can be either a declaration or a multiplication, depending on
> earlier symbol table information (whether T is a type or not).
>
> So are you proposing to build an ambiguous AST that contains *both*
> parses and resolve between them in later passes?

I have never seen a C parser that does that.  I've seen Ada parsers
that do that (Ada has an ambiguous grammar, too).  The idea is that
you create a single tree node that represents "this or that", and
during semantic analysis, you transform it into a "this" or a "that"
tree.  It works well for Ada.  I don't know how well it would work for
C.

The typical approach for C is to do as you say -- put typedefs in the
symbol table, and have them affect the parse (or really, the lexer).

In any case, I stand by my statement, 'I strongly recommend the "build a
tree" solution.' even for C.  Defer as much as possible of semantic
analysis until after parsing, where "as much as possible" is not
necessarily "all of it".

>... Or just pick one and maybe modify it later?

That's possible, but kind of kludgy.

>...Do you have references (papers, books, etc.)
> explaining this technique?

No, sorry.  I'd be interested to hear if anybody built a C parser that
didn't use any feedback into the parser from a symbol table to deal
with the typedef issue.

- Bob

[toc] | [prev] | [next] | [standalone]

#120

From	"Ira Baxter" <idbaxter@semdesigns.com>
Date	2011-05-13 17:46 -0500
Message-ID	<11-05-011@comp.compilers>
In reply to	#116

"Robert A Duff" <bobduff@shell01.TheWorld.com> wrote in message
> No, sorry.  I'd be interested to hear if anybody built a C parser that
> didn't use any feedback into the parser from a symbol table to deal
> with the typedef issue.

Our DMS Software Reengineering Toolkit does exactly this.
(www.semanticdesigns.com/Products/DMS/DMSToolkit.html).

We use GLR parsing machinery, which means we can decouple parsing from
semantic hackery.  The grammar for this is just as you'd expect, with
productions for expression statements and declarations.  The GLR
parser builds an ambiguous tree with maximal sharing of the subtrees.
At the end of the parse, you have a parse DAG with ambiguity nodes.

Name resolution is handled by an attribute grammar evaluator with some
special effects.  The AGE is pretty general and we use it for lots of
tree-structured analyses; this is pretty nice for nested scopes,
building control flow graphs, etc.  For name and type resolution, the
attributes passed up and down the tree are, as you might expect,
reference to symbol scopes and/or type declarations.  The special
effects of interest here are the ability to declare an aborted
attribute evaluation at any tree node; this will kill off any
dependent attribute computation and delete the offending tree node and
its otherwise-unreferenced children (remember, there may be sharing,
so a node may have more than one reference).  What we do to handle the
T* x issue is check for consistency of the types; if T is a type, then
T*x as an expression makes no sense and that part of the AGE dies off.
If T is numeric type, then T* x makes no sense as a declartion, and
that part of the AGE dies off.  Destroying the "bad" subtree removes
the incorrect parse from the tree, and thus the final tree has no
ambiguity nodes and a completed name/type resoultion with symbol
tables that is correct.  The C parser is question has been used in
anger to carry out static analysis of software systems with usome
19,000 compilation units.  We think it is quite robust.

This scheme works really nicely for C, C++, COBOL and other languages,
and it means we can build grammars without worrying about semantic
constraints.  The grammars are thus nice and clean, and the AGEs are
pretty clean too because of the separation of concerns.

You can do this by mixing symbol lookup and parsing, but then it
generally gets a lot messier to build both.

--
Ira Baxter, CTO
www.semanticdesigns.com

[toc] | [prev] | [next] | [standalone]

#145

From	Chris F Clark <cfc@shell01.TheWorld.com>
Date	2011-06-12 20:43 -0400
Message-ID	<11-06-019@comp.compilers>
In reply to	#116

I'm sorry I missed this discussion.  I did see the comments by Ira
Baxter and Paul Mann though and found them very informative.  The one
thing that it prompted me to add is that there results are not
contradictory.

Paul Mann's grammar additions allow one to describe scoping as a
S-attributed grammar problem--that's the class of calculations one can
perform in one pass as one parses with an LR grammar.  This matches
the fact that many languages such as C & Pascal are defined to allow
their declarations to be processed in one pass with define before use
rules.

The system Ira Baxter described was a general attribute solving
system.  Thus, it includes Paul's model, but being more general
requires more work to do so.  The general system does allow parsing
languages like PL/I where the declaration can appear after some (or
all) of the uses.

Hope this helps,
-Chris

******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris

[toc] | [prev] | [next] | [standalone]

#117

From	torbenm@diku.dk (Torben Ægidius Mogensen)
Date	2011-05-03 09:51 +0200
Message-ID	<11-05-008@comp.compilers>
In reply to	#112

eliben <eliben@gmail.com> writes:
> Since it's parsing of C I'm talking about, this approach will have to
> somehow handle ambiguity of this kind:
>
> T * x;
>
> This can be either a declaration or a multiplication, depending on
> earlier symbol table information (whether T is a type or not).

One technique for handling this is to let the lexer access the symbol
table and determine if T is a type name or not and generate different
tokens for these.  The grammar would then have productions somewhat
like

Declaration -> Type non-type-id
             | ...

Type -> type-id
      | Type *
      | ...

Expression -> Expression * Expression
            | non-type-id
            | ...

It becomes much more complicated for real C, but the idea should be
clear enough.

This requires the parser to keep a symbol table for the current scope
available to the lexer.  This table needs not contain full information
for each identifier, just enough to distinguish type names from other
names.

That said, I consider this kind of ambiguity bad language design, as
it is not only hard for a parser to handle, but also hard for a human
reader.  Possible fixes are to make declarations and expressions /
statements non-overlapping syntactically (as in Pascal) or to keep
type names syntactically distinct from variable names, e.g. by making
type names start with upper case letter and variable names start with
lower case letters (as in Haskell).

	Torben
[As Dennis said, "the ice is thin here." -John]

[toc] | [prev] | [next] | [standalone]

#119

From	Paul B Mann <paul@paulbmann.com>
Date	2011-05-06 10:43 -0700
Message-ID	<11-05-009@comp.compilers>
In reply to	#103

There are two independent topics being discussed here.

(1) Scope of variables.
(2) typedef variables.

(1) Scope of variables is solved easily.  In your symbol table you
have to keep track of the level for all variables.  Every time you
see a '{' you have to increment the level.  So the 'a' will be put
into the symbol table as a level 1 variable and the 'b' will be a
level 2.  The bottom-up quality of LALR will not affect this.  The
'a' will be seen first and the 'b' later, so no problem here.

(2) The infamous 'typedef' problem continues to plague the newbie or
part-time LALR grammar hacker, but it was solved way back in 1987 by
an LALR parser generator which used an integrated symbol table and
semantic grammar symbols (e.i. {typedef}).  The state-of-the-art
simple solution works fine with the LRSTAR parser generator (see
http://highperware.com). Here is the simple LALR(1) grammar which
solves this 'typedef' problem:

/* C Typedef Solution. */

   <error>          => error();
   <identifier>    => lookup();  /* Symbol table lookup. */

/* Rules. */

   Input -> [Declaration]... <eof>                +> input_

   Declaration
       ->         VarDecl  [',' Var ]... ';'    +> decl_
       -> typedef VarDecl2 [',' Var2]... ';'    +> typedefdecl_

   VarDecl  -> Type...  Ident

   Var      -> [Ptr]... Ident

   VarDecl2 -> Type...  Ident2

   Var2     -> [Ptr]... Ident2

   Ident    -> <identifier>	          +> ident_(1)

   Ident2   -> <identifier>             +> ident_(1,{typedef})

   Type
          -> SimpleType	                +> type_(1)
          -> Type Ptr

   Ptr    -> '*'                        +> ptr_

   SimpleType
          -> char
          -> int
          -> short
          -> unsigned
          -> {typedef}


It handles the input file:

typedef unsigned int UINT, * UINTPTR;
UNIT a, b, c;
UINTPTR x, y, z;

Note, no hacks or kludges are required, just a state-of-the-art parser
generator.

Paul B Mann

[toc] | [prev] | [standalone]

csiph-web

Maintaining scope while parsing C with a YACC grammar

Contents

#103 — Maintaining scope while parsing C with a YACC grammar

#105

#106

#112

#116

#120

#145

#117

#119