Has lexing and parsing theory advanced since the 1970's?

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups	comp.compilers
Subject	Has lexing and parsing theory advanced since the 1970's?
Date	2021-09-18 15:54 +0300
Organization	Compilers Central
Message-ID	<21-09-011@comp.compilers> (permalink)

Show all headers | View raw

Roger L Costello <costello@mitre.org> asked:

> That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
> 1970’s? If yes, are there parser generators available today which are based on
>  those advances in lexing/parsing theory? Or does Flex & Bison still represent
>  the state-of-the-art in terms of the underlying theory it uses?

Yes, compiler theory has advanced since the 1970s.

Multiple token lookahead is now supported by a variety of compiler
generators.

And, many modern parser generators allow regular expressions on the
right-hand-side of rules. Conversely, you see lexer generators that
accept context-free (recursive) rules to define tokens (while still
allowing regular expressions).

One of the key ways you often see multiple token lookahead implemented
is in the form of predicated grammars. These are of the form, if you
see this, try applying this rule first and if it doesn't match try
other rules. Often the lookahead for them is the complete rule, i.e.
try this rule and if it matches use it, often a form of backtrack
parsing. This has evolved into "Parsing Expression Grammars" (PEGs),
where you list the rules in the order you want them to match. And,
"packrat parsing" has turned this into a practical method with [near?]
linear performance. Thus, in most cases, you aren't paying an
N-squared penalty for backtracking.

You see a particularly interesting evolution of that in ANTLR4, where
you can apply precedence and associativity rules to grammars with
left-recursion and still in most cases get an LL-like parser out as
the generated result. However, it interprets most rules in a PEG like
fashion giving an implicit precedence to them. The result is grammars
that are relatively easy to read, and mostly do the obvious correct
thing.

The GLR approach of making a parsing forest is an alternative to that.
It won't be long before someone combines the two approaches. You can
specify the order in which to try productions, but the generator will
recognize cases where two or more apply and keep enough information
around to allow the application choice to be made at reduction time,
while allowing one to say in the case of ambiguity, I want this choice
to apply or to make the choice be "both" and the semantics will
disambiguate it.

Beyond that, there are extensions to what language class the parser
generators allow.

To my mind, the most notable one is adaptive grammars. These are
extensions that allow one to add to or modify the rules the grammar
supports while the parser is running. The most obvious example is
adding new "types" or "keywords" to the grammar. This is a more
sophisticated solution to the typedef problem. And can result in
grammars where you have properly scoped type analysis all done as part
of parsing.

Another variant are conjunctive grammars. These allow you to take the
intersection of two rules and say match only if the intersection
matches. This allows one to make finer grained decisions on what legal
sentences are. It's effect is similar (but often more general than)
predicated grammars, but similar techniques can be used to implement
it.

Finally, the regular expression model is suitable to extension with
the PCRE operators, in particular captures and back-references, but
also assertions (which are essentially predicated) and greedy versus
non-greedy matching. I have not yet seen a parser generator that
implements that, but it is actually a fairly small tweak to LR
matching, captures are simply a special form of non-terminal where all
occurrences (within a scope) need to match the same "string".

---------

However, beyond all that, which are tweaks to the lexing and parsing
engines and generators is a bigger change. Parser generator writers
have realized that simply getting a correct parse is only a small
fragment of building a front-end.

In particular, one doesn't simply want a raw parse tree. One wants an
AST that discards irrelevant information and more particular
highlights relevant information and makes transformations and
traversals easier.

There have been several advancements along that line.

Steve Johnson (the original yacc author) wrote a paper "Yacc meets
C++" which shows how to turn an AST into C++ classes with relevant
methods. The result is a very object-oriented view of what an AST is.

Alternately, ANTLR incorporates the Visitor and Listener design
patterns into the parse trees it builds. This makes semantic
traversals much easier.

There are also tools like sourcer which treats parse trees as
s-expressions and works something like hygienic macros on them,
allowing one to write transformation rules to reshape the tree as
desired.

Even more along those lines are the "pattern matching" (destructuring)
operators found in many functional languages. Those also allow one to
transform trees at the semantic level.

Finally, much of this has gone to formalizing more standardized IRs.
Much of the parsing work is now simply taking the input and reshaping
it to match an IR used by a sophisticated backend.

--
*****************************************************************************
Chris Clark                  email:
christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris

Back to comp.compilers | Previous | Next | Find similar

Thread

Has lexing and parsing theory advanced since the 1970's? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2021-09-18 15:54 +0300

csiph-web