Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2705
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Has lexing and parsing theory advanced since the 1970's? |
| Date | 2021-09-18 15:54 +0300 |
| Organization | Compilers Central |
| Message-ID | <21-09-011@comp.compilers> (permalink) |
Roger L Costello <costello@mitre.org> asked: > That said, Flex & Bison is old. Has lexing/parsing theory advanced since the > 1970’s? If yes, are there parser generators available today which are based on > those advances in lexing/parsing theory? Or does Flex & Bison still represent > the state-of-the-art in terms of the underlying theory it uses? Yes, compiler theory has advanced since the 1970s. Multiple token lookahead is now supported by a variety of compiler generators. And, many modern parser generators allow regular expressions on the right-hand-side of rules. Conversely, you see lexer generators that accept context-free (recursive) rules to define tokens (while still allowing regular expressions). One of the key ways you often see multiple token lookahead implemented is in the form of predicated grammars. These are of the form, if you see this, try applying this rule first and if it doesn't match try other rules. Often the lookahead for them is the complete rule, i.e. try this rule and if it matches use it, often a form of backtrack parsing. This has evolved into "Parsing Expression Grammars" (PEGs), where you list the rules in the order you want them to match. And, "packrat parsing" has turned this into a practical method with [near?] linear performance. Thus, in most cases, you aren't paying an N-squared penalty for backtracking. You see a particularly interesting evolution of that in ANTLR4, where you can apply precedence and associativity rules to grammars with left-recursion and still in most cases get an LL-like parser out as the generated result. However, it interprets most rules in a PEG like fashion giving an implicit precedence to them. The result is grammars that are relatively easy to read, and mostly do the obvious correct thing. The GLR approach of making a parsing forest is an alternative to that. It won't be long before someone combines the two approaches. You can specify the order in which to try productions, but the generator will recognize cases where two or more apply and keep enough information around to allow the application choice to be made at reduction time, while allowing one to say in the case of ambiguity, I want this choice to apply or to make the choice be "both" and the semantics will disambiguate it. Beyond that, there are extensions to what language class the parser generators allow. To my mind, the most notable one is adaptive grammars. These are extensions that allow one to add to or modify the rules the grammar supports while the parser is running. The most obvious example is adding new "types" or "keywords" to the grammar. This is a more sophisticated solution to the typedef problem. And can result in grammars where you have properly scoped type analysis all done as part of parsing. Another variant are conjunctive grammars. These allow you to take the intersection of two rules and say match only if the intersection matches. This allows one to make finer grained decisions on what legal sentences are. It's effect is similar (but often more general than) predicated grammars, but similar techniques can be used to implement it. Finally, the regular expression model is suitable to extension with the PCRE operators, in particular captures and back-references, but also assertions (which are essentially predicated) and greedy versus non-greedy matching. I have not yet seen a parser generator that implements that, but it is actually a fairly small tweak to LR matching, captures are simply a special form of non-terminal where all occurrences (within a scope) need to match the same "string". --------- However, beyond all that, which are tweaks to the lexing and parsing engines and generators is a bigger change. Parser generator writers have realized that simply getting a correct parse is only a small fragment of building a front-end. In particular, one doesn't simply want a raw parse tree. One wants an AST that discards irrelevant information and more particular highlights relevant information and makes transformations and traversals easier. There have been several advancements along that line. Steve Johnson (the original yacc author) wrote a paper "Yacc meets C++" which shows how to turn an AST into C++ classes with relevant methods. The result is a very object-oriented view of what an AST is. Alternately, ANTLR incorporates the Visitor and Listener design patterns into the parse trees it builds. This makes semantic traversals much easier. There are also tools like sourcer which treats parse trees as s-expressions and works something like hygienic macros on them, allowing one to write transformation rules to reshape the tree as desired. Even more along those lines are the "pattern matching" (destructuring) operators found in many functional languages. Those also allow one to transform trees at the semantic level. Finally, much of this has gone to formalizing more standardized IRs. Much of the parsing work is now simply taking the input and reshaping it to match an IR used by a sophisticated backend. -- ***************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris
Back to comp.compilers | Previous | Next | Find similar
Has lexing and parsing theory advanced since the 1970's? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2021-09-18 15:54 +0300
csiph-web