Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Flex is the most powerful lexical analysis language in the world. True or False? Date: Thu, 5 May 2022 15:20:07 +0300 Organization: Compilers Central Lines: 132 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-05-007@comp.compilers> References: <22-05-003@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="97953"; mail-complaints-to="abuse@iecc.com" Keywords: flex, history Posted-Date: 05 May 2022 15:12:28 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2993 First, of all, I think Flex is a relatively fine lexer generator with some good properties and reasons why it is popular, but claiming it is "the most powerful lexical analysis language in the world" is way too far overstating it. It is not even as powerful as some other lexical analysis languages and even exploiting its power often requires things that cannot be expressed in Flex alone nor can they be done in ways that are simple and easy to reason about. So, that part of the statement is patently false. I will answer the question, but go beyond simple, yes and no responses to show the actual issues and not some over-simplification of the problem. > 1. A lexical analysis language that exclusively provides regular expressions > for scanning input can only process regular languages. True if you mean automata theoretic regular expressions, i.e. not including PCRE extensions, like back references and captures and zero-width assertions. However, if you include such extensions (or other ones I will mention below), you can escape those limits and get to larger language classes, including ones that are Turing Complete--not that Turing Machine power cannot be abused and result in a lexer who you cannot figure out what the tokens are without solving the Halting Problem. So, what one really wants is a nicely constrained but larger lexing language that is still easy to reason about. There are tools that approximate that. None are perfect yet (if perfection is even achievable and not merely a limit to be approached but never reached). > 2. Flex provides, in addition to regular expressions, states and a pushdown > stack. This greatly expands the set of languages that can be processed. > 3. Because Flex provides states and a pushdown stack, Flex lexers can process > context-free languages. True and true. States by themselves don't add anything. The intersection or union of two regular expression languages is still regular. However, a pushdown automata can allow you to process a context free language. Now, not having used that feature myself, I don't know whether one has to "program" the automata by hand or one can express it "naturally" as context free rules. I suspect the latter, but don't know for sure. In comparison, in 1986, when we built the lexer for our version of Yacc++, we integrated regular expressions and LR rules, so that one can simply write yacc-like rules with recursion to express [deterministic, i.e. ELR] grammars (with regular expressions on the right hand side). To my mind, the notation being unified and used both in the lexer and the parser is easier to comprehend and within reason is much easier to reason about. It makes things like nested comments into an "obvious" recursive lexical rule. We also have lexer classes which allow one to do lexer states (for context sensitivity) with inheritance between them and you can change their states (which class is being lexed) via rules in the lexer or parser, doing the things that Flex does. Terence Parr liked our idea and eventually incorporated it into ANTLR (originally doing an LL version of it), the current version in ANTLR4 is more like a PEG (parsing expression grammar) with extensions to handle simple (aka direct) left-recursion and precedence. Moreover, he has some simple lexer extensions that look quite FLEX like (rules for changing "state" in a controlled fashion including pushing/popping from a stack). Being more constrained, it is actually easier to reason about. You aren't trying to analyze arbitrary C (Java) code. > 4. No other lexical analysis language provides states and a pushdown stack. > 5. Flex is the most powerful lexical analysis language in the world. False and false. See the preceding paragraphs. Not only do they do so. They do so in a manner which is more easy to reason about. Moreover, syntactic (and semantic) predicates (an innovation Terence introduced and we borrowed later) allow one to express languages that are larger than the context free family. In fact, there are proofs that one can use them to express any Turing Complete language. They are typically implemented using backtracking. However, Bryan Forward took the concept of predicates and developed packrat parsing as a way of implemented PEGs efficiently, effectively linear. And, the notation is only minorly different than context free grammars, with the main difference being that the or (alternative) is treated as ordered, if the first rule matches apply it and don't even try the following rules. As I noted above, this nicely solves certain precedence problems. (And, as I said, ANTLR4 incorporates this idea to positive effect.) This work has further been extended by Alexander Okhotin into a variety of "Conjuctive" and "Boolean" grammars. These again have Turing Machine power, but are usually simpler to understand and reason about than most Turing Machines. In a different direction researchers like Quinn-Tyler Jackson have developed Adaptive Grammars (see the relevant Wikipeida article). This is a different solution to the problem of making a constrained version of Turing Machine power. Notably they are well designed to handle "type checking" with the equivalent of the C typedef hack, but without it being a hack but a formalism that can give clear semantics. Note that this is traditionally very hard to do in context free grammars--and as far as I can tell, most people writing lexers and parsers flounder trying to incorporate even simple type checking into their grammar, by using the wrong tool (context free grammars) to do so. There are other directions that have been pursued also. Scanner-less parser generators, which merge the lexing and parsing into one phase. GLR parsers (and presumably lexers) that deal with ambiguity by making forests rather than just trees/DAGs. Parsing combinators can be used for lexing also. There are probably even concepts that I have never read about or have read about and forgotten. Many of these advances have been incorporated into lexer generators. They are [almost] all more powerful lexical analysis languages than what Flex provides. Most of them are more clear and easier to reason about also. So, pardon my knee-jerk reaction to your question. It is simply that most people aren't aware of all the different technologies that are available. Flex is fine, but it is certainly not a be-all-end-all solution. It wouldn't even be the solution I would reach for first. Kind regards, Chris -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------