Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2991 > unrolled thread
| Started by | Roger L Costello <costello@mitre.org> |
|---|---|
| First post | 2022-05-04 11:22 +0000 |
| Last post | 2022-05-07 13:10 -0700 |
| Articles | 9 — 5 participants |
Back to article view | Back to comp.compilers
Flex is the most powerful lexical analysis language in the world. True or False? Roger L Costello <costello@mitre.org> - 2022-05-04 11:22 +0000
Re: Flex is the most powerful lexical analysis language in the world. True or False? Tom Shields <thomas.evans.shields@gmail.com> - 2022-05-04 14:14 -0500
Flex is the most powerful lexical analysis language in the world. True or False? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-05-05 15:20 +0300
RE: Flex is the most powerful lexical analysis language in the world. True or False? Roger L Costello <costello@mitre.org> - 2022-05-06 11:16 +0000
RE: Flex is the most powerful lexical analysis language in the world. True or False? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-05-07 13:15 +0300
Simple Lexer and Simple Parser [ was RE: Flex is the most powerful lexical analysis language in the world. True or False? ] Roger L Costello <costello@mitre.org> - 2022-05-08 13:34 +0000
Re: Flex is the most powerful lexical analysis language in the world. True or False? George Neuner <gneuner2@comcast.net> - 2022-05-06 11:00 -0400
Re: Flex is the most powerful lexical analysis language in the world. True or False? gah4 <gah4@u.washington.edu> - 2022-05-06 14:30 -0700
Re: fun with Postscript, was Flex is the most powerful lexical analysis language in the world. True or False? gah4 <gah4@u.washington.edu> - 2022-05-07 13:10 -0700
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2022-05-04 11:22 +0000 |
| Subject | Flex is the most powerful lexical analysis language in the world. True or False? |
| Message-ID | <22-05-003@comp.compilers> |
Hi Folks, 1. A lexical analysis language that exclusively provides regular expressions for scanning input can only process regular languages. (a) True (b) False 2. Flex provides, in addition to regular expressions, states and a pushdown stack. This greatly expands the set of languages that can be processed. (a) True (b) False 3. Because Flex provides states and a pushdown stack, Flex lexers can process context-free languages. (a) True (b) False 4. No other lexical analysis language provides states and a pushdown stack. (a) True (b) False 5. Flex is the most powerful lexical analysis language in the world. (a) True (b) False /Roger [I think that you could easily graft a state stack into any lexer that has start states. Also, tools like Antlr combine the lexer and parser generators, so they're at least as powerful as flex. -John]
[toc] | [next] | [standalone]
| From | Tom Shields <thomas.evans.shields@gmail.com> |
|---|---|
| Date | 2022-05-04 14:14 -0500 |
| Message-ID | <22-05-005@comp.compilers> |
| In reply to | #2991 |
FYI: RE/flex [https://github.com/Genivia/RE-flex] provides a superset of flex functionality, generates a better implementation of a scanner in C++ than does flex, and, most importantly, is actively maintained. Tom Shields
[toc] | [prev] | [next] | [standalone]
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Date | 2022-05-05 15:20 +0300 |
| Message-ID | <22-05-007@comp.compilers> |
| In reply to | #2991 |
First, of all, I think Flex is a relatively fine lexer generator with some good properties and reasons why it is popular, but claiming it is "the most powerful lexical analysis language in the world" is way too far overstating it. It is not even as powerful as some other lexical analysis languages and even exploiting its power often requires things that cannot be expressed in Flex alone nor can they be done in ways that are simple and easy to reason about. So, that part of the statement is patently false. I will answer the question, but go beyond simple, yes and no responses to show the actual issues and not some over-simplification of the problem. > 1. A lexical analysis language that exclusively provides regular expressions > for scanning input can only process regular languages. True if you mean automata theoretic regular expressions, i.e. not including PCRE extensions, like back references and captures and zero-width assertions. However, if you include such extensions (or other ones I will mention below), you can escape those limits and get to larger language classes, including ones that are Turing Complete--not that Turing Machine power cannot be abused and result in a lexer who you cannot figure out what the tokens are without solving the Halting Problem. So, what one really wants is a nicely constrained but larger lexing language that is still easy to reason about. There are tools that approximate that. None are perfect yet (if perfection is even achievable and not merely a limit to be approached but never reached). > 2. Flex provides, in addition to regular expressions, states and a pushdown > stack. This greatly expands the set of languages that can be processed. > 3. Because Flex provides states and a pushdown stack, Flex lexers can process > context-free languages. True and true. States by themselves don't add anything. The intersection or union of two regular expression languages is still regular. However, a pushdown automata can allow you to process a context free language. Now, not having used that feature myself, I don't know whether one has to "program" the automata by hand or one can express it "naturally" as context free rules. I suspect the latter, but don't know for sure. In comparison, in 1986, when we built the lexer for our version of Yacc++, we integrated regular expressions and LR rules, so that one can simply write yacc-like rules with recursion to express [deterministic, i.e. ELR] grammars (with regular expressions on the right hand side). To my mind, the notation being unified and used both in the lexer and the parser is easier to comprehend and within reason is much easier to reason about. It makes things like nested comments into an "obvious" recursive lexical rule. We also have lexer classes which allow one to do lexer states (for context sensitivity) with inheritance between them and you can change their states (which class is being lexed) via rules in the lexer or parser, doing the things that Flex does. Terence Parr liked our idea and eventually incorporated it into ANTLR (originally doing an LL version of it), the current version in ANTLR4 is more like a PEG (parsing expression grammar) with extensions to handle simple (aka direct) left-recursion and precedence. Moreover, he has some simple lexer extensions that look quite FLEX like (rules for changing "state" in a controlled fashion including pushing/popping from a stack). Being more constrained, it is actually easier to reason about. You aren't trying to analyze arbitrary C (Java) code. > 4. No other lexical analysis language provides states and a pushdown stack. > 5. Flex is the most powerful lexical analysis language in the world. False and false. See the preceding paragraphs. Not only do they do so. They do so in a manner which is more easy to reason about. Moreover, syntactic (and semantic) predicates (an innovation Terence introduced and we borrowed later) allow one to express languages that are larger than the context free family. In fact, there are proofs that one can use them to express any Turing Complete language. They are typically implemented using backtracking. However, Bryan Forward took the concept of predicates and developed packrat parsing as a way of implemented PEGs efficiently, effectively linear. And, the notation is only minorly different than context free grammars, with the main difference being that the or (alternative) is treated as ordered, if the first rule matches apply it and don't even try the following rules. As I noted above, this nicely solves certain precedence problems. (And, as I said, ANTLR4 incorporates this idea to positive effect.) This work has further been extended by Alexander Okhotin into a variety of "Conjuctive" and "Boolean" grammars. These again have Turing Machine power, but are usually simpler to understand and reason about than most Turing Machines. In a different direction researchers like Quinn-Tyler Jackson have developed Adaptive Grammars (see the relevant Wikipeida article). This is a different solution to the problem of making a constrained version of Turing Machine power. Notably they are well designed to handle "type checking" with the equivalent of the C typedef hack, but without it being a hack but a formalism that can give clear semantics. Note that this is traditionally very hard to do in context free grammars--and as far as I can tell, most people writing lexers and parsers flounder trying to incorporate even simple type checking into their grammar, by using the wrong tool (context free grammars) to do so. There are other directions that have been pursued also. Scanner-less parser generators, which merge the lexing and parsing into one phase. GLR parsers (and presumably lexers) that deal with ambiguity by making forests rather than just trees/DAGs. Parsing combinators can be used for lexing also. There are probably even concepts that I have never read about or have read about and forgotten. Many of these advances have been incorporated into lexer generators. They are [almost] all more powerful lexical analysis languages than what Flex provides. Most of them are more clear and easier to reason about also. So, pardon my knee-jerk reaction to your question. It is simply that most people aren't aware of all the different technologies that are available. Flex is fine, but it is certainly not a be-all-end-all solution. It wouldn't even be the solution I would reach for first. Kind regards, Chris -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------
[toc] | [prev] | [next] | [standalone]
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2022-05-06 11:16 +0000 |
| Message-ID | <22-05-009@comp.compilers> |
| In reply to | #2993 |
Thank you Chris for that superb explanation! > Flex is fine, but it is certainly not a be-all-end-all > solution. It wouldn't even be the solution I would > reach for first. May I ask, what would be the solution that you would reach for first? /Roger [I'm not Chris but the answer is the one that does the job. You don't need a flame thrower if you only need to light the BBQ. -John]
[toc] | [prev] | [next] | [standalone]
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Date | 2022-05-07 13:15 +0300 |
| Message-ID | <22-05-018@comp.compilers> |
| In reply to | #2994 |
Roger, since you asked, I will answer what solution I would reach for (and give advice on what I think others should reach for). First, John's advice is straight on. You don't need a flamethrower in most cases, and in fact having one invites one to abuse it. I think Purdue has a series of videos on the fastest way to light a barbeque which end with completely melting the bbq in a matter of seconds using liquid oxygen. Fast, but probably the wrong solution for grilling burgers and hot dogs. Next, if you already have a lexer, I would NOT change the technology using it, at least not by much, unless one had a specific reason to do otherwise. I might switch from original LEX to Flex (and original yacc to Bison), but that would be one of the few exceptions to rule. Now, if I had issues I would switch, but I would first attempt to see if there are workarounds. in my last three projects, there was already a lexer-parser combinartion in use. So, in my last three projects (and four implementations) we used: Bison + Flex, ANTLR, Parser-RS, and JAVACC. In the last two, I don't even know what lexer was used as I never touched it other than to add keywords. And that goes to an important point. Your lexer *should be* almost trivially simple (i.e. regular expressions only and not complicated ones). You rarely want to solve problems at the lexical level. You are much less likely to get good error reporting if you do. In most cases, your parser should be simple also. You might want LR parsing for expressions, but otherwise you want your grammar to be LL(1) (with the if-then-else hack). And, all else being equal, if you don't have a lexer-parser combination you can reuse. I would pick somewhat based upon programming language, since most tools are relatively tied to one language even when they support more than one. I haven't decided on my favorite for Rust yet, parser-RS isn't bad, Nom is also popular. For Java, I would go with ANTLR4. And, overall, I would say that is my current favorite despite a few nits. For C++ or C#, I would use the Yacc++ we wrote even though it needs some tweaking to catch up to ANTLR at this point. I prefer our solution to keywords to what ANTLR has and indirect left recursion (i.e. parsing expressions with lists of expressions as a first argument) doesn't work right in ANTLR. If someone else was paying for it, I would investigate the DMS Toolkit for Semantic Designs, because they have done most of the work to make GLR practical. If I really wanted to solve types in my grammar, I would look into "meta-ess" by Jackson. I don't know how available that is. And, if I were using Scheme or Lisp, I would look into Racket. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------
[toc] | [prev] | [next] | [standalone]
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2022-05-08 13:34 +0000 |
| Subject | Simple Lexer and Simple Parser [ was RE: Flex is the most powerful lexical analysis language in the world. True or False? ] |
| Message-ID | <22-05-022@comp.compilers> |
| In reply to | #2999 |
Thank you again Chris. Terrific information.
Another question if I may. You wrote:
> And that goes to an important point. Your lexer *should be* almost
> trivially simple (i.e. regular expressions only and not complicated
> ones). You rarely want to solve problems at the lexical level. You
> are much less likely to get good error reporting if you do. In most
> cases, your parser should be simple also.
For a while now I have been (for fun) working on building a parser for
parsing XML documents. I have experimented with making the lexer
simple and with making the parser simple. If I make the lexer simple,
then the parser is complex. If I make the lexer complex (using lots of
states and making heavy use of Flex's pushdown stack) then the parser
is simple. It doesn't seem possible to make both the lexer and parser
simple.
There are lots of "conditional rules" in XML. For example, in XML the
& is called an "XML entity." Since the & is a reserved symbol, XML
documents need to use & instead of &. An XML parser is to convert
& to &. However, if the & is in certain contexts -- within a
comment or within a CDATA section -- then the & is not converted.
Thus, there is conditional processing:
IF (& is in a comment or in a CDATA section) THEN
OUTPUT(&)
ELSE
OUTPUT(&)
Flex's states/stack mechanism is ideally suited for conditional
processing like this. From the section on Start Conditions in the Flex
manual: "flex provides a mechanism for conditionally activating
rules."
So while it would be great to have a simple lexer, I am leaning
towards dealing with the conditional rules in XML using the Flex
states/stack mechanism rather than dealing with the conditional rules
in Bison. In other words, I am leaning towards a complex lexer.
I am interested in hearing your thoughts on this.
> You don't need a flamethrower
My apologies. It wasn't my intent to throw a flame. But in hindsight I
can see that I should have worded things much better. I will do better
in the future.
/Roger
[toc] | [prev] | [next] | [standalone]
| From | George Neuner <gneuner2@comcast.net> |
|---|---|
| Date | 2022-05-06 11:00 -0400 |
| Message-ID | <22-05-011@comp.compilers> |
| In reply to | #2991 |
On Wed, 4 May 2022 11:22:34 +0000, Roger L Costello <costello@mitre.org> wrote: >Hi Folks, >1. A lexical analysis language that exclusively provides regular expressions >for scanning input can only process regular languages. >2. Flex provides, in addition to regular expressions, states and a pushdown >stack. This greatly expands the set of languages that can be processed. >3. Because Flex provides states and a pushdown stack, Flex lexers can process >context-free languages. >4. No other lexical analysis language provides states and a pushdown stack. >5. Flex is the most powerful lexical analysis language in the world. >[I think that you could easily graft a state stack into any lexer that has start states. >Also, tools like Antlr combine the lexer and parser generators, so they're at least as >powerful as flex. -John] +1 John. Roger, if you hadn't already asked some more interesting questions, I would suspect this 'test' was homework. Flex is powerful, but it certainly is not alone. As John's response hinted, there are (plenty of other) tools that more or less are equivalent. And not all of them are based on LR. https://en.wikipedia.org/wiki/Comparison_of_parser_generators Not to mention that programming languages which tend to actually be used also tend to be [relatively] easily parsed using LL(k). LR is useful AS AN IMPLEMENTATION TECHNIQUE, but in general if your language is complex enough to really /require/ (G)LR or PEG parsing, it probably is too complicated to be used by average programmers. YMMV, George
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2022-05-06 14:30 -0700 |
| Message-ID | <22-05-015@comp.compilers> |
| In reply to | #2996 |
On Friday, May 6, 2022 at 9:14:54 AM UTC-7, George Neuner wrote: (snip) > Not to mention that programming languages which tend to actually be > used also tend to be [relatively] easily parsed using LL(k). An important part of a programming language is that people can understand it. I suspect it isn't hard to design a language that computers can easily parse, but people can't. Your lexer only needs to be good enough for actual programming languages. As with BBQs, that doesn't stop people from trying. [Take a look at Postscript, which is trivial to tokenize and parse since it's a stream of tokens in RPN order, but making sense of it by humans is a challenge. Or, of course, m4. -John]
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2022-05-07 13:10 -0700 |
| Subject | Re: fun with Postscript, was Flex is the most powerful lexical analysis language in the world. True or False? |
| Message-ID | <22-05-020@comp.compilers> |
| In reply to | #2997 |
(snip, I wrote) > An important part of a programming language is that people can understand it. (snip) > [Take a look at Postscript, which is trivial to tokenize and parse since > it's a stream of tokens in RPN order, but making sense of it > by humans is a challenge. Or, of course, m4. -John] Reminds me of the origin of RISC, when (I believe) IBM noticed that compilers were using a small subset of the available instructions, and that much less programming was being done in assembly. But okay, was Postscript supposed to be written by people, or programs? As with code generated by compilers, it has to be understood by people at least once, but even then, only step by step, and not (usually) the whole program at once. But yes you can write unreadable Postscript. Once, some years ago, we (me and some others) needed to redefine def. [Actually, RISC was at Berkeley, and IBM's project was the 801. But yes, they noticed compilers used only a fraction of the S/360 instruction set so they made a minimal design that supported only what their state-of-the art compiler used. RISC was sort of the same but they used the mediocre PCC compiler which is why they had register windows to compensate for PCC's weak register allocation. -John]
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web