Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #3060 > unrolled thread
| Started by | Roger L Costello <costello@mitre.org> |
|---|---|
| First post | 2022-06-09 14:52 +0000 |
| Last post | 2022-06-12 14:10 +0000 |
| Articles | 7 — 7 participants |
Back to article view | Back to comp.compilers
The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them? Roger L Costello <costello@mitre.org> - 2022-06-09 14:52 +0000
Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them? "Alexei A. Frounze" <alexfrunews@gmail.com> - 2022-06-09 18:07 -0700
Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them? gah4 <gah4@u.washington.edu> - 2022-06-09 23:01 -0700
Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them? Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2022-06-10 12:26 +0200
The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-06-11 23:45 +0300
Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them? George Neuner <gneuner2@comcast.net> - 2022-06-11 18:15 -0400
Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them? anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-06-12 14:10 +0000
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2022-06-09 14:52 +0000 |
| Subject | The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them? |
| Message-ID | <22-06-023@comp.compilers> |
Hi Folks, Page 84-85 of the dragon book [1] says: There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing. 1. Simpler design is perhaps the most important consideration. The separation of lexical analysis from syntax analysis often allows us to simplify one or the other of these phases. For example, a parser embodying the conventions for comments and white space is significantly more complex than one that can assume comments and white space have already been removed by a lexical analyzer. It we are designing a new language, separating the lexical and syntactic conventions can lead to a cleaner overall language design. 2. Compiler efficiency is improved. A separate lexical analyzer allows us to construct a specialized and potentially more efficient processor for the task. A large amount of time is spent reading the source program and partitioning it into tokens. Specialized buffering techniques for reading input characters and processing tokens can significantly speed up the performance of a compiler. 3. Compiler portability is enhanced. Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer. The representation of special or non-standard symbols, such as the up-arrow in Pascal, can be isolated in the lexical analyzer. Specialized tools have been designed to help automate the construction of lexical analyzers and parsers when they are separated. ---------------------------- Those seem like compelling reasons for separating the lexical analysis from parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them? /Roger [1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman.
[toc] | [next] | [standalone]
| From | "Alexei A. Frounze" <alexfrunews@gmail.com> |
|---|---|
| Date | 2022-06-09 18:07 -0700 |
| Message-ID | <22-06-027@comp.compilers> |
| In reply to | #3060 |
On Thursday, June 9, 2022 at 9:50:50 AM UTC-7, Roger L Costello wrote: > 2. ... > A large amount of time is spent reading the source program and partitioning it > into tokens. Specialized buffering techniques for reading input characters and > processing tokens can significantly speed up the performance of a compiler. In any decent compiler this amount of time is relatively small compared to optimizations. Unless we're talking about JIT, which is a different kind of compiling. > 3. Compiler portability is enhanced. Input alphabet peculiarities and other > device-specific anomalies can be restricted to the lexical analyzer. The > representation of special or non-standard symbols, such as the up-arrow in > Pascal, can be isolated in the lexical analyzer. ASCII is supported out of the box nowadays. Unicode is available and simply parsing and storing Unicode code points is straightforward (processing Unicode text, especially displaying, is problematic, but that should hardly affect a programming language or its compiler much). > Those seem like compelling reasons for separating the lexical analysis from > parsing, ... > [1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman. The book is old and doesn't quite reflect the current state of things, IMO. Alex [My impression is that the lexer can often take significant time since it has to look at every character in the input, but the parser is fast unless you're doing something strange like very ambiguous Earley parsing. -John]
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2022-06-09 23:01 -0700 |
| Message-ID | <22-06-030@comp.compilers> |
| In reply to | #3060 |
On Thursday, June 9, 2022 at 9:50:50 AM UTC-7, Roger L Costello wrote: > Page 84-85 of the dragon book [1] says: > There are several reasons for separating the analysis phase of compiling into > lexical analysis and parsing. OK, no-one else has said it yet, so I will. With small memories and slow computers 70 or 60 years ago, the time and space were important. Then we had microcomputers, where we had to relearn everything learned on mainframes years before, so that was 50 or 40 years ago. But by 30 or 20 years ago, memories were big and computers fast. While programs needing compiling have also gotten larger, I am not convinced that either speed or size is much of a problem now. The separation of tasks, modular programming, often makes sense, so it is likely still reasonable to keep them separate. But I don't believe because of space or size. (Well, maybe running compilers on an Arduino Nano still have a size or space limit. But most don't do that.) There is still the old saying: "premature optimization is the root of all evil"
[toc] | [prev] | [next] | [standalone]
| From | Hans-Peter Diettrich <DrDiettrich1@netscape.net> |
|---|---|
| Date | 2022-06-10 12:26 +0200 |
| Message-ID | <22-06-032@comp.compilers> |
| In reply to | #3060 |
On 6/9/22 4:52 PM, Roger L Costello wrote:
> Those seem like compelling reasons for separating the lexical analysis from
> parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?
I cannot answer the ANTLR question but want to point out why IMO
(traditional) languages based on tokens should have a tokenizer in the
first place.
We encountered a problem with scannerless parsers and traditional
programming languages in Meta§. A tokenizer can use a couple of
whitespace or control characters or tokens as token separators. Without
such a rule it's very hard to flag all places in a grammar where
whitespace is *required* as a token separator.
As an example: in an "else if" sequence whitespace is required between
"else" and "if" while in other context e.g. "else{" whitespace is not
required. This problem may be solved by a regex, but why should a
grammar be inflated by adding a token termination clause to *each* keyword?
Fortran is another example where keyword recognition requires
backtracking or similar means due to ignorance of spaces e.g. in the
well known "DO10I = 1," snippet.
DoDi
[A separate lexer also makes it a lot easier to skip comments. For Fortran
you had to do a prepass to see whether a statement was an assignment or something
else but after that you could tokenize without backtracking. -John]
[toc] | [prev] | [next] | [standalone]
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Date | 2022-06-11 23:45 +0300 |
| Message-ID | <22-06-037@comp.compilers> |
| In reply to | #3060 |
Since, Terence borrowed that idea from our version of Yacc++, I feel qualified to answer. (And we in turn borrowed plenty of ideas from ANTLR, so it's all fair.) First, they aren't really merged. The Scannerless parser people often merge the idea, but ANTLR still has them separate and generates a lexer and a parser as two separate entities and there are slight differences between them (as there should be). And, you can actually do them as separate files (and compile them separately) if you want. All that happens is that you have the two parts using roughly the same notation (and if you don't need the lexer specific features, it is essentially a subset of the parsing language, and vice-versa). So, you learn that notation and you know it for both parts. Moreover, by combining the two parts in one file, you know that the parts "go together" and you have less problems with mismatches, especially not the kind where you update one but then have an "old" version of the other which doesn't quite match. It also allows you to introduce "tokens" (especially "keywords") in the parser. (Note you lose this if you compile the parts separately.) A slightly more advanced version of that allows you to have tokens that are only matched in certain contexts. ANTLR has some of this implemented, so if you have a place in your parser where you want to match ">" but not worry about ">>", you can use the literal token in the grammar and it will override the longest match rule, provided it doesn't introduce a conflict. (You also lose this feature with separate compilation.) So, note that by keeping the implementations separate (they are really two phases), you have kept item 1. Your parser never sees whitepace and comments, unless you want it to. You can still do 2 with the implementation. I don't know whether the ANTLR generated lexer does so (or not). Since ANTLR is Unicode based, 3 is not an issue. All of this is why we merged the two files into one in Yacc++ and ANLTR started doing the same thing. We also merged them because our original target was going to be building a syntax directed editor version of Emacs and these ideas made sense in that regard. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------
[toc] | [prev] | [next] | [standalone]
| From | George Neuner <gneuner2@comcast.net> |
|---|---|
| Date | 2022-06-11 18:15 -0400 |
| Message-ID | <22-06-040@comp.compilers> |
| In reply to | #3060 |
On Thu, 9 Jun 2022 14:52:59 +0000, Roger L Costello <costello@mitre.org> wrote: >[elided quote] seem like compelling reasons for separating the lexical analysis >from parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them? > >/Roger > >[1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman. Technically ANTLR does /not/ combine them ... it generates separate lexer and parser. What ANTLR /does/ do is allow specifying both the parser and lexer in a single input file. It also can recognize textual constants within the parser specification and generate lexer rules for them. Note that Yacc and Bison also recognize textual constants in the parser grammar and generate a token id for the (seperately specified) lexer to return. ANTLR goes further only because it can: ANTLR can create a lexer whereas Yacc and Bison cannot. George
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2022-06-12 14:10 +0000 |
| Message-ID | <22-06-042@comp.compilers> |
| In reply to | #3071 |
George Neuner <gneuner2@comcast.net> writes: >Note that Yacc and Bison also recognize textual constants in the parser >grammar and generate a token id for the (seperately specified) lexer >to return. If you have a rule in yacc/bison: E: T '+' T the "token id" for '+' is the ASCII-code of +. Bison generates token ids only for tokens defined with %token. So if you instead write E: T PLUS T you have to define %token PLUS and the value of PLUS is communicated to the scanner through the .tab.h file. Also note that for the last version of yacc that I have seen documentation for, if you have a rule S: L ":=" E there is no token for ":=", but instead what you get is equivalent to S: L ':' '=' E Bison is more capable, you can, e.g., define %token BECOMES ":=" - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web