Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2835 > unrolled thread
| Started by | Roger L Costello <costello@mitre.org> |
|---|---|
| First post | 2022-01-16 14:36 +0000 |
| Last post | 2022-01-23 06:17 +0000 |
| Articles | 11 — 6 participants |
Back to article view | Back to comp.compilers
Are compiler developers light-years ahead of other software development? Roger L Costello <costello@mitre.org> - 2022-01-16 14:36 +0000
Re: Are compiler developers light-years ahead of other software development? Philipp Klaus Krause <pkk@spth.de> - 2022-01-16 22:13 +0100
Re: Are compiler developers light-years ahead of other software development? gah4 <gah4@u.washington.edu> - 2022-01-17 07:14 -0800
Re: Are compiler developers light-years ahead of other software development? anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-01-19 21:33 +0000
Re: Are compiler developers light-years ahead of other software development? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-01-22 03:01 +0000
Re: Are compiler developers light-years ahead of other software development? anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-01-22 10:43 +0000
Re: Are compiler developers light-years ahead of other software development? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-01-22 21:38 +0000
Re: Are compiler developers light-years ahead of other software development? Roger L Costello <costello@mitre.org> - 2022-01-22 12:50 +0000
Re: Are compiler developers light-years ahead of other software development? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-01-22 21:22 +0000
Re: Are compiler developers light-years ahead of other software development? Ian Lance Taylor <ianlancetaylor@gmail.com> - 2022-01-22 15:40 -0800
Re: Are compiler developers light-years ahead of other software development? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-01-23 06:17 +0000
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2022-01-16 14:36 +0000 |
| Subject | Are compiler developers light-years ahead of other software development? |
| Message-ID | <22-01-059@comp.compilers> |
Hello Compiler Experts! I am learning the algorithms and theory underlying parsing. Wow! I am discovering that parsing has such a rich theoretical foundation: grammars, LL, LR, first() function, following() function, parsing tables, etc. In all the software that I have written, I am lucky if I can use one or two algorithms that have a theoretical foundation. Mostly, the software is one-off code. The mathematician Alfred North Whitehead has this wonderful phrase [1]: ... the slow accumulation of clear and relevant ideas. The parsing community has done that - the community has slowly accumulated clear and relevant ideas. I can't think of any other branch of computer science where there is such a rich foundation upon which to develop software. Are compiler developers light-years ahead of other software development? /Roger [1] An Introduction to Mathematics by Alfred North Whitehead, p. 19.
[toc] | [next] | [standalone]
| From | Philipp Klaus Krause <pkk@spth.de> |
|---|---|
| Date | 2022-01-16 22:13 +0100 |
| Message-ID | <22-01-062@comp.compilers> |
| In reply to | #2835 |
On 16.01.22 15:36, Roger L Costello wrote: > I can't think of any other branch of computer science where there is such a > rich foundation upon which to develop software. There are others. But a compiler is something that most softare developers use, and thus are aware of.
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2022-01-17 07:14 -0800 |
| Message-ID | <22-01-064@comp.compilers> |
| In reply to | #2835 |
On Sunday, January 16, 2022 at 9:26:38 AM UTC-8, Roger L Costello wrote: > Hello Compiler Experts! > I am learning the algorithms and theory underlying parsing. Wow! I am > discovering that parsing has such a rich theoretical foundation: grammars, LL, > LR, first() function, following() function, parsing tables, etc. (snip) > I can't think of any other branch of computer science where there is such a > rich foundation upon which to develop software. Thinking about D.E.Knuth's "The Art of Computer Programming", there is a whole book (volume 3), "Sorting and Searching". A large fraction of important algorithms rely on the foundation of search algorithms. (And if you sort, you can use binary search to find thinks.) Now, some of the book is based on the now lost art of sorting data on magnetic tape, where one has only sequential access to most of the data. Others will tell you that Quicksort is all you need to know, and forget the rest of the book. In any case, I vote for sorting and searching algorithms as the rich theoretical foundation of much of CS. Hash tables are in important part of many compilers. Dynamic programming algorithms are used in many optimization problems, and fundamental to many algorithms. The dynamic programming algorithm used by the Unix diff command, to find an optimal set of differences between two files, originated from biologists comparing protein sequences.
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2022-01-19 21:33 +0000 |
| Message-ID | <22-01-074@comp.compilers> |
| In reply to | #2835 |
Roger L Costello <costello@mitre.org> writes: >I am learning the algorithms and theory underlying parsing. Wow! I am >discovering that parsing has such a rich theoretical foundation: grammars, LL, >LR, first() function, following() function, parsing tables, etc. > >In all the software that I have written, I am lucky if I can use one or two >algorithms that have a theoretical foundation. Mostly, the software is one-off >code. ... >Are compiler developers light-years ahead of other software development? There are multiple facets to this question: 1) Compilers are decades ahead of much other software development: If you look at early CS curricula (I have looked at one from IIRC 1965), you see compilers and a few other topics, but many of the topics that now fill the curricula did not even exist then. E.g., BNF was invented before Algol 60 was published in 1960, while the relational model for data bases is from 1970, the concept of NP-completeness is from 1971, and RISC architectures (in the form of the IBM 801) were invented in the mid-1970s. And if you look at currently hot topics like deep learning, cloud computing and IoT, their practical relevance is much younger (although I guess you can point to 1960s work for each of them that points in that direction). 2) As I have written some time ago: |Compilers are ideal software projects: You have relatively stable and |well-defined, partly even formal specifications for the input and the |output, and lots of interesting subproblems in between. | |In particular, the scanning part of compilers has been formalized |using regular languages and is expressed in regular expressions, which |have become not only popular for scanning programming languages, but |for many other tasks as well. The parsing part has been formalized |with context-free grammars in (extended) BNF and that has led to the |development of parser generators that are quite popular in language |implementations. There is also similar formalization in instruction selection (although my impression is that generators are not as widely used there as in the front end). However, in between things are more like more common programming, even though people have tried to propagate the formalization success in these areas, too. And I think that points to why we don't see that much of that kind of formalization in most other areas: context-free grammars are something that most programming language designers want to use, so developing methods for parsing such languages and putting them in parser generators has a significant payoff, while a formal semantics becomes so inflexible that most programming language designers prefer to stick to more or less informal semantics (in some cases with some parts formalized), and then of course you implement the semantics with a less formal approach, too. And for most other software, it's similar. 3) The love of formal stuff can lead you astray. E.g., the first Fortran compiler did not report syntax errors. Of course, mistakes also happen in less formalized software; my point is that if you focus your attention on the nice formal stuff, you can easily overlook that your software has other aspects, too, that you need to cover. - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/ [I am reasonably sure the original Fortran compiler reported syntax errors. A 1957 paper at bitsavers says it did: http://bitsavers.org/pdf/ibm/704/FORTRAN_paper_1957.pdf -John]
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <480-992-1380@kylheku.com> |
|---|---|
| Date | 2022-01-22 03:01 +0000 |
| Message-ID | <22-01-083@comp.compilers> |
| In reply to | #2835 |
On 2022-01-16, Roger L Costello <costello@mitre.org> wrote: > Are compiler developers light-years ahead of other software development? I'd say it's pretty impressive how, say, the Google Assistant can speak completely naturally (and in multiple languages). I mean, details like intonation across entire sentences and such; nothing "robot-like". Fantastic advances have been done in computer graphics in the last 30 years. Compilers don't all use fancy algorithms, or not all of them. Fancy algorithms are specialized, optimized (in particular ways) solutions to problems that have other solutions that are pedestrian. Sometimes the other solutions are actually faster on your input cases. None of the buzzwords you mentioned related to parsing are needed in building a compiler: grammars, LL, LR, first() function, following() function, parsing tables, etc. All that goes away if you write recursive descent parser. The grammar is then in the documentation and code comments only, and somewhat expressed in its code structure. There is no LR, first or follow, no tables. The GNU C++ compiler, undeniably a production compiler and a major/known/widely-used one, has a big recursive-descent parser maintained by hand: over a megabyte of code. In other words, a major compiler for probably the programming language with the most complicated syntax ever, eschews pretty much all that we have learned and accumulated about parsing between around 1968 and now. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal [In principle a r-d parser recognizes an LL(1) grammar, in practice people often cheat a little. -John]
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2022-01-22 10:43 +0000 |
| Message-ID | <22-01-088@comp.compilers> |
| In reply to | #2853 |
Kaz Kylheku <480-992-1380@kylheku.com> writes:
>None of the buzzwords you mentioned related to parsing are needed in
>building a compiler: grammars, LL, LR, first() function, following()
>function, parsing tables, etc.
>
>All that goes away if you write recursive descent parser. The grammar is
>then in the documentation and code comments only, and somewhat expressed
>in its code structure. There is no LR, first or follow, no tables.
As our moderator notes, recursive descent is an implementation
technique for LL(1) grammars. Moreover, you need first-sets to
implement them. If your grammar is
A -> B|C
your recursive-descent parser becomes:
if next_symbol in first(B) then parse_B else parse_C fi
If you want to learn about conflicts in your grammar early on (rather
than finding it out later by having some input be parsed other than
intended), you also need follow-sets: if your grammar is
A -> B|epsilon
there is a conflict if "intersection(first(b),follow(A))" (where "U"
signifies the union) is not empty: if one of the symbols in the
intersection comes up at the start of an A, do you parse B or not?
Of course, if you are freshly designing a programming language, you
may be fine with declaring the actual behaviour of the
redursive-descent parser to be the intended behaviour. But if you
have to implement a given grammar, things become more complex.
You can also design the syntax to make recursive-descent parsing easy,
like Wirth did. E.g., almost all statements start with a keyword, and
there is AFAIK only one statement that may be started by several
symbols (none of them conflicting with one of the other statement
keywords), so you can easily implement statements like this:
if next_symbol=IF_SYM then ifstatement()
else if next_symbol=WHILE_SYM then whilestatement()
else if ...
...
else assignmentstatement()
All statements are followed by ";" or END, making it easy to know that
there are no conflicts with following statements.
These properties also make it easier for humans to understand
grammars, so they are a good idea in syntax design. If you go further
in that direction, you get to Lisp (which does not need a BNF-based
parser) or Forth (where the parser just looks for the next white
space). Of course, if you look at textbooks for these languages, you
find that you have patterns at a different ("static semantics") level,
but at least it's always clear in these language from the syntax which
if an else-branch belongs to.
>The GNU C++ compiler, undeniably a production compiler and a
>major/known/widely-used one, has a big recursive-descent parser
>maintained by hand: over a megabyte of code.
>
>In other words, a major compiler for probably the programming language
>with the most complicated syntax ever, eschews pretty much all that we
>have learned and accumulated about parsing between around 1968 and now.
C++ is the language that inspired Terrence Parr to add synactic and
semantic predicates to PCCTS/ANTLR (not sure what has been added since
I heard him talk about it in the early 1990s). So basically parser
generators like yacc were not sufficient for the problems posed by the
C++ syntax; when G++ was started in 1987, no free parser generator had
such features, so the decision to go with a hand-written parser is
understandable, and of course since then there were two good reasons
to stay there: 1) Transition costs. 2) Backwards compatibility.
Looking further back, AFAIK Cfront also used a hand-written parser;
this (together with the requirement of C compatibility) resulted in a
syntax design that was not constrained by the parser generators of the
time, but of course also a syntax that was beyond their capabilities.
- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[Nope, cfront used a yacc parser with a whole lot of %left and %right
declarations to make shift/reduce warnings go away. Legend says that
much odd C++ syntax is from poorly understood consequences of those
declarations. See it at
http://www.softwarepreservation.org/projects/c_plus_plus/index.html#cfront
-John]
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <480-992-1380@kylheku.com> |
|---|---|
| Date | 2022-01-22 21:38 +0000 |
| Message-ID | <22-01-095@comp.compilers> |
| In reply to | #2855 |
On 2022-01-22, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote: > Kaz Kylheku <480-992-1380@kylheku.com> writes: >>None of the buzzwords you mentioned related to parsing are needed in >>building a compiler: grammars, LL, LR, first() function, following() >>function, parsing tables, etc. >> >>All that goes away if you write recursive descent parser. The grammar is >>then in the documentation and code comments only, and somewhat expressed >>in its code structure. There is no LR, first or follow, no tables. > > As our moderator notes, recursive descent is an implementation > technique for LL(1) grammars. Moreover, you need first-sets to > implement them. If your grammar is > > A -> B|C > > your recursive-descent parser becomes: I touched on this point with "All that goes away if you write recursive descent parser. The grammar is then in the documentation and code comments only, and somewhat expressed in its code structure." So it's not that we don't have a grammar or don't know that it's LL; perhaps we do. But the tooling doesn't. > if next_symbol in first(B) then parse_B else parse_C fi In a given parser, there might be no conscious connection to any such theoretical entities, even though the theory readily identifies those elements in the resulting work. (Kind of like a musician who plays by ear is using certain musical devices that he might not be able to name, but music scholars can.) > > If you want to learn about conflicts in your grammar early on (rather > than finding it out later by having some input be parsed other than > intended), you also need follow-sets: if your grammar is > > A -> B|epsilon > > there is a conflict if "intersection(first(b),follow(A))" (where "U" > signifies the union) is not empty: if one of the symbols in the > intersection comes up at the start of an A, do you parse B or not? You can easily parse both ways and pick one. There are readily available backtracking techniques, even in blub languages. Years ago I wrote a C expression parser which did this quite well using exceptions based on setjmp and longjmp. (Because it parsed expression shapes outside of any declaration context, so it woudln't know, say, whether (A)(B) is the expression (B) being cast to type (A), or function (A) being called with argument (B). The goal of the parser was to determine whether the expression could have side effects, with a tri-valued result: no, possibly, yes.) "Parse expression grammars" formalize this more. That sort of approach has to be carefully wielded, because you can get poor performance due to explosion of the search space, with several levels of speculative parsing. > Of course, if you are freshly designing a programming language, you > may be fine with declaring the actual behaviour of the > redursive-descent parser to be the intended behaviour. But if you > have to implement a given grammar, things become more complex. If you have to implement a given grammar, with any tool or approach, you need to have, or develop as you go along, a comprehensive regression test suite, hitting as many of the the corner cases as possible. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[toc] | [prev] | [next] | [standalone]
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2022-01-22 12:50 +0000 |
| Message-ID | <22-01-090@comp.compilers> |
| In reply to | #2853 |
Kaz Kylheku wrote this about the C++ compiler: > In other words, a major compiler for probably > the programming language with the most > complicated syntax ever, eschews pretty much > all that we have learned and accumulated about > parsing between around 1968 and now. Yikes! They ignored the rich theory and vast set of algorithms, in favor of their own proprietary code? Why would the C++ compiler developers do such a thing? /Roger [My guess is that they were too busy chopping down trees to sharpen their axes. -John]
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <480-992-1380@kylheku.com> |
|---|---|
| Date | 2022-01-22 21:22 +0000 |
| Message-ID | <22-01-094@comp.compilers> |
| In reply to | #2856 |
On 2022-01-22, Roger L Costello <costello@mitre.org> wrote: > Kaz Kylheku wrote this about the C++ compiler: > >> In other words, a major compiler for probably >> the programming language with the most >> complicated syntax ever, eschews pretty much >> all that we have learned and accumulated about >> parsing between around 1968 and now. > > Yikes! > > They ignored the rich theory and vast set of algorithms, in favor of > their own proprietary code? Why would the C++ compiler developers do > such a thing? Wild guess: Suppose that there is a foo_statement which contains a bar_designator, and something is wrong in there. They can fire up gdb, and put a simple breakpoint on bar_designator, feed in the test case and get a call stack in which parse_bar_designator is called by parse_foo_statement. They can examine all the locals, and arguments up the stack. Typically, nothing like this is easily possible with the theoretically-based parser generation tools. And in C++, you're likely going to be debugging parsing quite a bit. The main parser generation tools used by GNU projects are Flex and Bison. These tools are moving targets; especially Bison. The common practice is to ship the generated parser. (In a fundamental compiler project, you have to for other reasons, like the users not having thta tool installed, because maybe they need your compiler to build it: you want as few dependencies as possible.) Now GCC is hacked on quite a bit and has lots of contributors. It would be annoying to tell people "Oh, just use the generated, shipped parsers if you're not touching the grammar; if you need to regenerate, please use the exact version Bison X.Y.Z.". If a parser is hand-written, that whole sort of problem goes away. > /Roger > [My guess is that they were too busy chopping down trees to sharpen their axes. -John] -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[toc] | [prev] | [next] | [standalone]
| From | Ian Lance Taylor <ianlancetaylor@gmail.com> |
|---|---|
| Date | 2022-01-22 15:40 -0800 |
| Message-ID | <22-01-096@comp.compilers> |
| In reply to | #2856 |
On Sat, Jan 22, 2022 at 3:26 PM Roger L Costello <costello@mitre.org> wrote: > > They ignored the rich theory and vast set of algorithms, in favor of their > own > proprietary code? Why would the C++ compiler developers do such a thing? > > /Roger > [My guess is that they were too busy chopping down trees to sharpen their > axes. -John] > The change in GCC away from using bison to a recursive descent parser was a considered decision on the part of the GCC C++ maintainers. I believe that the project started at https://gcc.gnu.org/legacy-ml/gcc/2000-10/msg00573.html and it was committed to the tree in December, 2002. In my experience bison/yacc parsers are really good for knowing the exact language that you are parsing. But it is difficult to make them generate good error messages and it is difficult to debug them. A language like C++ can only be parsed in conjunction with a symbol table, because the parsing depends on the semantic meaning of tokens; that too is difficult to do using yacc. So while I agree that yacc parsers are theoretically superior, I believe that experience shows that they have some problems in practice. In fact, I would be interested in knowing whether there is any generally used C++ compiler that uses a yacc parser. For example, clang also uses a recursive descent parser. Ian
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <480-992-1380@kylheku.com> |
|---|---|
| Date | 2022-01-23 06:17 +0000 |
| Message-ID | <22-01-103@comp.compilers> |
| In reply to | #2861 |
On 2022-01-22, Ian Lance Taylor <ianlancetaylor@gmail.com> wrote: > In my experience bison/yacc parsers are really good for knowing the exact > language that you are parsing. In my experience, you can easily know what language you're processing with Yacc is if you avoid its /ad hoc/ features for suppressing conflict messages. Namely: - %left, %right and %nonassoc declarations for tokens. - %prec in rules. Otherwise, you're likely going to be relying on your regression test cases to inform you what language you're parsing. The exception is that %left, %right are readily understood if they are only used for the operator tokens in the productions for a binary operator expression grammar (that likely being their motivating use case). -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web