Groups > comp.compilers > #2835 > unrolled thread

Are compiler developers light-years ahead of other software development?

Started by	Roger L Costello <costello@mitre.org>
First post	2022-01-16 14:36 +0000
Last post	2022-01-23 06:17 +0000
Articles	11 — 6 participants

Back to article view | Back to comp.compilers

  Are compiler developers light-years ahead of other software development? Roger L Costello <costello@mitre.org> - 2022-01-16 14:36 +0000
    Re: Are compiler developers light-years ahead of other software development? Philipp Klaus Krause <pkk@spth.de> - 2022-01-16 22:13 +0100
    Re: Are compiler developers light-years ahead of other software development? gah4 <gah4@u.washington.edu> - 2022-01-17 07:14 -0800
    Re: Are compiler developers light-years ahead of other software development? anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-01-19 21:33 +0000
    Re: Are compiler developers light-years ahead of other software development? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-01-22 03:01 +0000
      Re: Are compiler developers light-years ahead of other software development? anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-01-22 10:43 +0000
        Re: Are compiler developers light-years ahead of other software development? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-01-22 21:38 +0000
      Re: Are compiler developers light-years ahead of other software development? Roger L Costello <costello@mitre.org> - 2022-01-22 12:50 +0000
        Re: Are compiler developers light-years ahead of other software development? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-01-22 21:22 +0000
        Re: Are compiler developers light-years ahead of other software development? Ian Lance Taylor <ianlancetaylor@gmail.com> - 2022-01-22 15:40 -0800
          Re: Are compiler developers light-years ahead of other software development? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-01-23 06:17 +0000

#2835 — Are compiler developers light-years ahead of other software development?

From	Roger L Costello <costello@mitre.org>
Date	2022-01-16 14:36 +0000
Subject	Are compiler developers light-years ahead of other software development?
Message-ID	<22-01-059@comp.compilers>

Hello Compiler Experts!

I am learning the algorithms and theory underlying parsing. Wow! I am
discovering that parsing has such a rich theoretical foundation: grammars, LL,
LR, first() function, following() function, parsing tables, etc.

In all the software that I have written, I am lucky if I can use one or two
algorithms that have a theoretical foundation. Mostly, the software is one-off
code.

The mathematician Alfred North Whitehead has this wonderful phrase [1]:

	... the slow accumulation of clear and relevant ideas.

The parsing community has done that - the community has slowly accumulated
clear and relevant ideas.

I can't think of any other branch of computer science where there is such a
rich foundation upon which to develop software.

Are compiler developers light-years ahead of other software development?

/Roger

[1] An Introduction to Mathematics by Alfred North Whitehead, p. 19.

[toc] | [next] | [standalone]

#2837

From	Philipp Klaus Krause <pkk@spth.de>
Date	2022-01-16 22:13 +0100
Message-ID	<22-01-062@comp.compilers>
In reply to	#2835

On 16.01.22 15:36, Roger L Costello wrote:
> I can't think of any other branch of computer science where there is such a
> rich foundation upon which to develop software.

There are others. But a compiler is something that most softare
developers use, and thus are aware of.

[toc] | [prev] | [next] | [standalone]

#2838

From	gah4 <gah4@u.washington.edu>
Date	2022-01-17 07:14 -0800
Message-ID	<22-01-064@comp.compilers>
In reply to	#2835

On Sunday, January 16, 2022 at 9:26:38 AM UTC-8, Roger L Costello wrote:
> Hello Compiler Experts!

> I am learning the algorithms and theory underlying parsing. Wow! I am
> discovering that parsing has such a rich theoretical foundation: grammars, LL,
> LR, first() function, following() function, parsing tables, etc.

(snip)

> I can't think of any other branch of computer science where there is such a
> rich foundation upon which to develop software.

Thinking about D.E.Knuth's "The Art of Computer Programming", there is a whole
book (volume 3), "Sorting and Searching".  A large fraction of important
algorithms rely on the foundation of search algorithms.  (And if you sort, you can
use binary search to find thinks.)

Now, some of the book is based on the now lost art of sorting data on
magnetic tape, where one has only sequential access to most of the data.

Others will tell you that Quicksort is all you need to know, and forget
the rest of the book.

In any case, I vote for sorting and searching algorithms as the rich theoretical
foundation of much of CS.  Hash tables are in important part
of many compilers.  Dynamic programming algorithms are used in
many optimization problems, and fundamental to many algorithms.

The dynamic programming algorithm used by the Unix diff command,
to find an optimal set of differences between two files, originated
from biologists comparing protein sequences.

[toc] | [prev] | [next] | [standalone]

#2847

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2022-01-19 21:33 +0000
Message-ID	<22-01-074@comp.compilers>
In reply to	#2835

Roger L Costello <costello@mitre.org> writes:
>I am learning the algorithms and theory underlying parsing. Wow! I am
>discovering that parsing has such a rich theoretical foundation: grammars, LL,
>LR, first() function, following() function, parsing tables, etc.
>
>In all the software that I have written, I am lucky if I can use one or two
>algorithms that have a theoretical foundation. Mostly, the software is one-off
>code.
...
>Are compiler developers light-years ahead of other software development?

There are multiple facets to this question:

1) Compilers are decades ahead of much other software development: If
you look at early CS curricula (I have looked at one from IIRC 1965),
you see compilers and a few other topics, but many of the topics that
now fill the curricula did not even exist then.  E.g., BNF was
invented before Algol 60 was published in 1960, while the relational
model for data bases is from 1970, the concept of NP-completeness is
from 1971, and RISC architectures (in the form of the IBM 801) were
invented in the mid-1970s.  And if you look at currently hot topics
like deep learning, cloud computing and IoT, their practical relevance
is much younger (although I guess you can point to 1960s work for each
of them that points in that direction).

2) As I have written some time ago:

|Compilers are ideal software projects: You have relatively stable and
|well-defined, partly even formal specifications for the input and the
|output, and lots of interesting subproblems in between.
|
|In particular, the scanning part of compilers has been formalized
|using regular languages and is expressed in regular expressions, which
|have become not only popular for scanning programming languages, but
|for many other tasks as well.  The parsing part has been formalized
|with context-free grammars in (extended) BNF and that has led to the
|development of parser generators that are quite popular in language
|implementations.

There is also similar formalization in instruction selection (although
my impression is that generators are not as widely used there as in
the front end).

However, in between things are more like more common programming, even
though people have tried to propagate the formalization success in
these areas, too.

And I think that points to why we don't see that much of that kind of
formalization in most other areas: context-free grammars are something
that most programming language designers want to use, so developing
methods for parsing such languages and putting them in parser
generators has a significant payoff, while a formal semantics becomes
so inflexible that most programming language designers prefer to stick
to more or less informal semantics (in some cases with some parts
formalized), and then of course you implement the semantics with a
less formal approach, too.  And for most other software, it's similar.

3) The love of formal stuff can lead you astray.  E.g., the first
Fortran compiler did not report syntax errors.

Of course, mistakes also happen in less formalized software; my point
is that if you focus your attention on the nice formal stuff, you can
easily overlook that your software has other aspects, too, that you
need to cover.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[I am reasonably sure the original Fortran compiler reported syntax
errors.  A 1957 paper at bitsavers says it did:
http://bitsavers.org/pdf/ibm/704/FORTRAN_paper_1957.pdf
-John]

[toc] | [prev] | [next] | [standalone]

#2853

From	Kaz Kylheku <480-992-1380@kylheku.com>
Date	2022-01-22 03:01 +0000
Message-ID	<22-01-083@comp.compilers>
In reply to	#2835

On 2022-01-16, Roger L Costello <costello@mitre.org> wrote:
> Are compiler developers light-years ahead of other software development?

I'd say it's pretty impressive how, say, the Google Assistant can speak
completely naturally (and in multiple languages). I mean, details like
intonation across entire sentences and such; nothing "robot-like".

Fantastic advances have been done in computer graphics in the last 30
years.

Compilers don't all use fancy algorithms, or not all of them. Fancy
algorithms are specialized, optimized (in particular ways) solutions to
problems that have other solutions that are pedestrian. Sometimes the
other solutions are actually faster on your input cases.

None of the buzzwords you mentioned related to parsing are needed in
building a compiler: grammars, LL, LR, first() function, following()
function, parsing tables, etc.

All that goes away if you write recursive descent parser. The grammar is
then in the documentation and code comments only, and somewhat expressed
in its code structure. There is no LR, first or follow, no tables.

The GNU C++ compiler, undeniably a production compiler and a
major/known/widely-used one, has a big recursive-descent parser
maintained by hand: over a megabyte of code.

In other words, a major compiler for probably the programming language
with the most complicated syntax ever, eschews pretty much all that we
have learned and accumulated about parsing between around 1968 and now.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[In principle a r-d parser recognizes an LL(1) grammar, in practice people
often cheat a little. -John]

[toc] | [prev] | [next] | [standalone]

#2855

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2022-01-22 10:43 +0000
Message-ID	<22-01-088@comp.compilers>
In reply to	#2853

Kaz Kylheku <480-992-1380@kylheku.com> writes:
>None of the buzzwords you mentioned related to parsing are needed in
>building a compiler: grammars, LL, LR, first() function, following()
>function, parsing tables, etc.
>
>All that goes away if you write recursive descent parser. The grammar is
>then in the documentation and code comments only, and somewhat expressed
>in its code structure. There is no LR, first or follow, no tables.

As our moderator notes, recursive descent is an implementation
technique for LL(1) grammars.  Moreover, you need first-sets to
implement them.  If your grammar is

A -> B|C

your recursive-descent parser becomes:

if next_symbol in first(B) then parse_B else parse_C fi

If you want to learn about conflicts in your grammar early on (rather
than finding it out later by having some input be parsed other than
intended), you also need follow-sets: if your grammar is

A -> B|epsilon

there is a conflict if "intersection(first(b),follow(A))" (where "U"
signifies the union) is not empty: if one of the symbols in the
intersection comes up at the start of an A, do you parse B or not?

Of course, if you are freshly designing a programming language, you
may be fine with declaring the actual behaviour of the
redursive-descent parser to be the intended behaviour.  But if you
have to implement a given grammar, things become more complex.

You can also design the syntax to make recursive-descent parsing easy,
like Wirth did.  E.g., almost all statements start with a keyword, and
there is AFAIK only one statement that may be started by several
symbols (none of them conflicting with one of the other statement
keywords), so you can easily implement statements like this:

if next_symbol=IF_SYM then ifstatement()
else if next_symbol=WHILE_SYM then whilestatement()
else if ...
...
else assignmentstatement()

All statements are followed by ";" or END, making it easy to know that
there are no conflicts with following statements.

These properties also make it easier for humans to understand
grammars, so they are a good idea in syntax design.  If you go further
in that direction, you get to Lisp (which does not need a BNF-based
parser) or Forth (where the parser just looks for the next white
space).  Of course, if you look at textbooks for these languages, you
find that you have patterns at a different ("static semantics") level,
but at least it's always clear in these language from the syntax which
if an else-branch belongs to.

>The GNU C++ compiler, undeniably a production compiler and a
>major/known/widely-used one, has a big recursive-descent parser
>maintained by hand: over a megabyte of code.
>
>In other words, a major compiler for probably the programming language
>with the most complicated syntax ever, eschews pretty much all that we
>have learned and accumulated about parsing between around 1968 and now.

C++ is the language that inspired Terrence Parr to add synactic and
semantic predicates to PCCTS/ANTLR (not sure what has been added since
I heard him talk about it in the early 1990s).  So basically parser
generators like yacc were not sufficient for the problems posed by the
C++ syntax; when G++ was started in 1987, no free parser generator had
such features, so the decision to go with a hand-written parser is
understandable, and of course since then there were two good reasons
to stay there: 1) Transition costs.  2) Backwards compatibility.

Looking further back, AFAIK Cfront also used a hand-written parser;
this (together with the requirement of C compatibility) resulted in a
syntax design that was not constrained by the parser generators of the
time, but of course also a syntax that was beyond their capabilities.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[Nope, cfront used a yacc parser with a whole lot of %left and %right
declarations to make shift/reduce warnings go away. Legend says that
much odd C++ syntax is from poorly understood consequences of those
declarations. See it at
http://www.softwarepreservation.org/projects/c_plus_plus/index.html#cfront
-John]

[toc] | [prev] | [next] | [standalone]

#2860

From	Kaz Kylheku <480-992-1380@kylheku.com>
Date	2022-01-22 21:38 +0000
Message-ID	<22-01-095@comp.compilers>
In reply to	#2855

On 2022-01-22, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> Kaz Kylheku <480-992-1380@kylheku.com> writes:
>>None of the buzzwords you mentioned related to parsing are needed in
>>building a compiler: grammars, LL, LR, first() function, following()
>>function, parsing tables, etc.
>>
>>All that goes away if you write recursive descent parser. The grammar is
>>then in the documentation and code comments only, and somewhat expressed
>>in its code structure. There is no LR, first or follow, no tables.
>
> As our moderator notes, recursive descent is an implementation
> technique for LL(1) grammars.  Moreover, you need first-sets to
> implement them.  If your grammar is
>
> A -> B|C
>
> your recursive-descent parser becomes:

I touched on this point with "All that goes away if you write recursive
descent parser. The grammar is then in the documentation and code
comments only, and somewhat expressed in its code structure."

So it's not that we don't have a grammar or don't know that it's LL;
perhaps we do. But the tooling doesn't.

> if next_symbol in first(B) then parse_B else parse_C fi

In a given parser, there might be no conscious connection to any such
theoretical entities, even though the theory readily identifies those
elements in the resulting work.

(Kind of like a musician who plays by ear is using certain musical
devices that he might not be able to name, but music scholars can.)
>
> If you want to learn about conflicts in your grammar early on (rather
> than finding it out later by having some input be parsed other than
> intended), you also need follow-sets: if your grammar is
>
> A -> B|epsilon
>
> there is a conflict if "intersection(first(b),follow(A))" (where "U"
> signifies the union) is not empty: if one of the symbols in the
> intersection comes up at the start of an A, do you parse B or not?

You can easily parse both ways and pick one. There are readily available
backtracking techniques, even in blub languages.

Years ago I wrote a C expression parser which did this quite well using
exceptions based on setjmp and longjmp.

(Because it parsed expression shapes outside of any declaration context,
so it woudln't know, say, whether (A)(B) is the expression (B) being
cast to type (A), or function (A) being called with argument (B). The
goal of the parser was to determine whether the expression could have
side effects, with a tri-valued result: no, possibly, yes.)

"Parse expression grammars" formalize this more.

That sort of approach has to be carefully wielded, because you can
get poor performance due to explosion of the search space, with several
levels of speculative parsing.

> Of course, if you are freshly designing a programming language, you
> may be fine with declaring the actual behaviour of the
> redursive-descent parser to be the intended behaviour.  But if you
> have to implement a given grammar, things become more complex.

If you have to implement a given grammar, with any tool or approach,
you need to have, or develop as you go along, a comprehensive regression
test suite, hitting as many of the the corner cases as possible.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

[toc] | [prev] | [next] | [standalone]

#2856

From	Roger L Costello <costello@mitre.org>
Date	2022-01-22 12:50 +0000
Message-ID	<22-01-090@comp.compilers>
In reply to	#2853

Kaz Kylheku wrote this about the C++ compiler:

> In other words, a major compiler for probably
> the programming language with the most
> complicated syntax ever, eschews pretty much
> all that we have learned and accumulated about
> parsing between around 1968 and now.

Yikes!

They ignored the rich theory and vast set of algorithms, in favor of their own
proprietary code? Why would the C++ compiler developers do such a thing?

/Roger
[My guess is that they were too busy chopping down trees to sharpen their axes. -John]

[toc] | [prev] | [next] | [standalone]

#2859

From	Kaz Kylheku <480-992-1380@kylheku.com>
Date	2022-01-22 21:22 +0000
Message-ID	<22-01-094@comp.compilers>
In reply to	#2856

On 2022-01-22, Roger L Costello <costello@mitre.org> wrote:
> Kaz Kylheku wrote this about the C++ compiler:
>
>> In other words, a major compiler for probably
>> the programming language with the most
>> complicated syntax ever, eschews pretty much
>> all that we have learned and accumulated about
>> parsing between around 1968 and now.
>
> Yikes!
>
> They ignored the rich theory and vast set of algorithms, in favor of
> their own proprietary code? Why would the C++ compiler developers do
> such a thing?

Wild guess:

Suppose that there is a foo_statement which contains a bar_designator,
and something is wrong in there. They can fire up gdb, and put a simple
breakpoint on bar_designator, feed in the test case and get a call stack
in which parse_bar_designator is called by parse_foo_statement.
They can examine all the locals, and arguments up the stack.

Typically, nothing like this is easily possible with the
theoretically-based parser generation tools.

And in C++, you're likely going to be debugging parsing quite a bit.

The main parser generation tools used by GNU projects are Flex and
Bison. These tools are moving targets; especially Bison. The common
practice is to ship the generated parser.

(In a fundamental compiler project, you have to for other reasons, like
the users not having thta tool installed, because maybe they need your
compiler to build it: you want as few dependencies as possible.)

Now GCC is hacked on quite a bit and has lots of contributors. It would
be annoying to tell people "Oh, just use the generated, shipped
parsers if you're not touching the grammar; if you need to regenerate,
please use the exact version Bison X.Y.Z.".

If a  parser is hand-written, that whole sort of problem goes away.

> /Roger
> [My guess is that they were too busy chopping down trees to sharpen their axes. -John]

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

[toc] | [prev] | [next] | [standalone]

#2861

From	Ian Lance Taylor <ianlancetaylor@gmail.com>
Date	2022-01-22 15:40 -0800
Message-ID	<22-01-096@comp.compilers>
In reply to	#2856

On Sat, Jan 22, 2022 at 3:26 PM Roger L Costello <costello@mitre.org> wrote:

>
> They ignored the rich theory and vast set of algorithms, in favor of their
> own
> proprietary code? Why would the C++ compiler developers do such a thing?
>
> /Roger
> [My guess is that they were too busy chopping down trees to sharpen their
> axes. -John]
>

The change in GCC away from using bison to a recursive descent parser was a
considered decision on the part of the GCC C++ maintainers.  I believe that
the project started at
https://gcc.gnu.org/legacy-ml/gcc/2000-10/msg00573.html and it was
committed to the tree in December, 2002.

In my experience bison/yacc parsers are really good for knowing the exact
language that you are parsing.  But it is difficult to make them generate
good error messages and it is difficult to debug them.  A language like C++
can only be parsed in conjunction with a symbol table, because the parsing
depends on the semantic meaning of tokens; that too is difficult to do
using yacc.  So while I agree that yacc parsers are theoretically superior,
I believe that experience shows that they have some problems in practice.

In fact, I would be interested in knowing whether there is any generally
used C++ compiler that uses a yacc parser.  For example, clang also uses a
recursive descent parser.

Ian

[toc] | [prev] | [next] | [standalone]

#2864

From	Kaz Kylheku <480-992-1380@kylheku.com>
Date	2022-01-23 06:17 +0000
Message-ID	<22-01-103@comp.compilers>
In reply to	#2861

On 2022-01-22, Ian Lance Taylor <ianlancetaylor@gmail.com> wrote:
> In my experience bison/yacc parsers are really good for knowing the exact
> language that you are parsing.

In my experience, you can easily know what language you're processing
with Yacc is if you avoid its /ad hoc/ features for suppressing conflict
messages. Namely:

- %left, %right and %nonassoc declarations for tokens.

- %prec in rules.

Otherwise, you're likely going to be relying on your regression test
cases to inform you what language you're parsing.

The exception is that %left, %right are readily understood if they are
only used for the operator tokens in the productions for a binary
operator expression grammar (that likely being their motivating use case).

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

[toc] | [prev] | [standalone]

csiph-web

Are compiler developers light-years ahead of other software development?

Contents

#2835 — Are compiler developers light-years ahead of other software development?

#2837

#2838

#2847

#2853

#2855

#2860

#2856

#2859

#2861

#2864