Groups > comp.compilers > #3459 > unrolled thread

binary search debugging of compilers

Started by	Russ Cox <rsc@swtch.com>
First post	2023-05-12 13:59 -0400
Last post	2023-05-15 21:35 +0000
Articles	17 — 8 participants

Back to article view | Back to comp.compilers

  binary search debugging of compilers Russ Cox <rsc@swtch.com> - 2023-05-12 13:59 -0400
    Re: binary search debugging of compilers Kaz Kylheku <864-117-4973@kylheku.com> - 2023-05-13 03:20 +0000
    Re: binary search debugging of compilers Fernando <pronesto@gmail.com> - 2023-05-13 04:47 -0700
      Re: binary search debugging of compilers Kaz Kylheku <864-117-4973@kylheku.com> - 2023-05-14 02:49 +0000
        Re: binary search debugging of compilers gah4 <gah4@u.washington.edu> - 2023-05-14 13:38 -0700
          Re: binary search debugging of compilers Kaz Kylheku <864-117-4973@kylheku.com> - 2023-05-15 21:52 +0000
            Re: binary search debugging of compilers gah4 <gah4@u.washington.edu> - 2023-05-16 23:52 -0700
              Re: binary search debugging of compilers Kaz Kylheku <864-117-4973@kylheku.com> - 2023-05-17 18:28 +0000
                Re: binary search debugging of compilers gah4 <gah4@u.washington.edu> - 2023-05-17 15:23 -0700
                  Re: binary search debugging of compilers Kaz Kylheku <864-117-4973@kylheku.com> - 2023-05-19 03:21 +0000
                    Re: binary search debugging of compilers Thomas Koenig <tkoenig@netcologne.de> - 2023-05-19 21:59 +0000
                      Re: binary search debugging of compilers gah4 <gah4@u.washington.edu> - 2023-05-20 20:20 -0700
                        Re: Old C compilers, binary search debugging of compilers Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2023-05-22 09:05 +0200
                binary search debugging of compilers Max B <tekk.nolagi@gmail.com> - 2023-05-19 16:31 -0500
    Re: binary search debugging of compilers Thomas Koenig <tkoenig@netcologne.de> - 2023-05-14 19:59 +0000
      Re: binary search debugging of compilers Cameron McInally <cameron.mcinally@nyu.edu> - 2023-05-14 23:28 -0400
      Re: binary search debugging of compilers Kaz Kylheku <864-117-4973@kylheku.com> - 2023-05-15 21:35 +0000

#3459 — binary search debugging of compilers

From	Russ Cox <rsc@swtch.com>
Date	2023-05-12 13:59 -0400
Subject	binary search debugging of compilers
Message-ID	<23-05-003@comp.compilers>

Hi all,

In the Go compiler, we have been using a technique for the past eight
years that I assume has been done in other compilers before, but I'm
unaware of any. I describe it below, along with refinements we've made
over the years. I'd be very interested to hear about earlier uses of
approaches like this, or of any other ideas for use cases.

In 2015, Keith Randall was working on a new SSA backend for the Go
compiler. The old and new backends coexisted, and we could use either
for any given function in the compilation. Keith introduced an
environment variable GOSSAHASH that specifies the last few binary
digits of the hash of function names that should use the new backend.
So GOSSAHASH=0110 means compile only those functions whose names hash
to a value with last four bits 0110. When a test is failing with the
new backend, you try GOSSAHASH=0 and GOSSAHASH=1, and then use binary
search to progressively refine the pattern, narrowing the failure down
until only a single function is being compiled with the new backend.
This was invaluable for approaching failures in large real-world tests
(tests for libraries or production code, not for the compiler) that we
had not written and did not understand.

In 2016, David Chase was debugging a new optimizer rewrite rule that
should have been correct but was causing mysterious test failures.
He reused the same technique at finer granularity: the bit pattern now
matched against a hash of the current file name and line number,
so that the binary search pinpoints the exact line of code that is
miscompiled (meaning causes a test failure) when the suspect rewrite
rule is used.

Since then we have continued to find uses for this approach.
For example, when we added automatic FMA insertion on certain
architectures, some tests started failing. By making the rewrite rule
controlled by a file:line hash pattern, we can pinpoint the exact line
or lines where FMA insertion causes the failure.

As another example, we are considering a small change to the semantics
of variables declared in for loops, along the lines of what C# 5 did
and what JavaScript does in for-let loops. By using a hash pattern to
toggle whether the new semantics applies at a given file:line, we can
identify the specific loops where the semantic change provokes a
test failure.

The binary search does not have to stop at a single failure. With a
richer pattern syntax adding an AND-NOT operator, we can disable the
instances we've found, and if the test still fails, run a new binary
search to find another culprit. (Technically, AND-NOT is not necessary
for doing a complete binary traversal of the search space guided by
single failures, but it is easy, and it is necessary for the next step.)

The binary search can also be adapted to finding pairs or larger
groups of changes, all of which are necessary to provoke a failure.
If suffix B provokes the failure but neither 0B nor 1B do, then
(assuming the test is not flaky!) you start using the pattern “0B OR 1B”,
refine the left half of the expression with binary search
(for example the first step checks “00B OR 1B” versus “10B OR 1B”),
and then having refined the left half, move on to refining the right half.
After refining both, you have an OR of patterns that each match one
change, and all the changes are necessary to provoke the failure.
The splitting naturally recurses as needed to find a locally minimal set
of changes, in the sense that by construction, removing any single
change would not provoke the failure.

Stepping slightly away from compilers temporarily, we have recently
realized that this technique can be adapted to identifying specific
bug-inducing contexts for library functions as well. For example,
suppose we want to introduce a new qsort implementation, but that
causes tests to fail in a large program (maybe even a compiler) that
makes use of qsort in many contexts. Binary search selecting the old
or new algorithm according to a hash of the call stack will quickly
identify the exact call stack affected by the new qsort.

As I mentioned at the top, I am interested to hear about earlier uses
of approaches like this, or any other ideas for situations where the
approach might be applicable. Clearly the general problem has overlaps
with group testing [1], and in particular Hwang's binary search-based
approach (1972) [2]. I've also found one paper describing this
algorithm in the context of identifying which of a large set of
changes between GDB versions caused a crash (1999) [3].
(That paper's version of the algorithm for finding pairs or larger
groups is buggy unless there is only one such group.)

It seems like the idea of binary search to narrow down a buggy
software change is so natural that there must have been many earlier
uses for debugging before 1999, especially in compilers.

Thanks for any references, interesting history, or ideas.

If anyone is interested in code, we've published the binary search tool [4]
and the code that interprets the patterns [5].

Best,
Russ

[1] https://en.wikipedia.org/wiki/Group_testing (a remarkably good article)
[2] https://www.jstor.org/stable/2284447
[3] https://dl.acm.org/doi/10.1145/318774.318946
[4] https://pkg.go.dev/golang.org/x/tools/cmd/bisect
[5] https://pkg.go.dev/golang.org/x/tools/internal/bisect

[toc] | [next] | [standalone]

#3460

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-05-13 03:20 +0000
Message-ID	<23-05-004@comp.compilers>
In reply to	#3459

On 2023-05-12, Russ Cox <rsc@swtch.com> wrote:
> In 2015, Keith Randall was working on a new SSA backend for the Go
> compiler. The old and new backends coexisted, and we could use either
> for any given function in the compilation. Keith introduced an
> environment variable GOSSAHASH that specifies the last few binary
> digits of the hash of function names that should use the new backend.
> So GOSSAHASH=0110 means compile only those functions whose names hash
> to a value with last four bits 0110. When a test is failing with the
> new backend, you try GOSSAHASH=0 and GOSSAHASH=1, and then use binary
> search to progressively refine the pattern, narrowing the failure down
> until only a single function is being compiled with the new backend.
> This was invaluable for approaching failures in large real-world tests
> (tests for libraries or production code, not for the compiler) that we
> had not written and did not understand. ...

> It seems like the idea of binary search to narrow down a buggy
> software change is so natural that there must have been many earlier
> uses for debugging before 1999, especially in compilers.

I have used binary searching to find bugs in the TXR Lisp compiler.

Say for the sake of simplicity that I already have a pretty reliable
idea in which file the miscompiled function resides.

Because the compiler executes the top-level forms that it compiles,
I can go to 50% of that file (literally by typing 50% in Vim).
Then stick a (setf *opt-level* 0) at that line. The bottom half
of the file is then compiled without optimizations. If the bug goes
away, then something is being mis-optimized in the second half.
And so it goes.

Another trick I have used is to suppress the compilation of
specific forms according to this pattern:

  (defun fun (...) ...)   --> (eval '(defun fun () ...))

I.e. take the form, and eval its quoted version:

   form -> (eval 'form)

So the compiler now is compiling nothing more than an eval call which
takes a canned code literal as its argument. When the resulting compiled
file is loaded, fun will be an interpreted function, not a compiled
fucntion: the compiled version of the eval call executes, interpreting
the defun form, resulting in an interpreted function being defined.  If
the bug goes away, that function was being miscompiled.

(This works in Lisp dialects that have a pure interpreter behind eval;
dialects whose eval is actually a compiler may not support this
technique well.)

The hash technique seems nice. If we have no idea where in the
code base the miscompilation is happening, it seems like it could help;
it seems easier than some of the following techniques.

I often have a good idea where in the code base the problem is because
during the rebuild of TXR, files are compiled one by one. As each file
in the stdlib/ is compiled, the next compile job now uses that file (if
it is implied for loading).  So more ofen than not, as soon as we
miscompile some file that used during compilation, the compilation of
the next file will misbehave. Or possibly the same file. The compiler
evaluates what it compiles (unless told not to).  If the compiler is
compiling some code that it's also itself using, it may misbehave soon
after miscompiling some of it, before that file is done compiling.

A useful technique used sometimes is to substitute known-good compiled
files. If the problem goes away if we substitute a known-good
compiled file (touching the timestamps so that it's not clobbered
with a recompile), we can suspect that the file is being miscompiled.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [next] | [standalone]

#3461

From	Fernando <pronesto@gmail.com>
Date	2023-05-13 04:47 -0700
Message-ID	<23-05-005@comp.compilers>
In reply to	#3459

Hi Russ, that's very interesting. You should consider submitting a report of
the technique to CGO! The next deadline is on the 19th, but then there will be
another deadline on September 1st.

As for a related work, see John Regehr's research on test case reduction for
compilers [1].

[1] John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, Xuejun Yang:
Test-case reduction for C compiler bugs. PLDI 2012: 335-346

Regards,

Fernando

[toc] | [prev] | [next] | [standalone]

#3462

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-05-14 02:49 +0000
Message-ID	<23-05-006@comp.compilers>
In reply to	#3461

On 2023-05-13, Fernando <pronesto@gmail.com> wrote:
> Hi Russ, that's very interesting. You should consider submitting a report of
> the technique to CGO! The next deadline is on the 19th, but then there will be
> another deadline on September 1st.

We can summarize it briefly.

Background:

Suppose we have some data processing technique which transforms N
entities. A new version of the technique changes how all the entities
are transformed. One or more of them are transformed erroneously,
resulting in a system regression (one which doesn't readily identify
the bad element). Because all elements are transformed with the new
technique, there are big differences in all of them compared to the
old technique: comparing old and new is futile.

Setup:

We can (somehow) enumerate the N entities with integers, which we
express using a pure binary enumeration. Then, we can discover the
number of an erroneously processed element, one bit at a time.

Procedure:

First we treat all elements numbered ..XXXXXX0 using the new technique
technique, and all others using the the old technique. If the
system fails, we know that it's the ..XXXXX0 set which has one or more
badly processed elements. Otherwise it's the other set, whose
enumerations end in a 1. Either way, we have discovered the identity of
the last bit.

Suppose that discovered bit is 1. We then proceed with the next bit: we
process ..XXXXXX01 entities with the new technique and all others
with the old technique. If the problem reproduces, we have narrowed
to the ...XXXXX01 set, otherwise to the .XXXXX11 set.

In this manner we discover every bit, at which point we have
a useful piece of information: the identity of a misprocessed element.
We can then focus the investigation on how it is misprocessed.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [next] | [standalone]

#3464

From	gah4 <gah4@u.washington.edu>
Date	2023-05-14 13:38 -0700
Message-ID	<23-05-008@comp.compilers>
In reply to	#3462

On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote:

(snip)

> Setup:
>
> We can (somehow) enumerate the N entities with integers, which we
> express using a pure binary enumeration. Then, we can discover the
> number of an erroneously processed element, one bit at a time.
>
> Procedure:
>
> First we treat all elements numbered ..XXXXXX0 using the new technique
> technique, and all others using the the old technique. If the
> system fails, we know that it's the ..XXXXX0 set which has one or more
> badly processed elements. Otherwise it's the other set, whose
> enumerations end in a 1. Either way, we have discovered the identity of
> the last bit.

There is the assumtion that only one of the N is in error.

Reminds me of ECC memory, which can correct one bit errors, and detect
two bit errors. After log(N) tests, you find the one that it is
supposed to be, fix it, and it either works or doesn't.

Then you work on a new set of tests.

Or start out with a more complicated set, which can learn more in one set.

[toc] | [prev] | [next] | [standalone]

#3467

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-05-15 21:52 +0000
Message-ID	<23-05-011@comp.compilers>
In reply to	#3464

On 2023-05-14, gah4 <gah4@u.washington.edu> wrote:
> On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote:
>
> (snip)
>
>> Setup:
>>
>> We can (somehow) enumerate the N entities with integers, which we
>> express using a pure binary enumeration. Then, we can discover the
>> number of an erroneously processed element, one bit at a time.
>>
>> Procedure:
>>
>> First we treat all elements numbered ..XXXXXX0 using the new technique
>> technique, and all others using the the old technique. If the
>> system fails, we know that it's the ..XXXXX0 set which has one or more
>> badly processed elements. Otherwise it's the other set, whose
>> enumerations end in a 1. Either way, we have discovered the identity of
>> the last bit.
>
> There is the assumption that only one of the N is in error.

You might think, but it's not actually a problem with the algorithm.

When we bisect a change history, there is the assumption that the
bug showed up, so there are good commits before that, and bad commits
after. If the history is not like that: the bug disappears and
reappears, then the bisect will not identify *the* bad commit reliably,
because there isn't one.

I have a story like this; but it digresses from my present point, so
I will return to it below.

In this situation, we are bisecting a space of N entities. All we need
is to find one bad one. If there are three bad ones, it doesn't matter
which one we find.

Say we have a basket of apples with a rotten smell emanating from it.
We can subdivide it, and smell one half and the other. If both halves
smell, we know we have two or more rotten apples and they ended up
in different halves. This doesn't matter. We just pick one half and
keep subdividing. As long as we stay on the trail of the rotten scent, we will
get down to one rotten apple, and we can use that apple to analyze further:
what kind of mould or bacterium has infected it and so on. Probably
the other rotten apples have the same root cause. If they have different root
causes, we can do another search after fixing the one root cause we have found.

Now about the story. The conclusion of the story was that there was
a GC-related bug in an ash (arithmetic shift) function, dealing with
heap-allocated bignum integers.

A bignum integer was being prematurely reclaimed by the garbage
collector.

The fix looked like this:

commit 9cf992835c8a0f2e6c4ace07b67fea2acb762cc5
Author: Kaz Kylheku <kaz@kylheku.com>
Date:   Wed Jun 19 23:21:39 2019 -0700

    ash: gc problem.

    On 32 bit x86 Solaris 10, with gcc 4.9.0, this issue caused a
    miscompilation of the pset macro, due to ash abruptly
    returning zero, causing the op-code field of an instruction to
    be NOOP rather than GCALL.

    I'm committing this fix just for reference; I will immediately
    replace it with a refactoring of the function.

    * arith.c (ash): Add a gc_hint to prevent the a bignum from
    being reclaimed if the b = make_bignum() triggers gc.
    This happens in the case when the previous case computes
    a = make_bignum() and falls through.

diff --git a/arith.c b/arith.c
index fdb294b7..f374ddf4 100644
--- a/arith.c
+++ b/arith.c
@@ -3235,6 +3235,7 @@ val ash(val a, val bits)
       b = make_bignum();
       if ((mpe = mp_shift(mp(a), mp(b), bn)) != MP_OKAY)
         break;
+      gc_hint(a);
       return normalize(b);
     default:
       goto bad3;

The problem only showed upon Solaris (first red flag).  When I git bisected it,
it was randomly appearing and disappearing, so git bisect was of no use.

Not all problems introduce themselves in a way that there is a reproducible
test case.

Here, the garbage collector had to go off exactly at the wrong time during the
execution of the ash function. Small perturbations to completely unrelated code
can make it go away.

In the end, I had to debug it on Solaris (no working gdb or anything), using
the repro that I had.

Once I discovered the bad VM instruction, which was affecting the behavior of a
certain Lisp macro in the standard library, I traced it through the compiler
right down to the assembler, where the opcode field for a good assembly
instruction was being miscalculated. I could not reproduce the bad ash
calculation though in isolation.

In summary:

1. non-gc-aware, sloppy coding in ash function caused ...
2. rare problem in assembler's masking and shifting calculations, causing ...
3. Bad instruction when compiling the "pset" (parallel assignment) macro ...
4. Which failed some code relying pset.

All of this reproduced only one one platform, one particular commit.
Small perturbations of Lisp code just about anywhere made it go away.
Bisect was useless.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [next] | [standalone]

#3468

From	gah4 <gah4@u.washington.edu>
Date	2023-05-16 23:52 -0700
Message-ID	<23-05-012@comp.compilers>
In reply to	#3467

On Monday, May 15, 2023 at 6:54:08 PM UTC-7, Kaz Kylheku wrote:

(snip)

> Say we have a basket of apples with a rotten smell emanating from it.
> We can subdivide it, and smell one half and the other. If both halves
> smell, we know we have two or more rotten apples and they ended up
> in different halves. This doesn't matter. We just pick one half and
> keep subdividing.

But the algorithm described, at least as I remember it, doesn't test both
halves.

Sniffing two baskets doesn't take so long, but two tests might.

I was suspecting that there were more efficient ways than doing
both halves, though didn't try to figure out what they might be.

> As long as we stay on the trail of the rotten scent, we will
> get down to one rotten apple, and we can use that apple to analyze further:
> what kind of mould or bacterium has infected it and so on. Probably
> the other rotten apples have the same root cause. If they have different
root
> causes, we can do another search after fixing the one root cause we have
found.

I don't know apple statistics so well.  If you suspect more than one from the
beginning,

As a binary search, it should be 50% probability on each test.
If you see higher than 50% as the tests go on, it might look
suspicious already.

[toc] | [prev] | [next] | [standalone]

#3471

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-05-17 18:28 +0000
Message-ID	<23-05-015@comp.compilers>
In reply to	#3468

On 2023-05-17, gah4 <gah4@u.washington.edu> wrote:
> On Monday, May 15, 2023 at 6:54:08 PM UTC-7, Kaz Kylheku wrote:
>
> (snip)
>
>> Say we have a basket of apples with a rotten smell emanating from it.
>> We can subdivide it, and smell one half and the other. If both halves
>> smell, we know we have two or more rotten apples and they ended up
>> in different halves. This doesn't matter. We just pick one half and
>> keep subdividing.
>
> But the algorithm described, at least as I remember it, doesn't test both
> halves.

If the problem reproducews with the half you're testing, you assume
the misprocessed function is in that half. (There could be a bad
function in the other half also, and you could test that, but it's
unnecessary.)

If the problem deosn't reproduce in the half you're testing, you
assume the misprocessed functions lie in the other half, and
recurse on that half instead.

The algorithm will work even if every single function is miscompiled.
(And will take the same logarithmic number of steps, since we cut
in half no matter what.)

It will arbitrarily narrow it down to one, showing that even if the
compiler change is enabled for just one function, there is a problem.
That function can be investigated. Then maybe then the fix solve the
issue for the five hundred other broken functions, too.

The main problem in that situation is that the function you narrow it
down to may not be the easiest one to investigate. Say that a
three line function is broken, and a 1000 line function is broken;
the binary search finds the 1000 line function, whereas it would be
easier to investigate how the three line function is miscompiled.

There could be the following issue: among several functions that
are miscompiled, there are two: foo and bar. *Both* must be
miscompiled for the problem to reproduce, due to an interaction.
If there is a recursive subdivision such that foo and bar are in
different halves, then the problem doesn't reproduce for either half.

Therefore, when the problem fails to reproduce for the currently
tested half, it may be wise to cross-test the other half. If it also
doesn't reproduce, the halves have to be recombined and some other
approach taken, like perturbing the enumeration so that the partitioning
is done differently, or doing a slow linear elimination.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [next] | [standalone]

#3473

From	gah4 <gah4@u.washington.edu>
Date	2023-05-17 15:23 -0700
Message-ID	<23-05-017@comp.compilers>
In reply to	#3471

On Wednesday, May 17, 2023 at 12:08:27 PM UTC-7, Kaz Kylheku wrote:
> On 2023-05-17, gah4 <ga...@u.washington.edu> wrote:
> > On Monday, May 15, 2023 at 6:54:08 PM UTC-7, Kaz Kylheku wrote:
> >
> > (snip)
> >
> >> Say we have a basket of apples with a rotten smell emanating from it.
> >> We can subdivide it, and smell one half and the other. If both halves
> >> smell, we know we have two or more rotten apples and they ended up
> >> in different halves. This doesn't matter. We just pick one half and
> >> keep subdividing.

> > But the algorithm described, at least as I remember it, doesn't test both
> > halves.

> If the problem reproducews with the half you're testing, you assume
> the misprocessed function is in that half. (There could be a bad
> function in the other half also, and you could test that, but it's
> unnecessary.)
>
> If the problem deosn't reproduce in the half you're testing, you
> assume the misprocessed functions lie in the other half, and
> recurse on that half instead.

I was mostly noting it, as the apple test was written to test
both halves.

> The algorithm will work even if every single function is miscompiled.
> (And will take the same logarithmic number of steps, since we cut
> in half no matter what.)

> It will arbitrarily narrow it down to one, showing that even if the
> compiler change is enabled for just one function, there is a problem.
> That function can be investigated. Then maybe then the fix solve the
> issue for the five hundred other broken functions, too.

As to the problem of debugging by miscompiled code, this reminds
me of stories of debugging compilers by feeding them cards from
the card recycling bin. It seems that cards were especially high grade
paper and so worth keeping separate for recycling.  I never actually
saw anyone do that, though.

In my younger days, some said that I was especially good at finding
compiler errors. Maybe not as good now. I would try things that were
allowed, but others might not have thought about doing.

Only one I remember now, is a C compiler that miscompiled the ++
operator on a double variable.  Allowed in C, but maybe not often used.

[toc] | [prev] | [next] | [standalone]

#3476

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-05-19 03:21 +0000
Message-ID	<23-05-020@comp.compilers>
In reply to	#3473

On 2023-05-17, gah4 <gah4@u.washington.edu> wrote:
> As to the problem of debugging by miscompiled code, this reminds
> me of stories of debugging compilers by feeding them cards from
> the card recycling bin. It seems that cards were especially high grade
> paper and so worth keeping separate for recycling.  I never actually
> saw anyone do that, though.

Obviously, the card recycling bin is going to be a source of garbage
that is invalid inputs to the compiler, with perhaps some valid bits.

This was effectively an early form of fuzzing.

> In my younger days, some said that I was especially good at finding
> compiler errors. Maybe not as good now.

Fuzzing has matured now. There are tools now which instrument the
executable, and monitor what is going on in terms of the program
control flow in response to random inputs. Then they adjust the bits
to try to stimulate paths not taken to tickle bugs. That's the
jist of it.

> I would try things that were
> allowed, but others might not have thought about doing.

I've not used fuzzing in a while ... since 2016 it turns out!  I played with it
in the summer of that year. The software was AFL (American Fuzzy Lop) 2.30b.

I had found three bugs in a short timespan back then.

One of them was exponential memory explosion in nested quasiquoting syntax like
^^^....^^exr. (TXR Lisp's quasiquoting operator is written ^ rather than the
usual backtick ` used in most traditional Lisp dialects, otherwise it is the
same thing).

AFL also discovered that if it puts a huge number into the op syntax like this:
(op list @123455234), the op expander will obligingly generate a lambda with
that many arguments!  E.g. (op list @3) generates something similar to
(lambda (arg0 arg1 arg3) (list arg3)).  So imagine if we replace @3 with a
large integer.

AFL also found a crash in the regex parser. It was not doing a range check on
numeric character escapes, making it possible for the program to sneak a
negatively valued character code into the regex copiler, where it resulted in
an out-of-bounds array access problem.

I have been meaning brush the dust off this technique again.

> Only one I remember now, is a C compiler that miscompiled the ++
> operator on a double variable.  Allowed in C, but maybe not often used.

Oh, stepping doubles with ++ is, I would say, not *that* uncommon.
I doubt that if such a bug were introduced as an easter egg into GCC,
it would go very long without being discovered by the FOSS distros and
other downstream users.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [next] | [standalone]

#3481

From	Thomas Koenig <tkoenig@netcologne.de>
Date	2023-05-19 21:59 +0000
Message-ID	<23-05-025@comp.compilers>
In reply to	#3476

Kaz Kylheku <864-117-4973@kylheku.com> schrieb:

> Oh, stepping doubles with ++ is, I would say, not *that* uncommon.
> I doubt that if such a bug were introduced as an easter egg into GCC,
> it would go very long without being discovered by the FOSS distros and
> other downstream users.

It would very likely be caught by regression testing.  GCC has an
extensive test suite, and it is a requirement that this is run
before submitting a patch.

If that slips through (as can happen from time to time, for
example for different architectures or operating systems), then
the automated regression testers will flag it.

If not that, the automated SPEC testers also have a good chance
of catching it.

As an aside, the post-increment operator is converted to its
equivalent assignment statement very early, during gimplification.

In the following example, the result of the first two passes that
GCC performs are shown - the first is a reproduction of the source
code, as parsed, and the second one after lowering to GIMPLE.
File names may differ bit depending on whic version of GCC you use.

$ cat double.c
double c;

void foo()
{
        c++;
}
$ gcc -c -fdump-tree-original -fdump-tree-gimple double.c
$ cat double.c.004t.original

;; Function foo (null)
;; enabled by -tree-original


{
  c++ ;
}

$ cat double.c.005t.gimple
foo ()
{
  c.0_1 = c;
  _2 = c.0_1 + 1.0e+0;
  c = _2;
}

[toc] | [prev] | [next] | [standalone]

#3483

From	gah4 <gah4@u.washington.edu>
Date	2023-05-20 20:20 -0700
Message-ID	<23-05-027@comp.compilers>
In reply to	#3481

On Saturday, May 20, 2023 at 3:05:13 PM UTC-7, Thomas Koenig wrote:
> Kaz Kylheku <864-11...@kylheku.com> schrieb:
> > Oh, stepping doubles with ++ is, I would say, not *that* uncommon.
> > I doubt that if such a bug were introduced as an easter egg into GCC,
> > it would go very long without being discovered by the FOSS distros and
> > other downstream users.

> It would very likely be caught by regression testing. GCC has an
> extensive test suite, and it is a requirement that this is run
> before submitting a patch.

I believe it was Borland Turbo C, maybe 2.0.

At one point, I had a few different C compilers for MS-DOS, not so long
before I went to using OS/2.  It seems that Borland also had compilers
for OS/2, but I don't remember using those.

[toc] | [prev] | [next] | [standalone]

#3485 — Re: Old C compilers, binary search debugging of compilers

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2023-05-22 09:05 +0200
Subject	Re: Old C compilers, binary search debugging of compilers
Message-ID	<23-05-030@comp.compilers>
In reply to	#3483

On 5/21/23 5:20 AM, gah4 wrote:

> I believe it was Borland Turbo C, maybe 2.0.
>
> At one point, I had a few different C compilers for MS-DOS, not so long
> before I went to using OS/2.  It seems that Borland also had compilers
> for OS/2, but I don't remember using those.

Borland also developed Borland C++ for the Atari ST. The library was not
so perfect, as my decompiler could reveal during beta test. But it was
by far the best C/C++ compiler available for the ST.

DoDi

[toc] | [prev] | [next] | [standalone]

#3480

From	Max B <tekk.nolagi@gmail.com>
Date	2023-05-19 16:31 -0500
Message-ID	<23-05-024@comp.compilers>
In reply to	#3471

We did something similar -- delta debugging -- with the Cinder JIT. Except
we didn't have hashes, but instead lists of function names. It's an
incredibly useful approach.

See more at https://bernsteinbear.com/blog/cinder-jit-bisect/

[toc] | [prev] | [next] | [standalone]

#3463

From	Thomas Koenig <tkoenig@netcologne.de>
Date	2023-05-14 19:59 +0000
Message-ID	<23-05-007@comp.compilers>
In reply to	#3459

Russ Cox <rsc@swtch.com> schrieb:

> As I mentioned at the top, I am interested to hear about earlier uses
> of approaches like this, or any other ideas for situations where the
> approach might be applicable. Clearly the general problem has overlaps
> with group testing [1], and in particular Hwang's binary search-based
> approach (1972) [2].

For GCC regression-hunting, two techniques are routinely used.

One of them is the use of "gcc bisect", described at
https://git-scm.com/docs/git-bisect .  If there is a failing test
case, this can (at the cost of some CPU time) usually pinpoint
offending commit.

For reducing test cases, at least for C and C++, cvise
https://github.com/marxin/cvise is now the mehod of choice; it is
brutally efficient at reducing test cases to an absolute minimum.
Because it can transform C and C++ programs, it can make
transformations of the program which are language-aware.

Both are tools external to the compiler.

[toc] | [prev] | [next] | [standalone]

#3465

From	Cameron McInally <cameron.mcinally@nyu.edu>
Date	2023-05-14 23:28 -0400
Message-ID	<23-05-009@comp.compilers>
In reply to	#3463

On Sun, May 14, 2023 at 22:42 Thomas Koenig <tkoenig@netcologne.de> wrote:

> Russ Cox <rsc@swtch.com> schrieb:
>
> > As I mentioned at the top, I am interested to hear about earlier uses
> > of approaches like this, or any other ideas for situations where the
> > approach might be applicable. Clearly the general problem has overlaps
> > with group testing [1], and in particular Hwang's binary search-based
> > approach (1972) [2].
>
> For GCC regression-hunting, two techniques are routinely used.
>
> One of them is the use of "gcc bisect", described at
> https://git-scm.com/docs/git-bisect
> .  If there is a failing test
> case, this can (at the cost of some CPU time) usually pinpoint
> offending commit.

Ah, this is close to my heart. The story I heard was that automated
regression bisection originated in the Cray compiler group around the
mid 1990s. I didn't join that team until the aughts, but can attest to
the quality of Cray's system, which is still in use today AFAIK.

https://ieeexplore.ieee.org/document/625082

For the general case of reducing runtime errors, the Wolf Fencing algorithm
would be a good thread to pull for uncovering the history.

https://dl.acm.org/doi/abs/10.1145/358690.358695

Hope that helps,
Cam

[toc] | [prev] | [next] | [standalone]

#3466

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-05-15 21:35 +0000
Message-ID	<23-05-010@comp.compilers>
In reply to	#3463

On 2023-05-14, Thomas Koenig <tkoenig@netcologne.de> wrote:
> Russ Cox <rsc@swtch.com> schrieb:
>
>> As I mentioned at the top, I am interested to hear about earlier uses
>> of approaches like this, or any other ideas for situations where the
>> approach might be applicable. Clearly the general problem has overlaps
>> with group testing [1], and in particular Hwang's binary search-based
>> approach (1972) [2].
>
> For GCC regression-hunting, two techniques are routinely used.
>
> One of them is the use of "gcc bisect", described at
> https://git-scm.com/docs/git-bisect .  If there is a failing test
> case, this can (at the cost of some CPU time) usually pinpoint
> offending commit.

The technique described by Russ Cox is applicable when you know what
the offending commit is. E.g. you made some changes, say, to the
optimizer, and something is going wrong when you boostrap the compiler.
Obviously, the optimization changes are breaking something; the prior
commit works. But you don't understand what is breaking where.

>
> For reducing test cases, at least for C and C++, cvise
> https://github.com/marxin/cvise is now the mehod of choice; it is
> brutally efficient at reducing test cases to an absolute minimum.
> Because it can transform C and C++ programs, it can make
> transformations of the program which are language-aware.

This is only applicable if you have an external test case which
reproduces some compiler problem (like a crash) and would like
to make it smaller.

The binary search technique described by Cox is uniquely applicable
in the situation that something goes wrong with the bootstrapping
of a compiler.

1. A bug has been introduced into the compiler source.
   You already know exactly which commit did that, and
   what change is responsible, due to having done the git bisect
   or whatever. Maybe this is just new, uncommited work that is
   breaking compared to the working HEAD.

2. Consequently, the compiler miscompiles something in its own
   source code (anywhere in its source: in the compiler proiper, or
   standard library support routines of that language which are
   used by the compiler.)

3. Consequently, the compiled compiler misbehaves somehow.
   (Or possibly, the boostrapping even fails: compiled code is
   loaded into the compiler's image, and it cannot continue
   the boostrapping process.)

Characterizing what is wrong in (3) is not too difficult (though it can
be). The problem is that the issues exhibited in 3 are in good parts of
the compiler. So they are red herring issues.  If you look at the source
code for those parts, it is fine and has not been touched. Rather, the
object code is incorrect due to the introduced compiler problem.

You might know the commit which introduces the problem, yet be stumped
as to how, due to the complexity of all the pieces and how they
interact, and the difficulty of some of the algorithms. (Plus hidden
problems, like the new work actually being correct, but exposing a bug
in existing code; so staring at just the new code until you are blue
does no good.)

Cox's trick lets you turn on the new, known-bad code only for a
seletected subset of functions (or whatever logical units of the image).

You can iteratively shrink the subset until it contains only one
mistranslated unit.

Then you know---aha!--the bad optimizer change is, say, mistranslating
the partition function in stdlib/quicksort.mylang. If just that function
is treated with the buggy new change, and nothing else, the show grinds
to a halt.

So now you know you can focus your investigation on what the buggy
compiler change is doing to the partition function in
stdlib/quicksort.mylang.

You can write test cases exercising that function to understand
how its behavior has gone wrong, and study the object code to
see how it's organized to bring about the wrong behavior, and how
that relates the bad new code in the compiler.

You can tweak the code, and build the image with just that partition
function being processed with the new code to see whether the regression
has gone away, and iterate on that. Then try boostrapping the entire
image with the repaired code.

Etc.

The bug can show up as a mistranslation of anything, which is why I
picked the partition helper of quicksort example. That could wreck the
compiler's behavior, if some logic in it somemwhere depends on
sorting something. You could really be on a wild goose chase trying
to debug *that*, rather than it completely unrelated root cause.

> Both are tools external to the compiler.

Right; so not applicable to this internal technique where basically we
subdivide the compiler's own image into "parts compiled with the
known-good compiler" and "parts compiled with the new, known-bad change
in effect", where the parts are on a finer granularity than files.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [standalone]

csiph-web

binary search debugging of compilers

Contents

#3459 — binary search debugging of compilers

#3460

#3461

#3462

#3464

#3467

#3468

#3471

#3473

#3476

#3481

#3483

#3485 — Re: Old C compilers, binary search debugging of compilers

#3480

#3463

#3465

#3466