Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Kaz Kylheku <864-117-4973@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: binary search debugging of compilers
Date: Mon, 15 May 2023 21:52:23 -0000 (UTC)
Organization: A noiseless patient Spider
Sender: johnl@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <23-05-011@comp.compilers>
References: <23-05-003@comp.compilers> <23-05-005@comp.compilers> <23-05-006@comp.compilers> <23-05-008@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="21354"; mail-complaints-to="abuse@iecc.com"
Keywords: debug, tools
Posted-Date: 15 May 2023 21:54:04 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3467

On 2023-05-14, gah4 <gah4@u.washington.edu> wrote:
> On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote:
>
> (snip)
>
>> Setup:
>>
>> We can (somehow) enumerate the N entities with integers, which we
>> express using a pure binary enumeration. Then, we can discover the
>> number of an erroneously processed element, one bit at a time.
>>
>> Procedure:
>>
>> First we treat all elements numbered ..XXXXXX0 using the new technique
>> technique, and all others using the the old technique. If the
>> system fails, we know that it's the ..XXXXX0 set which has one or more
>> badly processed elements. Otherwise it's the other set, whose
>> enumerations end in a 1. Either way, we have discovered the identity of
>> the last bit.
>
> There is the assumption that only one of the N is in error.

You might think, but it's not actually a problem with the algorithm.

When we bisect a change history, there is the assumption that the
bug showed up, so there are good commits before that, and bad commits
after. If the history is not like that: the bug disappears and
reappears, then the bisect will not identify *the* bad commit reliably,
because there isn't one.

I have a story like this; but it digresses from my present point, so
I will return to it below.

In this situation, we are bisecting a space of N entities. All we need
is to find one bad one. If there are three bad ones, it doesn't matter
which one we find.

Say we have a basket of apples with a rotten smell emanating from it.
We can subdivide it, and smell one half and the other. If both halves
smell, we know we have two or more rotten apples and they ended up
in different halves. This doesn't matter. We just pick one half and
keep subdividing. As long as we stay on the trail of the rotten scent, we will
get down to one rotten apple, and we can use that apple to analyze further:
what kind of mould or bacterium has infected it and so on. Probably
the other rotten apples have the same root cause. If they have different root
causes, we can do another search after fixing the one root cause we have found.

Now about the story. The conclusion of the story was that there was
a GC-related bug in an ash (arithmetic shift) function, dealing with
heap-allocated bignum integers.

A bignum integer was being prematurely reclaimed by the garbage
collector.

The fix looked like this:

commit 9cf992835c8a0f2e6c4ace07b67fea2acb762cc5
Author: Kaz Kylheku <kaz@kylheku.com>
Date:   Wed Jun 19 23:21:39 2019 -0700

    ash: gc problem.

    On 32 bit x86 Solaris 10, with gcc 4.9.0, this issue caused a
    miscompilation of the pset macro, due to ash abruptly
    returning zero, causing the op-code field of an instruction to
    be NOOP rather than GCALL.

    I'm committing this fix just for reference; I will immediately
    replace it with a refactoring of the function.

    * arith.c (ash): Add a gc_hint to prevent the a bignum from
    being reclaimed if the b = make_bignum() triggers gc.
    This happens in the case when the previous case computes
    a = make_bignum() and falls through.

diff --git a/arith.c b/arith.c
index fdb294b7..f374ddf4 100644
--- a/arith.c
+++ b/arith.c
@@ -3235,6 +3235,7 @@ val ash(val a, val bits)
       b = make_bignum();
       if ((mpe = mp_shift(mp(a), mp(b), bn)) != MP_OKAY)
         break;
+      gc_hint(a);
       return normalize(b);
     default:
       goto bad3;

The problem only showed upon Solaris (first red flag).  When I git bisected it,
it was randomly appearing and disappearing, so git bisect was of no use.

Not all problems introduce themselves in a way that there is a reproducible
test case.

Here, the garbage collector had to go off exactly at the wrong time during the
execution of the ash function. Small perturbations to completely unrelated code
can make it go away.

In the end, I had to debug it on Solaris (no working gdb or anything), using
the repro that I had.

Once I discovered the bad VM instruction, which was affecting the behavior of a
certain Lisp macro in the standard library, I traced it through the compiler
right down to the assembler, where the opcode field for a good assembly
instruction was being miscalculated. I could not reproduce the bad ash
calculation though in isolation.

In summary:

1. non-gc-aware, sloppy coding in ash function caused ...
2. rare problem in assembler's masking and shifting calculations, causing ...
3. Bad instruction when compiling the "pset" (parallel assignment) macro ...
4. Which failed some code relying pset.

All of this reproduced only one one platform, one particular commit.
Small perturbations of Lisp code just about anywhere made it go away.
Bisect was useless.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca