Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Kaz Kylheku <864-117-4973@kylheku.com> Newsgroups: comp.compilers Subject: Re: binary search debugging of compilers Date: Mon, 15 May 2023 21:52:23 -0000 (UTC) Organization: A noiseless patient Spider Sender: johnl@iecc.com Approved: comp.compilers@iecc.com Message-ID: <23-05-011@comp.compilers> References: <23-05-003@comp.compilers> <23-05-005@comp.compilers> <23-05-006@comp.compilers> <23-05-008@comp.compilers> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="21354"; mail-complaints-to="abuse@iecc.com" Keywords: debug, tools Posted-Date: 15 May 2023 21:54:04 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3467 On 2023-05-14, gah4 wrote: > On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote: > > (snip) > >> Setup: >> >> We can (somehow) enumerate the N entities with integers, which we >> express using a pure binary enumeration. Then, we can discover the >> number of an erroneously processed element, one bit at a time. >> >> Procedure: >> >> First we treat all elements numbered ..XXXXXX0 using the new technique >> technique, and all others using the the old technique. If the >> system fails, we know that it's the ..XXXXX0 set which has one or more >> badly processed elements. Otherwise it's the other set, whose >> enumerations end in a 1. Either way, we have discovered the identity of >> the last bit. > > There is the assumption that only one of the N is in error. You might think, but it's not actually a problem with the algorithm. When we bisect a change history, there is the assumption that the bug showed up, so there are good commits before that, and bad commits after. If the history is not like that: the bug disappears and reappears, then the bisect will not identify *the* bad commit reliably, because there isn't one. I have a story like this; but it digresses from my present point, so I will return to it below. In this situation, we are bisecting a space of N entities. All we need is to find one bad one. If there are three bad ones, it doesn't matter which one we find. Say we have a basket of apples with a rotten smell emanating from it. We can subdivide it, and smell one half and the other. If both halves smell, we know we have two or more rotten apples and they ended up in different halves. This doesn't matter. We just pick one half and keep subdividing. As long as we stay on the trail of the rotten scent, we will get down to one rotten apple, and we can use that apple to analyze further: what kind of mould or bacterium has infected it and so on. Probably the other rotten apples have the same root cause. If they have different root causes, we can do another search after fixing the one root cause we have found. Now about the story. The conclusion of the story was that there was a GC-related bug in an ash (arithmetic shift) function, dealing with heap-allocated bignum integers. A bignum integer was being prematurely reclaimed by the garbage collector. The fix looked like this: commit 9cf992835c8a0f2e6c4ace07b67fea2acb762cc5 Author: Kaz Kylheku Date: Wed Jun 19 23:21:39 2019 -0700 ash: gc problem. On 32 bit x86 Solaris 10, with gcc 4.9.0, this issue caused a miscompilation of the pset macro, due to ash abruptly returning zero, causing the op-code field of an instruction to be NOOP rather than GCALL. I'm committing this fix just for reference; I will immediately replace it with a refactoring of the function. * arith.c (ash): Add a gc_hint to prevent the a bignum from being reclaimed if the b = make_bignum() triggers gc. This happens in the case when the previous case computes a = make_bignum() and falls through. diff --git a/arith.c b/arith.c index fdb294b7..f374ddf4 100644 --- a/arith.c +++ b/arith.c @@ -3235,6 +3235,7 @@ val ash(val a, val bits) b = make_bignum(); if ((mpe = mp_shift(mp(a), mp(b), bn)) != MP_OKAY) break; + gc_hint(a); return normalize(b); default: goto bad3; The problem only showed upon Solaris (first red flag). When I git bisected it, it was randomly appearing and disappearing, so git bisect was of no use. Not all problems introduce themselves in a way that there is a reproducible test case. Here, the garbage collector had to go off exactly at the wrong time during the execution of the ash function. Small perturbations to completely unrelated code can make it go away. In the end, I had to debug it on Solaris (no working gdb or anything), using the repro that I had. Once I discovered the bad VM instruction, which was affecting the behavior of a certain Lisp macro in the standard library, I traced it through the compiler right down to the assembler, where the opcode field for a good assembly instruction was being miscalculated. I could not reproduce the bad ash calculation though in isolation. In summary: 1. non-gc-aware, sloppy coding in ash function caused ... 2. rare problem in assembler's masking and shifting calculations, causing ... 3. Bad instruction when compiling the "pset" (parallel assignment) macro ... 4. Which failed some code relying pset. All of this reproduced only one one platform, one particular commit. Small perturbations of Lisp code just about anywhere made it go away. Bisect was useless. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal Mastodon: @Kazinator@mstdn.ca