Path: csiph.com!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.lang.c
Subject: Re: portable way to get highest bit set?
Date: Mon, 13 Nov 2023 07:06:37 -0800
Organization: A noiseless patient Spider
Lines: 189
Message-ID: <86a5rhzo8y.fsf@linuxsc.com>
References: <ug5gvh$1jkar$3@dont-email.me> <20231011102714.44a870af4dfe68f756974953@g{oogle}mail.com> <ug6huc$1rvp1$1@dont-email.me> <86h6mxawqq.fsf@linuxsc.com> <20231012111100.272c96b3209baad26a150e55@g{oogle}mail.com> <86cyxkb2ka.fsf@linuxsc.com> <20231012141719.99f5a10ec921db3ee6f7d948@g{oogle}mail.com> <864jiwaqic.fsf@linuxsc.com> <20231012173021.0000149c@yahoo.com> <86v8bbanjv.fsf@linuxsc.com> <20231013021504.12623444fc7d7fdaab87f1e0@gmail.moc> <86mswm9mwk.fsf@linuxsc.com> <20231014015035.a51cbb621de8eea5ac6a8651@gmail.moc> <20231017183345.26e2afd8641d5c63f8e8e630@gmail.moc> <86o7gv5bk1.fsf@linuxsc.com> <20231019122318.424cc40a22971a045ef11be5@gmail.moc>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="3143d63c853d96637c76865c2342b4ed"; logging-data="744437"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+x4/C/fQ92TLJiJqXiw0+Alf39S6kW5PY="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:oyDh89XbjT/A1whtuQ4K0Uk4Qq8= sha1:Zy8j8MOey93z0aYavAlTmF2AEsA=
Xref: csiph.com comp.lang.c:379502

Anton Shepelev <anton.txt@gmail.moc> writes:

> Tim Rentsch to Anton Shepelev:
> [earlier version removed]
>
>> I would like to take any code posted and incorporate it
>> into my little test bench here so I can test it and see
>> how it does, but having to go through this code and change
>> all the 'unsigned long's and so forth makes that a big
>> pain in the ass.
>
> OK:
>
> T
> high_bit_mask( T n )
> {  const T   M    =  (T)1 << (CHAR_BIT * sizeof( T ) - 1);
>          T   mask =  0;
>          int l    = -1,
>              r    =  CHAR_BIT * sizeof( T );
>
>    do
>    {  const int m = (l + r)/2;
>             T   C = M >> m;
>
>       if( n >= C ) { r = m; mask = C; }
>       else         { l = m;           }
>    } while( r - l > 1 );
>
>    return mask;
> }
>
>> The following binary search code is simpler and runs faster.
>>
>>     T
>>     hbs_binary( T u ){
>>         const T one = 1, v = u>>1;
>>         unsigned least = 0, limit = MASK_WIDTH( -one );
>>
>>         do {
>>             const unsigned k = least+limit >> 1;
>>             const T bit = one << k;
>>
>>             /**/ if(  bit <= v  )  least = k+1;
>>             else if(  bit >  u  )  limit = k;
>>             else                   return  bit;
>>
>>         } while(  limit > 0  );
>>
>>         return  0;
>>     }
>
> I disagree about the higher simplicity of your code,

I have responses to your comments below, followed by some more
general remarks about your rendition.

> because it
>
>   1.  has two return statements,

Having two return statements makes the code easier to understand,
not harder.  During the course of the loop if the answer is found
it is returned immediately.  It's easy to see from the code that
the right value is being returned;  just immediately returning it
makes it easier to reason about what else is going on.

General comment:  a dogmatic aversion to having multiple returns
is wrongheaded, especially in pure functions like this one.


>   2.  employs the /clever trick/ of comparing the mask to
>       different values in different branches.

It isn't particularly clever, and it certainly isn't a trick.
It's easy to see that, if

    u>>1 < bit  &&  bit <= u

then bit is the value we are looking for, so we're done.  Outside
of that range the value is either too high or too low, and the
boundary markers are adjusted accordingly.


>   3.  uses a three-way `if'.

There are three branches because the interval subdivides naturally
into three parts.  The problem here is unlike a general binary
search in that we know there is exactly one solution, and it will
be an exact match to one of the generated values.  Searching for a
value in an array is different because, one, we aren't sure in
general that the value we're looking for will be there, and two,
the array might contain duplicate values.  Neither of those apply
here, so the three-way choice is natural.  In a sense we are
deciding what to do depending on whether the generated bit value
is less than, greater than, or equal to, the right value.  By
contrast the two-way branch is either greater than or {less than
or maybe equal} (or maybe the other way around; it isn't easy to
tell).  The three-way test is a more natural fit, and easier to
reason about for determining correctness.


>   4.  uses conditions that are not mutually exclusive.

Given an argument value 'u' and a generated test value 'bit',
exactly one of

       bit <= u<<1

              u<<1 < bit  &&  bit <= u

                                     u < bit

is true.  These conditions correspond directly to the three
branches of the if-else chain, and they are mutually exclusive.
So either your statement is wrong or I don't know what you
mean.


>   5.  uses `least' and `limit' instead of intuitive `left'
>       and `right' or `min' and `max'.  My version makes a
>       point of giving its left and right indices the
>       intuitive meaning that they have in the standard
>       positional notation.

Pshaw.  The comparison operators in C are less than and greater
than, not left-of and right-of.  Calling the end markers left and
right means having to mentally translate from a left/right
coordinate system to the less-than/greater-then numeric coordinate
system that C uses.  Another problem with left/right is that they
seem to be symmetric.  Generally reasoning about binary search is
easier if a half-open interval is used (which indeed is what my
code does).  The name 'limit' is chosen to indicate which boundary
marker corresponds to the half-open end, and 'least' for the
closed end (i.e., the 'least' value can be possible, whereas the
'limit' value never is).

I would never use either 'left' and 'right' or 'max' and 'min' for
names of the boundary markers in a binary search, because they are
either confusingly or misleadingly evocative.  Also I think of
'left' as "bigger", because digits more leftward represent larger
values.  For me your 'left' and 'right' are backwards.


> My version, however, is still not a plain-vanilla binary search.

It's a different problem, not like a typical binary search.

For me "simpler" usually means "easier to convince myself that it
works."   There are several things about the code you posted that
cause it to take more mental effort for me to understand:

    name choices that make it harder to see how the code works

    confusingly shifting a top bit down rather than a low bit
    up;  shifting a low bit up is easier because then the
    generated value (what I call bit) goes in the same direction
    as the quantity used to produce it

    choice of 'int' for the boundary markers, including one
    initialized to a negative number.  Thinking of the boundary
    markers as holding shift values on '1' is natural, and it's
    easy to see that the initial values are right:  the least
    shift is zero, and the width of the argument type is one
    more than the largest shift allowed.  Shifting a high-order
    bit means I have to think harder about the calculation to
    produce the high-order bit, and more mental effort is 
    needed to translate between the shift values and the
    generated bit value, because numerically they go in opposite
    directions, making the relationship more complicated (and
    also different for different width types).  Seeing an 'int'
    used as a shift amount, with one of the boundary markers
    initialized with a negative value, makes me nervous, because
    I need to think carefully to be sure there is never a shift
    with a negative shift amount.  Using an unsigned type for
    the shift amounts gets rid of that problem.

Probably my biggest complaint is that to see how the code is
working I need to keep track of the state changes to 'mask' to see
that it has the right value.  This property could be a consequence
of avoiding multiple returns:  if there is a rule against having
more than one return, that can cause one or more of (a) the code
needing to be altered to avoid doing an early return, (b) having
to use some kind of intra-function jump (goto or break), or (c)
needing a variable to hold the return value, so it can be returned
by the one 'return' statement.  The code you posted has such a
variable, and that results in more mental effort needed to see
whether the function works.  Needing that extra variable is a huge
cost in terms of reasoning about the code, and that by itself is
enough for me see my code as simpler.