From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.compilers
Subject: Re: Undefined behaviour, was: for or against equality
Date: Fri, 7 Jan 2022 15:56:22 +0100
Organization: Compilers Central
Lines: 150
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-01-029@comp.compilers>
References: <17d70d74-1cf1-cc41-6b38-c0b307aeb35a@gkc.org.uk> <22-01-016@comp.compilers> <22-01-018@comp.compilers> <7f4f52f2-49ee-9e80-1f03-c3fb9c74f574@gkc.org.uk>
Keywords: standards, semantics
Posted-Date: 07 Jan 2022 20:27:05 EST
Content-Language: en-GB
In-Reply-To: <7f4f52f2-49ee-9e80-1f03-c3fb9c74f574@gkc.org.uk>

On 07/01/2022 15:02, Martin Ward wrote:
> On 06/01/2022 08:11, David Brown wrote:
>> The trick is to memorize the /defined/ behaviours, and stick to them.
>
> Isn't the set of defined behaviours bigger than the set
> of undefined behaviours? How do you know what is defined
> if you don't know what is undefined?

You know what is "defined" because you can find the definition for it -
everything else is undefined.  You could enumerate all defined
behaviours for a language - after all, the documentation (language
standards, compiler manual, library documentation, etc.) is finite.  It
doesn't really make sense to try to find how many undefined behaviours
there are - it's like asking how many things there are that are not
apples.

Language standards tell you the defined behaviour for a language.
Anything that is not there, is undefined - that's simply what the word
"undefined" means.

Note that there are many other things besides language standards that
define behaviour of code in practice - compilers or interpreters can add
their own definitions to things that are not defined by the language
standards, as can additional standards such as POSIX.

If you write a function "foo" - perhaps written in the same language
(such as C), perhaps in a completely different language - then its
behaviour is not defined by the language standards.  It is not mentioned
anywhere in those documents, so it is undefined.  (That is different
from functions whose behaviour is specified in the standard, such as
"memcpy".)

Undefined behaviour, as far as language standards are concerned, is
omnipresent in programming - for all languages.  The problem only comes
when you attempt to execute something that does not have its behaviour
defined /anywhere/.  Then it is incorrect code - a bug.


When I learned to program (i.e., during my university education rather
than from books, magazines and trial and error previous to that), we
were very clear about how a function is specified.  You have a
pre-condition and a post-condition.  The function can assume the
pre-condition is logically "true", and it will guarantee that the
post-condition is true at the exit.  (Typically you also have an
"invariant" that is a clause in both parts, but that is just for
convenience.)  If the function is called when the pre-condition is
false, the function has no obligation to do anything - it can give an
error, launch nasal daemons, give the answer it thinks the programmer
hoped for, or anything else.  The behaviour is undefined.
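
As a small illustration of such a specification, here is a sketch of a
search function with its pre-condition and post-condition written out.
The name "find_sorted" and the choice of binary search are mine, purely
for the example.  If the caller passes an unsorted array, the
pre-condition is false and nothing is promised about the result.

```c
#include <stddef.h>

/*
 * Pre-condition:  a points to n ints sorted in ascending order
 *                 (n may be 0).
 * Post-condition: returns the index of some element equal to key,
 *                 or n if no element equals key.
 * If the array is not sorted, the pre-condition is false and the
 * function promises nothing about its result.
 */
size_t find_sorted(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;          /* search the half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else if (a[mid] > key)
            hi = mid;
        else
            return mid;
    }
    return n;                       /* not found */
}
```

Note that the function does not check that the array is sorted; doing
so would cost more than the search itself.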

This concept has existed since the dawn of programming:

"""
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into
the machine wrong figures, will the right answers come out?' I am not
able rightly to apprehend the kind of confusion of ideas that could
provoke such a question.

Charles Babbage
"""


The C standards contain a fair number of explicit undefined behaviours.
They do that for convenience and clarity, and often to encourage
compiler developers towards greater efficiency rather than run-time
checks, and to encourage programmers towards not assuming particular
behaviours even if one compiler happens to define the behaviour.  So a
compiler writer knows that they can assume "a + b" never overflows (for
integer arithmetic), and a programmer knows that they can't assume
signed arithmetic is wrapping even if the compiler they are using at the
time /guarantees/ wrapping behaviour.  (I have never seen a C compiler
that guarantees this without explicit flags.)
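
A common way to picture the compiler's side of this is the sketch
below.  The function names are hypothetical.  Whether a particular
compiler actually performs the simplification depends on the compiler
and its optimisation settings; the point is only that it is allowed to.

```c
#include <limits.h>

/* For every x where x + 1 is defined (x < INT_MAX) the result is 1.
 * For x == INT_MAX the addition overflows, which C leaves undefined,
 * so an optimiser is allowed to reduce the whole body to "return 1". */
int next_is_greater(int x)
{
    return x + 1 > x;
}

/* A version whose behaviour is defined for every int value:
 * it asks the question without performing the addition. */
int next_is_greater_checked(int x)
{
    return x < INT_MAX;
}
```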

C is a language that expects the programmer to take responsibility for
his or her code, and ensure that it is correct.  Fortunately, good
compiler developers know this is difficult and provide tools to help
people find their bugs.  Thus you have a language that can give
efficient results, /and/ provide good debugging and run-time checking,
as long as you get good tools and understand how to use them.


>
> For example, a = b + c is precisely defined in C and C++ for
> floating point variables, but the result can be "undefined behaviour"
> for ordinary 32 bit signed integer values.
>

Actually, it is not precisely defined for floating point operations - if
there is an "exceptional condition" during the evaluation (the result is
not mathematically defined or not in the range of representable values
for its type), the behaviour is undefined.  That applies to all
expressions - integer and floating point.

Now, it is very common (but certainly not universal) for C
implementations to use IEEE floating point formats and rules.  These
provide the "mathematical definitions" for floating point operations,
including handling of calculations outside the normal ranges.  But if
you are not using these, such calculations could result in undefined
behaviour.  (For example, if you use "gcc -ffast-math", the compiler
will assume that all expressions are normal finite numbers - that's
perfectly valid for C, and can be very much more efficient on a lot of
targets.)
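
One way to see the difference is to ask the implementation whether it
claims IEEE (IEC 60559) arithmetic before relying on overflow to
infinity.  This is only a sketch: the function name is made up, and it
assumes the standard macro __STDC_IEC_559__ is a reliable indicator on
your toolchain, which is not always the case.

```c
#include <float.h>

/* Returns 1 if doubling DBL_MAX produced a value larger than DBL_MAX
 * (that is, an infinity), 0 if it did not, and -1 if the implementation
 * does not claim IEC 60559 conformance, in which case the overflowing
 * multiplication is not evaluated at all. */
int overflow_gives_infinity(void)
{
#ifdef __STDC_IEC_559__
    volatile double big = DBL_MAX;  /* volatile: keep the multiply at run time */
    double r = big * 2.0;
    return r > DBL_MAX;
#else
    return -1;
#endif
}
```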

Signed integer overflow is undefined behaviour on most compilers (the
size is not necessarily 32-bit).  The only one I know that defines the
behaviour is gcc (and compatibles, such as clang and icc) with the
"-fwrapv" flag enabled.

And of course that makes perfect sense.  It is logical to assume that if
you add two positive numbers, you get a positive number - it is
illogical to suppose that sometimes the "correct" answer will be
negative.  Some programming languages (such as Java) specifically define
signed integer arithmetic to be wrapping - the result is that sometimes
you get the wrong answer in Java, while in C you would get undefined
behaviour.  Wrong answers are less helpful - leaving the behaviour
undefined means you get more efficient code and that you can use
debugging tools (such as gcc's -fsanitize=undefined) to help find the
errors in your code.


> If you want to stick to defined behaviours then you need
> to add extra code. For example, CERT recommends:
>
>    if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
>        ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
>      /* Handle error */
>    } else {
>      sum = si_a + si_b;
>    }
>

That is /not/ code to "stick to defined behaviours".  It is code to
identify problems and perhaps find some way to handle them (depending on
what the "handle error" code is).
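
To make the role of that check concrete, here is the quoted test
wrapped in a function that reports the problem to its caller.  The name
"checked_add" and the bool return convention are my own choices for the
sketch; what the caller does on failure is still the real question.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true and stores si_a + si_b in *sum if the sum fits in an int.
 * Returns false and leaves *sum untouched if it would overflow.
 * The test is the one from the quoted CERT example: it is evaluated
 * before the addition, so the addition is never executed on values
 * that would overflow. */
bool checked_add(int si_a, int si_b, int *sum)
{
    if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
        ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
        return false;
    }
    *sum = si_a + si_b;
    return true;
}
```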

You can "stick to defined behaviour" much more simply:

	int sum = (unsigned int) si_a + (unsigned int) si_b;

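The addition is fully defined, because unsigned arithmetic wraps.  The
conversion of an out-of-range result back to int is
implementation-defined rather than undefined; gcc and compatibles define
it as wrapping.  The result will be wrong if there is an overflow - just
like when you use a language that has fully defined signed integer
arithmetic by wrapping.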
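
Written as a function, the same idea looks like the sketch below.  The
name "wrapping_add" is hypothetical, and the final conversion assumes a
gcc-compatible compiler.

```c
#include <limits.h>

/* Adds two ints with wrap-around instead of undefined behaviour.
 * The unsigned addition wraps modulo UINT_MAX + 1 by definition.
 * Converting an out-of-range unsigned value back to int is
 * implementation-defined; gcc and compatibles reduce it modulo 2^N,
 * which gives two's complement wrapping.  That is an assumption
 * about the compiler, not a guarantee of the C standard. */
int wrapping_add(int si_a, int si_b)
{
    unsigned int u = (unsigned int) si_a + (unsigned int) si_b;
    return (int) u;
}
```

With such a compiler, wrapping_add(INT_MAX, 1) gives INT_MIN: a defined
answer, but still not the mathematical sum.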


The answer here is /not/ to worry about what happens when your
expressions overflow and you get undefined behaviour.  The answer is to
think about the code you are writing, and make sure that the types and
expressions you write are appropriate for the values you have.  Check
your values for validity when you get them in (from files, user input,
etc.), then write code that is correct for the full range of values.
Simple.  (Well, as simple as any programming!)
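
As a rough sketch of that approach, the code below checks a value once
at the point where it enters the program, and then does arithmetic that
is correct for every value that passed the check.  The limits MAX_ITEMS
and MAX_PRICE and both function names are invented for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_ITEMS 1000      /* hypothetical limits for this example */
#define MAX_PRICE 100000

/* Parse a decimal integer and accept it only if it lies in [lo, hi].
 * Rejects empty strings, trailing junk and values strtol cannot hold. */
bool parse_in_range(const char *text, long lo, long hi, long *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return false;
    if (v < lo || v > hi)
        return false;
    *out = v;
    return true;
}

/* Pre-condition: 0 <= items <= MAX_ITEMS and 0 <= price <= MAX_PRICE.
 * The product is then at most 100,000,000, which fits in a long
 * (guaranteed to hold at least 2,147,483,647), so no overflow check
 * is needed here. */
long total_cost(long items, long price)
{
    return items * price;
}
```

The range check happens once, in parse_in_range; total_cost simply
relies on its pre-condition.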