Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: David Brown
Newsgroups: comp.compilers
Subject: Re: Undefined Behavior Optimizations in C
Date: Tue, 10 Jan 2023 17:32:28 +0100
Organization: A noiseless patient Spider
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <23-01-032@comp.compilers>
References: <23-01-009@comp.compilers> <23-01-011@comp.compilers> <23-01-012@comp.compilers> <23-01-017@comp.compilers> <23-01-027@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="79420"; mail-complaints-to="abuse@iecc.com"
Keywords: C, optimize
Posted-Date: 10 Jan 2023 17:00:54 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3300

On 09/01/2023 11:14, Kaz Kylheku wrote:
> On 2023-01-06, David Brown wrote:
>> On 06/01/2023 01:22, gah4 wrote:
>>> Most important when debugging, is that you can trust the compiler to
>>> do what you said. That they don't, has always been part of
>>> optimization, but these UB make it worse.
>>
>> The trouble with undefined behaviour is that, in general, you cannot
>> trust the compiler to "do what you say" because it cannot know what
>> you have said.
>
> That's correct; however, what we should be able to trust is for the
> compiler not to make it worse.

Let's be clear here. At the theoretical level, when your code executes
undefined behaviour at run time, you have said "I don't care what
happens now". The compiler is allowed to do /anything/. There is no
such thing as "making it worse" - you have said you will accept
absolutely /anything/ from the compiler. No restrictions, no
limitations - /anything/. That is what "undefined behaviour" means.
There are /no/ definitions of the behaviour, and no limits, no
expectations, no "worse" choices or "better" choices.

At a /practical/ level, compilers can try to be more helpful. They can
define behaviour for things that are not defined in the language
specifications. They can give you warnings or error messages (at
compile time or run time, such as when using sanitizers). They can
offer optional extra semantics (like gcc's "-fwrapv" flag). They can
certainly avoid going out of their way to be specially awkward -
normally they are simply trying to make the code more efficient when
it is running correctly. But you cannot be sure things won't go
unexpectedly wrong. You can never have guarantees about that.
Guarantees require definitions, and you don't have those with
undefined behaviour.

Consider this example source code:

void print_message(bool x) {
    if (x) {
        printf("Great!\n");
    } else {
        printf("Boo!\n");
    }
}

void format_harddisk(int disk_number) {
    ...
}

This is fine code, with no UB in sight. The compiler could compile
this to the pseudo-code:

print_message:
    if ($GPR1 == 0) :
        $GPR1 = "Boo!"
        jump puts
    if ($GPR1 == 1) :
        $GPR1 = "Great!"
        jump puts
format_harddisk:
    ...

It knows that on entry, the parameter in GPR1 is either 0 or 1,
because it is of type "bool". What happens if you call the function
from a different translation unit like this:

extern void print_message(int x);
print_message(2);

? This is clearly wrong - clearly undefined behaviour. And the result
would be a formatted disk. But there is nothing wrong with the
compiler's generated code.

I have seen other occasions when compilers have generated code where
booleans appear to be both true and false, or neither true nor false,
as a result of undefined behaviour setting the underlying memory to
something other than 0 or 1, simply because that was the result of the
most efficient code.
There are all sorts of other ways a compiler can take advantage of
knowing a boolean is either 0 or 1, and all sorts of things that can
go wrong if you've messed up your code and the boolean is /not/ 0 or
1. It can use it as an index into an array, use it to multiply a value
(translating "x = f ? a : b;" into "x = f * (a - b) + b;"), use it as
a bit mask, or subtract 1 and use /that/ as a mask. And once you have
started getting wrong values, there is no theoretical limit to how bad
things can get - it is not the compiler "making things worse", it is
bugs in the code putting the program outside the specifications of
what was intended.

Of course the compiler should not be messing around checking that the
value you pass is actually 0 or 1. That would be inefficient. And what
would it do if it found a bad value? Treat it as "false"? What about
calling "missile_control(bool cancel_launch)" with a bad value? Should
it treat it as "true"? What about calling
"missile_control(bool confirm_launch)" with a bad value? Jump to an
error handler and stop the program with a message? How does that work
for your flight controller?

If you want a hand-holding, check-everything language, there are
plenty of them to go around. They're great, and far more appropriate
than C for many purposes. C, on the other hand, trusts the programmer
and gives you the most efficient object code it can from the source
you give it, based on the rules of the language (plus any extra
features defined by the compiler).

> The compiler makes it worse when it assumes that the programmer is
> infallible, and thus makes logical inferences predicated on some
> construct never having undefined behavior, using those to guide the
> translation of other constructs.

It is the programmers' /job/ to write correct code. It is up to them
to use any and all tools available to do that - static error checking,
run-time sanitizers, code reviews, unit tests, system tests, etc.
That also includes using the right language for the task in hand. If
the code is critical, but too complex to be written bug-free in C,
then maybe a different language would be better - or maybe a different
programmer.

I am not suggesting that I always write bug-free code. But the bugs I
put in my code are /my/ bugs - /my/ responsibility. And I don't blame
the compiler for the effects of those bugs.

> This is particularly harmful when the undefined behavior is *de facto*
> defined: like that if the undefinedness aspect is ignored, and the
> obvious code is emitted, that code will do something characteristic
> of the machine.

It is particularly harmful when programmers think there is such a
thing as "de facto defined". That's an oxymoron. If the behaviour is
defined, it is defined. If it is not defined, it is not defined. If it
is not defined and a programmer makes unwarranted and incorrect
assumptions about what they think it means, then the programmer needs
to update his or her understanding of the language. They don't get to
blame the compiler or the compiler writer for not making the same
unfounded assumptions that they did.

> In C99, the original definition of undefined behavior is this:
>
>   behavior, upon use of a nonportable or erroneous program construct or
>   of erroneous data, for which this International Standard imposes no
>   requirements.
>
>   NOTE: Possible undefined behavior ranges from ignoring the situation
>   completely with unpredictable results, to behaving during translation
>   or program execution in a documented manner characteristic of the
>   environment (with or without the issuance of a diagnostic message), to
>   terminating a translation or execution (with the issuance of a
>   diagnostic message).
>
> There is an obvious interpretation of this text which rules out
> the too-clever optimizations.

I presume you mean "a documented manner characteristic of the
environment". That would be something like gcc's "-fwrapv" flag.
It takes something that is undefined in the C standards - signed
integer overflow - and gives it a /documented/ behaviour. The key here
is "documented". Now the behaviour is /defined/ - it is UB as far as
the C standards are concerned, but /defined/ behaviour as far as the
implementation is concerned. And you can rely on it, as long as you
use a compiler with such a flag and documented behaviour.

The more general and obvious interpretation of the definition is
"imposes no requirements" - you have no right to any expectations
about the results of the undefined behaviour. In particular, while you
might have some guesses as to what will happen at that particular
point, you have no idea about the knock-on effects.

> If the compiler optimizes construct Y based on some other construct
> X being free of undefined behavior, and that assumption turns
> out to be false, that compiler has not conformed to the "NOTE"
> part of the treatment of undefined behavior.
>
> Firstly, it has not "ignor[ed] the situation completely". You cannot
> assume that X is well-defined, in order to treat Y in some way, and yet
> say that you're completely ignoring the situation of X. If the UB
> problem in X causes a secondary problem with the way Y was translated,
> that is a problem with the assumption that was made about X. That
> assumption was made because it's possible for X to be undefined, and
> that possibility was explicitly disregarded, which is not the same as
> completely ignored.
>
> The situation also isn't a documented manner characteristic of
> the environment.
>
> It also isn't a termination of translation or execution with or without
> a diagnostic message.
>
> The situation doesn't conform to the NOTE.
>
> C compiler writers should take the NOTE more seriously.

The "note" is just a note, not part of the language definition, and it
gives some example treatments of undefined behaviour, not an exclusive
list of options. No, compiler writers do /not/ have to take it
seriously.
If the C standards committee had wanted to impose restrictions on what
can happen during undefined behaviour, they would have given them
instead of saying "no requirements". (And the committee are smart
enough to know that imposing restrictions on UB in general is not
possible.)

Of course it would be possible to impose restrictions in particular
cases. It would be fine to say, for example, that signed integer
overflow gives an unspecified integer value as a result, rather than
undefined behaviour. Or that attempting to dereference an invalid
pointer will actually perform the memory read or write regardless. But
that would be giving definitions to behaviour that is currently
undefined - you cannot impose meaning on undefined behaviour itself.