From: Tim Rentsch
Newsgroups: comp.std.c
Subject: Re: Does reading an uninitialized object have undefined behavior?
Date: Thu, 03 Aug 2023 13:13:26 -0700
Message-ID: <864jlfj34p.fsf@linuxsc.com>
References: <87zg3pq1ym.fsf@nosuchdomain.example.com>

Repeating the question stated in the Subject line:

    Does reading an uninitialized object [always] have undefined behavior?

Background: Annex J part 2 says (in various phrasings in different revisions of the C standard, with the one below being taken from C90):

    The value of an uninitialized object that has automatic storage duration is used before a value is assigned [is undefined behavior] (6.5.7)

Remembering that Annex J is informative rather than normative, is this statement right even for a type that has no trap representations? To ask that question another way, is this statement always right or is it just a (perhaps useful) approximation?

I think this question can be answered convincingly by reviewing the subject's history in each revision of the ISO C standard.

We start in C90.

In C90 reading the value of an uninitialized object is always undefined behavior (and that includes malloc()ed storage as well as automatic storage duration objects). The C90 standard says, in 6.5.7:

    If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.

and in 7.10.3.3:

    The malloc function allocates space for an object whose size is specified by size and whose value is indeterminate.

The term "indeterminate" is not defined in C90, but accessing storage that is indeterminate is explicitly undefined behavior. Indeed such uses are part of the /definition/ of undefined behavior - C90 says in 3.16 (which is an entry in Definitions):

    undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements.

So for C90 we have a clear answer: always undefined behavior for accessing any uninitialized object.

Unfortunately the C90 scheme has some serious issues. There is no exception for reading using a character type. More seriously, although C90 gives some situations that cause values to be indeterminate, it doesn't say anything about making them /not/ be indeterminate. We can guess (but only guess) that assigning a value to the object as a whole removes indeterminate-ness, but what about these cases (and other similar ones):

    int x;
    *(char*)&x = 0;   // is the value of x now indeterminate or not?

    struct { int x, y; } s;
    s.x = 0;          // is the value of s now indeterminate or not?

Again, we can make guesses about what these answers should be, but the C90 standard doesn't say. Clearly C90 has some significant deficiencies.

Next we look at C99. (Actually, before we do that, I should mention that C90 was amended and corrected in 1994, 1995, and 1996, by the three intermediate documents ISO/IEC 9899/COR1, ISO/IEC 9899/AMD1, and ISO/IEC 9899/COR2. As far as I am aware these revisions have no bearing on the matter at hand.)

The C99 standard represents a substantial revision and expansion of the C90 standard. The relationship between uninitialized memory and undefined behavior is nearly completely rewritten, and also made more concrete. There's lots to look at here.

Starting at the top, the definition of undefined behavior is revised not to give any mention of indeterminately valued objects. Here is section 3.4.3 paragraph 1:

    undefined behavior
    behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

(Incidentally the section and paragraph references given in this part of the discussion are relative to the ISO N1256 document.)

The next most prominent change is that "indeterminate value" is explicitly defined, in section 3.17.2 paragraph 1:

    indeterminate value
    either an unspecified value or a trap representation

This definition makes use of two new terms, "unspecified value" and "trap representation", that were not used in C90. The term unspecified value is defined immediately following, in 3.17.3 p1:

    unspecified value
    valid value of the relevant type where this International Standard imposes no requirements on which value is chosen in any instance

There is also an informative note in p2:

    NOTE An unspecified value cannot be a trap representation.

The term "trap representation" is defined in 6.2.6.1 p5:

    Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined.41) Such a representation is called a /trap representation/.

The slant characters around "trap representation" indicate italics, which the C standard uses to denote a term being defined. Also there is a '41)' footnote reference

    41) Thus, an automatic variable can be initialized to a trap representation without causing undefined behavior, but the value of the variable cannot be used until a proper value is stored in it.

which underscores the non-undefined-behavior aspect of using character types to change the object representation (and hence the value) of an object.

The C99 text doesn't use the term "trap representation" very often. There are several cases where certain types are ruled out from having trap representations; a few cases where a result /might be/ a trap representation; and a case involving integer types where there is an implementation-defined choice as to whether a specific combination of value bits is a valid value or a trap representation. Also, in Annex J part 2, the list of undefined behaviors, there are these summary items:

    A trap representation is read by an lvalue expression that does not have character type (6.2.6.1).

    A trap representation is produced by a side effect that modifies any part of the object using an lvalue expression that does not have character type (6.2.6.1).

which of course correspond directly to what is said in the definition of trap representation.

Based on various passages in section 6.2.6, which describes the representation of types, we can deduce that for some integer types all bit combinations must be a valid value, and so no trap representations are possible for those types. Such types always include 'unsigned char', and may also include other integer types depending on the size of the type, the value of CHAR_BIT, and the values given in <limits.h> for the range of the type in question. (More concretely, if the set of distinct values for type T has 2**(sizeof(T)*CHAR_BIT) elements, then all object representations are valid values, and thus type T cannot have any trap representations.)

There are three points worth mentioning regarding unspecified values and trap representations. One is that unspecified values are always valid values, and never by themselves cause undefined behavior. Two is that the distinction between an unspecified value and a trap representation depends on the type used to access the object.
Three is that, once we know the type of an access, whether a given object holds a valid value or a trap representation depends only on the bits and bytes that make up the object representation of the object, and in particular not on any hidden "magic" state associated with the object. (There is one case though that deserves a closer look, which is explained further on.)

The rule for trap representations is simple and clear: any access of an object whose object representation is a trap representation of the access's type is undefined behavior, and this consequence is accurately portrayed in Annex J part 2.

Having settled the question for trap representations, how about indeterminate values? Ruling out the definition and an entry in the index, the term "indeterminate value" (or values plural) appears in just six places in the C99 standard: three in informative passages (usually examples), and three normative passages, those being 6.7.8 paragraph 9 (about unnamed members), 6.8 paragraph 3 (about declarations for objects with automatic storage duration), and 7.20.3.4 paragraph 2 (about bytes added by a call to realloc()). The sentence in 6.8 paragraph 3 deserves quoting:

    The initializers of objects that have automatic storage duration, and the variable length array declarators of ordinary identifiers with block scope, are evaluated and the values are stored in the objects (including storing an indeterminate value in objects without an initializer) each time the declaration is reached in the order of execution, as if it were a statement, and within each declaration in the order that declarators appear.

Section 7 has many places where the word "indeterminate" appears without being followed by "value". I think most of these can be safely skipped over, but the description of malloc() deserves quoting (it is 7.20.3.3 paragraph 2):

    The malloc function allocates space for an object whose size is specified by size and whose value is indeterminate.

Presumably the sentence here is meant to express the same idea as the parallel passage describing the results from realloc(), which says (in 7.20.3.4 paragraph 2):

    Any bytes in the new object beyond the size of the old object have indeterminate values.

The word "indeterminate" without being followed by "value" is used in just six other places in the standard: five in the main body (all of which are part of normative text), plus one entry in Annex J part 2 (which is of course informative). The normative uses may be seen to be in two categories, as follows.

Four of the five normative uses are basically restatements of the long sentence from 6.8 paragraph 3; they are in 6.2.4 paragraph 5 (two uses) and paragraph 6, and 6.7.8 paragraph 10. Here are excerpts showing these four occurrences (all of which refer to objects with automatic storage duration):

    The initial value of the object is indeterminate.

    [if an object had no initializer] the value becomes indeterminate each time the declaration is reached.

    The initial value of the object is indeterminate.

    If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.

Although these passages use different phrasing, it seems clear they are meant to mirror the parenthetical phrase in 6.8 p3, "storing an indeterminate value in objects without an initializer"; presumably the difference in phrasing simply reflects the styles of the respective sections: 6.8 gives an imperative description, whereas 6.2.4 and 6.7 tend to be more declarative in style. (The last of these excerpts matches word-for-word with the analogous sentence in C90.)

That the C99 standard considers these five passages as expressing the same idea can be seen by them all being referenced in a single entry given in Annex J part 2:

    The value of an object with automatic storage duration is used while it is indeterminate (6.2.4, 6.7.8, 6.8).

Compare this text with the corresponding entry in C90.
One reason for the difference is that in C99, unlike in C90, an object can become "unassigned" after it is first assigned (which is a consequence in C99 of being able to mix declarations and statements). So rather than say "before a value is assigned" the C99 standard says "while it is indeterminate".

The one other place where the word "indeterminate" is used without being followed by "value" is in 6.2.4 paragraph 2:

    The value of a pointer becomes indeterminate when the object it points to reaches the end of its lifetime.

(The analogous sentence in C90 says basically the same but using different phrasing, partly because C90 doesn't have any explicit definition of "lifetime", which of course C99 does.) There is a corresponding entry for this passage in Annex J part 2 (and which actually doesn't use the word indeterminate):

    The value of a pointer to an object whose lifetime has ended is used (6.2.4).

There is a subtle but important difference between this rule and the other passages mentioned above. In all of the other cases there is a specific object being referenced. In the rule here, we aren't talking about a particular object, nor even just one object necessarily (there could be many), but possibly about values that aren't in an object at all. Consider this code fragment:

    char *p = malloc( 1 );
    char *q = p + (free(p),0);

It seems clear that the second line is meant to be undefined behavior /even if the (leftmost) access of p has already taken place before the call to free() is done/. It isn't an access to an object (whether indeterminate or not) that is causing the problem. Rather, it is the use of a value -- valid at the time the value was obtained -- that has been rendered /invalid/ between the time the value was loaded from p and the time the value is used in a '+' operation.

Of course, we all understand what's really going on here.
In real computer hardware, the bits of a pointer value don't magically change when a free() is done (or when an object goes out of scope and its lifetime ends, etc). Instead, the bits stay the same, but whether the bits are meaningful or not (or whether they have the same meaning as before) depends on the state of the "memory system" as a whole. The term "memory system" is in quotes because it is meant to include not just state in the actual hardware but also assumptions made by the compiled code; a pointer to memory in a departed stack frame may be perfectly fine as far as the hardware is concerned, but it violates an assumption made by the compiler that the associated memory may be (or already have been) reused for another purpose.

One problem with this understanding is that it isn't amenable to being expressed in the language of the abstract machine. So C99 glosses over the problem by saying "the value of a pointer becomes indeterminate when ...", disregards what the definition of "indeterminate value" says, and then pretends (in Annex J.2) that using any such value is undefined behavior.

The text in the standard is very clear: reading a trap representation is always undefined behavior (unless accessed using a character type). There is nothing in the normative text of the standard that says accessing an indeterminate value is undefined behavior. In fact, if we take the text of the standard at its word, /every/ object has an indeterminate value, because every object representation is either a valid value or a trap representation.

If we ignore pointer types we have an answer to our question: any type that has no trap representations never causes undefined behavior by being accessed. Then why does the entry in Annex J.2 give a blanket statement that any use is undefined behavior?
A reasonable guess is that entries in Annex J are meant to provide useful shorthands without necessarily being completely accurate (consider for example that the exception for access done using a character type is not mentioned in the Annex J.2 entry -- a clear omission).

There is more to say about pointer types. Considering how long this memo is already it seems better to defer that to a separate posting.

Next we look at C11.

With respect to the question being considered, the C11 standard is almost exactly the same as the C99 standard. There are two differences. First, there is a cosmetic change in that the term "trap representation" is given a summary definition in section 3.19.4; the paragraph in 6.2.6 where "trap representation" was previously defined in C99 is unchanged except that in C11 there are no italics.

The second difference is not a revision but an addition. In section 6.3.2.1 paragraph 2, talking about lvalue conversion, one sentence has been added at the end of the paragraph:

    If the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken), and that object is uninitialized (not declared with an initializer and no assignment to it has been performed prior to use), the behavior is undefined.

Naturally there is a corresponding entry that has been added to Annex J.2:

    An lvalue designating an object of automatic storage duration that could have been declared with the register storage class is used in a context that requires the value of the designated object, but the object is uninitialized. (6.3.2.1).

The motivation for this new rule reportedly reflects hardware behavior, on some more recent chips, for some stack-allocated variables.

The added text has several points worth noting. One, the rule adds a specific, narrow case of undefined behavior that is simple and clearly delineated. Two, it does not use the term "indeterminate" or "indeterminate value".
Instead the rule is written in terms of initialization and assignment. By avoiding "indeterminate", it avoids any uncertainty about whether undefined behavior must result from using an indeterminate value. Three, it provides indirect evidence that use of an indeterminate value is not necessarily undefined behavior, because if it were then this new rule would not be necessary. Four, the condition of undefined behavior is expressed using imperative phrasing: what matters is what has been done, or not done, to the object in question. This choice makes this rule a supplement, not a replacement, for 6.8 p3 et al. Consider this example function definition:

    double
    example( double in ){
        unsigned yet = 0;
      redux: ;
        double d;
        if( !yet ){
            d = in;
            yet++;
            goto redux;
        }
        return d;
    }

The use of 'd' in 'return d;' might give undefined behavior, because 'd' may have a trap representation under 6.8 p3. But the code doesn't violate the conditions of 6.3.2.1 p2, because an assignment has been done before the lvalue conversion in the final statement; the intervening evaluation of 'double d;' doesn't change that. Note also that the clause in 6.8 p3 for such declarations, "storing an indeterminate value in objects without an initializer", does not interfere with the application of the rule in 6.3.2.1 p2, because that rule is written in terms of assignment, and not in terms of storing a value (which may have been done because of the parenthetical phrase in 6.8 p3).

After C11 I have not taken the time to look at the C17 standard or the C23 draft standard while researching the topic here. I see that some changes have been made (such as "non-value representation" for "trap representation"), but to the best of my knowledge none of the key passages are substantively different. I may check on that later (but no promises on when or whether).

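As an aside on the counting argument given in the C99 discussion (a type T with 2**(sizeof(T)*CHAR_BIT) distinct values cannot have trap representations), here is a minimal sketch of how one might check that condition for unsigned integer types on a particular implementation. The helper names value_bits and all_bits_are_value_bits are hypothetical, introduced only for illustration, and the sketch assumes a C99 or later compiler (for <stdint.h>). It covers only unsigned types; signed types need a separate argument because of the sign bit and the implementation-defined choices mentioned earlier.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of value bits implied by the maximum
   value of an unsigned type (the maximum always has the form 2**N - 1). */
unsigned value_bits(uintmax_t max)
{
    unsigned n = 0;
    while (max != 0) {
        n++;
        max >>= 1;
    }
    return n;
}

/* Hypothetical helper: nonzero when every bit of the object
   representation is a value bit, that is, when the type has
   2**(size*CHAR_BIT) distinct values and so no padding bits. */
int all_bits_are_value_bits(uintmax_t max, size_t size)
{
    return value_bits(max) == size * CHAR_BIT;
}
```

Calling all_bits_are_value_bits(UINT_MAX, sizeof(unsigned int)) reports whether 'unsigned int' meets the condition on the implementation at hand; for 'unsigned char' (UCHAR_MAX and size 1) the answer is yes on every conforming implementation.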
Summary: my reading is that accessing an object that has not been explicitly stored into since its declaration was evaluated is necessarily undefined behavior in C90, but not necessarily undefined behavior in C99 and C11 (and AFAIAA also in C17 and the upcoming C23). My reasoning is given in detail above.

Postscript: this commentary has taken much longer to write than I thought it would, for the most part because I made an early decision to be systematic and thorough. I hope the effort has helped the readers gain confidence in the explanations and conclusions stated. I may return to the deferred topic about pointer types but have no plans at present about when that might be.