Path: csiph.com!weretis.net!feeder6.news.weretis.net!feeder.usenetexpress.com!feeder-in1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: David Brown Newsgroups: comp.compilers Subject: Re: Bounds checking, Optimization techniques and undefined behavior Date: Mon, 6 May 2019 16:32:20 +0200 Organization: A noiseless patient Spider Lines: 189 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <19-05-040@comp.compilers> References: <19-04-021@comp.compilers> <19-04-023@comp.compilers> <19-04-037@comp.compilers> <19-04-039@comp.compilers> <19-04-042@comp.compilers> <19-04-044@comp.compilers> <19-04-047@comp.compilers> <19-05-004@comp.compilers> <19-05-006@comp.compilers> <19-05-016@comp.compilers> <19-05-020@comp.compilers> <19-05-024@comp.compilers> <19-05-025@comp.compilers> <19-05-028@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="38641"; mail-complaints-to="abuse@iecc.com" Keywords: C, errors, comment Posted-Date: 06 May 2019 10:54:00 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Content-Language: en-GB Xref: csiph.com comp.compilers:2275 On 05/05/2019 12:14, Bart wrote: > On 04/05/2019 10:45, Andy Walker wrote: >> On 03/05/2019 23:10, Bart wrote: > >>> C is just a mess; it has arrays of sorts, but people generally use raw >>> pointers without associated bounds. Maybe that's one reason why your C >>> didn't have it. Or did it somehow manage it if enabled? >> >>      This isn't really a problem with C, the language.  It's clear in >> the reference manual right back to K&R C and in the various standards >> that pointers always have associated bounds. > > But how do they get there? Take this: > >     int A[10], *p; >     p = &A[3]; > > You intend p to refer to the 4-element slice A[3..6], but how does the > language know that? How can it stop code from writing to p[5]? C does not have types representing slices. And even its array types are a bit limited - decaying to pointers when they are used as lvalues in expressions, and thus losing their array properties (like size). This pointer decay is very convenient in most uses of arrays, but it does mean there is less checking of them and you can't copy them by value. You can wrap an array in a struct to get a fixed array. So a compiler (or additional static error checker) can stop you from writing p[7], but not from writing p[5]. > > Or you intend to index p[-2] to get at the preceding elements. Again, p[-2] is valid (according to the rules of the language), but p[-4] is not. gcc (and some other compilers) can warn you here. > Actually > using negative indexing is quite common, but surely all array bounds in > C are presumed to start from 0? That is true for arrays. But the [] operation is not an operation on arrays - it is an operation on pointers. As long as "p" points to a compatibly typed item within an object (like an array), and "p[x]" points to another compatibly typed item within the same object, then "p[x]" is valid and means the same as "*(p + x)". Using negative indexing is very /uncommon/ in C, but it does occur and is valid (if used correctly, of course). > > Or this: > >   struct {int a,b,c,d;} S; > >   p = &S.a; > > You intend p to be used to access a,b,c,d as an int[4] array, but p's > bounds will say it's only one element long. > In C, a struct of 4 int's and an array of 4 ints are not the same thing. > Or this: > >    int *p = malloc(sizeof(int)*1000); > >    int *q = p+400; > > You are allocating one pool of memory than sub-allocating that into > smaller objects, here into a 20-element array headed by q. But how does > the language know that? It doesn't know it. C can't express everything you might want to say - nor can any language. Sometimes it is possible to say a bit more, but with awkward and inconvenient syntax - which may or may not be worth the effort. Some static analysers (like pc-lint) can let you give more information in the form of special comments, and then they can spot more bugs. > > Or maybe p allocates memory for a 2D array, and q refers to one row of > that; you can't detect accesses beyond that row. > Compilers might be able to spot problems in some cases, and static or run-time analysers might be able to spot others. No language can stop you making any mistakes, and no tool can spot all errors. C does not have every feature that a language could have - it is intended to have relatively few safety features, and for the programmer to take responsibility for writing correct code. > I can see how the language can figure out the overall bounds of the > block of memory a pointer refers to. But that can be a long way from > knowing the precise bounds of any array or sub-array, when represented > via a pointer to the first element. That is not a feature supported by C. > > So the problems are either that bounds transgressions can't be detected > properly as 'fat pointers' only have broad info about allocations, or > they will give false alarms when you want to do some pointer-offset > manipulations that C normally allows. C does not normally use "fat pointers" at all. But if you want to use fat pointers, you can make your own struct types holding them. And you are free to make your own slice and sub-array types from them, with whatever run-time checking you want. Their usage is going to be a bit ugly in C, since you can't have operator overloading or other neat syntax (got to C++ if you want that), but it will work. C only gives you the low-level tools here - if you want something more advanced and higher level, you need to make it yourself. > >   It's just that the K&R >> compiler with Unix didn't check, perhaps for decent reasons, and that >> the relatively few attempts to produce compilers that do have not been >> all that successful.  You can't do the checking unless every pointer >> "knows" which object it is supposed to be pointing into, which rather >> implies "fat" pointers, which makes all pointer operations take longer >> and require more code.  It very likely means that arrays too need to >> be implemented differently, with proper descriptors. > > Which C is not going to have without breaking just about every existing > program. Besides, most array work in C doesn't actually use arrays, but > explicit pointers. > C implementations don't (usually) use fat pointers, for efficiency reasons. But a compiler can track the provenance of pointers and do a good deal of static analysis on them. The C and C++ committees are both in the process of considering formal definitions of pointer provenance to make this clearer, and hopefully compilers will add more static checking of them. With link-time optimisation allowing more swapping of information between modules, this will get a good deal more powerful in the future. >   As discussed over >> in "comp.lang.misc" recently, that's not a Bad Thing;  it gives you >> extra facilities virtually free, as well as the added security.  But >> it does result in larger executables and slower execution -- unless you >> really do have hardware support [cf Burroughs] -- so historically it's >> not been popular. > > With language support, it need have no cost. For example, suppose that > array A did carry its bounds with it (or are statically known), then in > code like this: > >     for i in A do       # (iterate over bounds not values) >         A[i] := 0 >     end > > the compiler knows it doesn't need to bounds-check each access. Or here: That works in C++, but not C. (Yes, it would be nice to have it in C. Certainly for any new programming language I would hope to see something like that.) A key problem with languages that have "proper" arrays is that it can be very difficult to handle array parameters, and functions that will handle arrays of different sizes. Array types have their size as part of their type, so a function that takes an array of 10 ints has a different type signature from one that takes an array of 20 ints. C requires you to pass this size information in another way, such as via an extra parameter - you get the flexibility to choose your solution, but the responsibility to implement it yourself. Some languages can provide "hidden parameters" to handle the situation, which would make certain kinds of checks easier, and is not a bad idea. > >     forall x in A do    # (iterate over values) >         print x >     end > > [Every pointer in C points either to a static object or to one that > comes from a handful of routines like malloc() or mmap() so it > shouldn't be too hard to do bounds checking within the allocated > object, give or take unsafe typecasts which I don't think are common > in application code.  But you are certainly correct that C isn't > expressive enough to do slices or the like. -John] Pointers can also point to non-static local data ("stack" data, in most implementations). [Oh, of course. But the bounds of those are known, too. -John]