Path: csiph.com!weretis.net!feeder6.news.weretis.net!feeder.usenetexpress.com!feeder-in1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Andy Walker Newsgroups: comp.compilers Subject: Re: Bounds checking, Optimization techniques and undefined behavior Date: Mon, 6 May 2019 01:15:53 +0100 Organization: Not very much Lines: 135 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <19-05-032@comp.compilers> References: <19-04-021@comp.compilers> <19-04-023@comp.compilers> <19-04-037@comp.compilers> <19-04-039@comp.compilers> <19-04-042@comp.compilers> <19-04-044@comp.compilers> <19-04-047@comp.compilers> <19-05-004@comp.compilers> <19-05-006@comp.compilers> <19-05-016@comp.compilers> <19-05-020@comp.compilers> <19-05-024@comp.compilers> <19-05-025@comp.compilers> <19-05-028@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="20891"; mail-complaints-to="abuse@iecc.com" Keywords: debug, design Posted-Date: 05 May 2019 21:54:52 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Content-Language: en-GB Xref: csiph.com comp.compilers:2267 On 05/05/2019 11:14, Bart wrote: >>> C is just a mess; it has arrays of sorts, but people generally use raw >>> pointers without associated bounds. Maybe that's one reason why your C >>> didn't have it. Or did it somehow manage it if enabled? >>      This isn't really a problem with C, the language.  It's clear in >> the reference manual right back to K&R C and in the various standards >> that pointers always have associated bounds. > But how do they get there? Take this: >     int A[10], *p; >     p = &A[3]; > You intend p to refer to the 4-element slice A[3..6], but how does the > language know that? How can it stop code from writing to p[5]? You may intend that, but C doesn't. C knows, after that code, that "p" points within "A" and so can be bounds-checked that "p+i" is valid for, and only for, A <= p+i <= A+10 [and for dereferencing only if p+i < A+10. So for bounds checking you need "fat pointers" and the fat pointer that is "p" must contain the bounds information required to check that; IOW, A, p-A and 10, or near equivalent. > Or you intend to index p[-2] to get at the preceding elements. Actually > using negative indexing is quite common, but surely all array bounds in > C are presumed to start from 0? Arrays may start there, but pointers are not arrays. > Or this: >   struct {int a,b,c,d;} S; >   p = &S.a; > You intend p to be used to access a,b,c,d as an int[4] array, but p's > bounds will say it's only one element long. Again, you may intend that, but C doesn't. AFAICT, nothing in the C standard [any version] encourages the belief that your "S" is an array. > Or this: >    int *p = malloc(sizeof(int)*1000); >    int *q = p+400; > You are allocating one pool of memory than sub-allocating that into > smaller objects, here into a 20-element array headed by q. But how does > the language know that? It doesn't. You seem to be building much more into your C code that C actually allows. You have allocated a pool of 1000 integers, and then made "q" point into that pool. That's all. If you now write [say] "q--;", or "q[100] = 0;" then the bounds checking will reveal that "q" is still within the pool, but "q[-500] = q[603] = 0;" [eg] is UB, which, with a compiler that aborts on illegal accesses will cause your program to fail. > Or maybe p allocates memory for a 2D array, and q refers to one row of > that; you can't detect accesses beyond that row. Of course you can, at least when they go beyond the bounds of the allocated memory. > I can see how the language can figure out the overall bounds of the > block of memory a pointer refers to. But that can be a long way from > knowing the precise bounds of any array or sub-array, when represented > via a pointer to the first element. The bounds of the object that "q" [in your example] refers to are all that is needed for an "array-bound check". > So the problems are either that bounds transgressions can't be detected > properly as 'fat pointers' only have broad info about allocations, or > they will give false alarms when you want to do some pointer-offset > manipulations that C normally allows. "C normally allows". What you mean seems to be that compilers that don't check such things [the usual case] don't detect the UB, and so either do what you intended or ignore the bug. Compilers are free to ignore UB, but it's still incorrect code, and you would have no right to complain if your program emitted the traditional nasal demons. >> [...] It very likely means that arrays too need to >> be implemented differently, with proper descriptors. > Which C is not going to have without breaking just about every existing > program. Besides, most array work in C doesn't actually use arrays, but > explicit pointers. It would break programs with [perhaps undetected] bugs. [I say "perhaps" because you may have decided that a technical bug in fact works with a particular program and compiler, and so that the bug does not need to be corrected. But you have, again, no right to complain if that view comes back to bite you a few years down the line.] [...] > With language support, it need have no cost. For example, suppose that > array A did carry its bounds with it (or are statically known), then in > code like this: >     for i in A do       # (iterate over bounds not values) >         A[i] := 0 >     end > the compiler knows it doesn't need to bounds-check each access. Or here: >     forall x in A do    # (iterate over values) >         print x >     end Well, of course [though you would need guarantees that "i" and "x" couldn't change inside the loop]. But that would be a much bigger change to C than merely checking that pointers stay within bounds, already a part [though commonly ignored] of the language. Note too that the checking can equally be zapped, under optimisation, for many of the common idioms for looping around arrays. [Moderator:] > [Every pointer in C points either to a static object or to one that > comes from a handful of routines like malloc() or mmap() so it > shouldn't be too hard to do bounds checking within the allocated > object, give or take unsafe typecasts which I don't think are common > in application code. "Unsafe typecasts" are already UB! > But you are certainly correct that C isn't > expressive enough to do slices or the like. -John] Um. There should be no difficulty in defining library procedures for the purpose [not strictly in C, of course, but lots of library stuff has to be written for a specific compiler/OS/computer]. You would need procedures that did clever things to fat pointers; they could default to doing unclever things when invoked on systems without fat pointers. So we could say that C doesn't currently have slices, but that's not *quite* the same as saying it's not expressive enough. -- Andy Walker, Nottingham. [In the struct { int a,b,c,d; } S example it is my understanding that &S and &S.a have to be the same, the four ints have to be in the order declared, and they all have the same alignment so it is valid abeit poor form to treat them as an array. -John]