Path: csiph.com!weretis.net!feeder6.news.weretis.net!feeder.usenetexpress.com!feeder-in1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Andy Walker <anw@cuboid.co.uk>
Newsgroups: comp.compilers
Subject: Re: Bounds checking, Optimization techniques and undefined behavior
Date: Mon, 6 May 2019 01:15:53 +0100
Organization: Not very much
Lines: 135
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <19-05-032@comp.compilers>
References: <19-04-021@comp.compilers> <19-04-023@comp.compilers> <19-04-037@comp.compilers> <19-04-039@comp.compilers> <19-04-042@comp.compilers> <19-04-044@comp.compilers> <19-04-047@comp.compilers> <19-05-004@comp.compilers> <19-05-006@comp.compilers> <19-05-016@comp.compilers> <19-05-020@comp.compilers> <19-05-024@comp.compilers> <19-05-025@comp.compilers> <19-05-028@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="20891"; mail-complaints-to="abuse@iecc.com"
Keywords: debug, design
Posted-Date: 05 May 2019 21:54:52 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Content-Language: en-GB
Xref: csiph.com comp.compilers:2267

On 05/05/2019 11:14, Bart wrote:
>>> C is just a mess; it has arrays of sorts, but people generally use raw
>>> pointers without associated bounds. Maybe that's one reason why your C
>>> didn't have it. Or did it somehow manage it if enabled?
>>      This isn't really a problem with C, the language.  It's clear in
>> the reference manual right back to K&R C and in the various standards
>> that pointers always have associated bounds.
> But how do they get there? Take this:
>      int A[10], *p;
>      p = &A[3];
> You intend p to refer to the 4-element slice A[3..6], but how does the
> language know that? How can it stop code from writing to p[5]?

	You may intend that, but C doesn't.  C knows, after that code,
that "p" points within "A" and so can be bounds-checked that "p+i" is
valid for, and only for, A <= p+i <= A+10 [and for dereferencing only
if p+i < A+10.  So for bounds checking you need "fat pointers" and the
fat pointer that is "p" must contain the bounds information required to
check that;  IOW, A, p-A and 10, or near equivalent.

> Or you intend to index p[-2] to get at the preceding elements. Actually
> using negative indexing is quite common, but surely all array bounds in
> C are presumed to start from 0?

	Arrays may start there, but pointers are not arrays.

> Or this:
>    struct {int a,b,c,d;} S;
>    p = &S.a;
> You intend p to be used to access a,b,c,d as an int[4] array, but p's
> bounds will say it's only one element long.

	Again, you may intend that, but C doesn't.  AFAICT, nothing in
the C standard [any version] encourages the belief that your "S" is an
array.

> Or this:
>     int *p = malloc(sizeof(int)*1000);
>     int *q = p+400;
> You are allocating one pool of memory than sub-allocating that into
> smaller objects, here into a 20-element array headed by q. But how does
> the language know that?

	It doesn't.  You seem to be building much more into your C code
that C actually allows.  You have allocated a pool of 1000 integers, and
then made "q" point into that pool.  That's all.  If you now write [say]
"q--;", or "q[100] = 0;" then the bounds checking will reveal that "q"
is still within the pool, but "q[-500] = q[603] = 0;" [eg] is UB, which,
with a compiler that aborts on illegal accesses will cause your program
to fail.

> Or maybe p allocates memory for a 2D array, and q refers to one row of
> that; you can't detect accesses beyond that row.

	Of course you can, at least when they go beyond the bounds of
the allocated memory.

> I can see how the language can figure out the overall bounds of the
> block of memory a pointer refers to. But that can be a long way from
> knowing the precise bounds of any array or sub-array, when represented
> via a pointer to the first element.

	The bounds of the object that "q" [in your example] refers to
are all that is needed for an "array-bound check".

> So the problems are either that bounds transgressions can't be detected
> properly as 'fat pointers' only have broad info about allocations, or
> they will give false alarms when you want to do some pointer-offset
> manipulations that C normally allows.

	"C normally allows".  What you mean seems to be that compilers
that don't check such things [the usual case] don't detect the UB, and
so either do what you intended or ignore the bug.  Compilers are free to
ignore UB, but it's still incorrect code, and you would have no right to
complain if your program emitted the traditional nasal demons.

>> [...] It very likely means that arrays too need to
>> be implemented differently, with proper descriptors.
> Which C is not going to have without breaking just about every existing
> program. Besides, most array work in C doesn't actually use arrays, but
> explicit pointers.

	It would break programs with [perhaps undetected] bugs.  [I say
"perhaps" because you may have decided that a technical bug in fact works
with a particular program and compiler, and so that the bug does not need
to be corrected.  But you have, again, no right to complain if that view
comes back to bite you a few years down the line.]

[...]
> With language support, it need have no cost. For example, suppose that
> array A did carry its bounds with it (or are statically known), then in
> code like this:
>      for i in A do       # (iterate over bounds not values)
>          A[i] := 0
>      end
> the compiler knows it doesn't need to bounds-check each access. Or here:
>      forall x in A do    # (iterate over values)
>          print x
>      end

	Well, of course [though you would need guarantees that "i" and "x"
couldn't change inside the loop].  But that would be a much bigger change
to C than merely checking that pointers stay within bounds, already a part
[though commonly ignored] of the language.  Note too that the checking can
equally be zapped, under optimisation, for many of the common idioms for
looping around arrays.

[Moderator:]
> [Every pointer in C points either to a static object or to one that
> comes from a handful of routines like malloc() or mmap() so it
> shouldn't be too hard to do bounds checking within the allocated
> object, give or take unsafe typecasts which I don't think are common
> in application code.

	"Unsafe typecasts" are already UB!

>			But you are certainly correct that C isn't
> expressive enough to do slices or the like. -John]

	Um.  There should be no difficulty in defining library procedures
for the purpose [not strictly in C, of course, but lots of library stuff
has to be written for a specific compiler/OS/computer].  You would need
procedures that did clever things to fat pointers;  they could default to
doing unclever things when invoked on systems without fat pointers.  So
we could say that C doesn't currently have slices, but that's not *quite*
the same as saying it's not expressive enough.

--
Andy Walker,
Nottingham.
[In the struct { int a,b,c,d; } S example it is my understanding that &S and &S.a
have to be the same, the four ints have to be in the order declared, and they
all have the same alignment so it is valid abeit poor form to treat them as
an array. -John]