Path: csiph.com!weretis.net!feeder6.news.weretis.net!feeder.usenetexpress.com!feeder-in1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.compilers
Subject: Re: Bounds checking, Optimization techniques and undefined behavior
Date: Mon, 6 May 2019 16:32:20 +0200
Organization: A noiseless patient Spider
Lines: 189
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <19-05-040@comp.compilers>
References: <19-04-021@comp.compilers> <19-04-023@comp.compilers> <19-04-037@comp.compilers> <19-04-039@comp.compilers> <19-04-042@comp.compilers> <19-04-044@comp.compilers> <19-04-047@comp.compilers> <19-05-004@comp.compilers> <19-05-006@comp.compilers> <19-05-016@comp.compilers> <19-05-020@comp.compilers> <19-05-024@comp.compilers> <19-05-025@comp.compilers> <19-05-028@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="38641"; mail-complaints-to="abuse@iecc.com"
Keywords: C, errors, comment
Posted-Date: 06 May 2019 10:54:00 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Content-Language: en-GB
Xref: csiph.com comp.compilers:2275

On 05/05/2019 12:14, Bart wrote:
> On 04/05/2019 10:45, Andy Walker wrote:
>> On 03/05/2019 23:10, Bart wrote:
>
>>> C is just a mess; it has arrays of sorts, but people generally use raw
>>> pointers without associated bounds. Maybe that's one reason why your C
>>> didn't have it. Or did it somehow manage it if enabled?
>>
>>      This isn't really a problem with C, the language.  It's clear in
>> the reference manual right back to K&R C and in the various standards
>> that pointers always have associated bounds.
>
> But how do they get there? Take this:
>
>     int A[10], *p;
>     p = &A[3];
>
> You intend p to refer to the 4-element slice A[3..6], but how does the
> language know that? How can it stop code from writing to p[5]?

C does not have types representing slices.  And even its array types are
a bit limited - decaying to pointers when they are used as lvalues in
expressions, and thus losing their array properties (like size).  This
pointer decay is very convenient in most uses of arrays, but it does
mean there is less checking of them and you can't copy them by value.
You can wrap an array in a struct to get a fixed array.

So a compiler (or additional static error checker) can stop you from
writing p[7], but not from writing p[5].

>
> Or you intend to index p[-2] to get at the preceding elements.

Again, p[-2] is valid (according to the rules of the language), but
p[-4] is not.  gcc (and some other compilers) can warn you here.

> Actually
> using negative indexing is quite common, but surely all array bounds in
> C are presumed to start from 0?

That is true for arrays.  But the [] operation is not an operation on
arrays - it is an operation on pointers.  As long as "p" points to a
compatibly typed item within an object (like an array), and "p[x]"
points to another compatibly typed item within the same object, then
"p[x]" is valid and means the same as "*(p + x)".


Using negative indexing is very /uncommon/ in C, but it does occur and
is valid (if used correctly, of course).

>
> Or this:
>
>   struct {int a,b,c,d;} S;
>
>   p = &S.a;
>
> You intend p to be used to access a,b,c,d as an int[4] array, but p's
> bounds will say it's only one element long.
>

In C, a struct of 4 int's and an array of 4 ints are not the same thing.

> Or this:
>
>    int *p = malloc(sizeof(int)*1000);
>
>    int *q = p+400;
>
> You are allocating one pool of memory than sub-allocating that into
> smaller objects, here into a 20-element array headed by q. But how does
> the language know that?

It doesn't know it.

C can't express everything you might want to say - nor can any language.
 Sometimes it is possible to say a bit more, but with awkward and
inconvenient syntax - which may or may not be worth the effort.

Some static analysers (like pc-lint) can let you give more information
in the form of special comments, and then they can spot more bugs.

>
> Or maybe p allocates memory for a 2D array, and q refers to one row of
> that; you can't detect accesses beyond that row.
>

Compilers might be able to spot problems in some cases, and static or
run-time analysers might be able to spot others.

No language can stop you making any mistakes, and no tool can spot all
errors.  C does not have every feature that a language could have - it
is intended to have relatively few safety features, and for the
programmer to take responsibility for writing correct code.


> I can see how the language can figure out the overall bounds of the
> block of memory a pointer refers to. But that can be a long way from
> knowing the precise bounds of any array or sub-array, when represented
> via a pointer to the first element.

That is not a feature supported by C.

>
> So the problems are either that bounds transgressions can't be detected
> properly as 'fat pointers' only have broad info about allocations, or
> they will give false alarms when you want to do some pointer-offset
> manipulations that C normally allows.

C does not normally use "fat pointers" at all.  But if you want to use
fat pointers, you can make your own struct types holding them.  And you
are free to make your own slice and sub-array types from them, with
whatever run-time checking you want.  Their usage is going to be a bit
ugly in C, since you can't have operator overloading or other neat
syntax (got to C++ if you want that), but it will work.  C only gives
you the low-level tools here - if you want something more advanced and
higher level, you need to make it yourself.

>
>   It's just that the K&R
>> compiler with Unix didn't check, perhaps for decent reasons, and that
>> the relatively few attempts to produce compilers that do have not been
>> all that successful.  You can't do the checking unless every pointer
>> "knows" which object it is supposed to be pointing into, which rather
>> implies "fat" pointers, which makes all pointer operations take longer
>> and require more code.  It very likely means that arrays too need to
>> be implemented differently, with proper descriptors.
>
> Which C is not going to have without breaking just about every existing
> program. Besides, most array work in C doesn't actually use arrays, but
> explicit pointers.
>

C implementations don't (usually) use fat pointers, for efficiency
reasons.  But a compiler can track the provenance of pointers and do a
good deal of static analysis on them.  The C and C++ committees are both
in the process of considering formal definitions of pointer provenance
to make this clearer, and hopefully compilers will add more static
checking of them.  With link-time optimisation allowing more swapping of
information between modules, this will get a good deal more powerful in
the future.

>   As discussed over
>> in "comp.lang.misc" recently, that's not a Bad Thing;  it gives you
>> extra facilities virtually free, as well as the added security.  But
>> it does result in larger executables and slower execution -- unless you
>> really do have hardware support [cf Burroughs] -- so historically it's
>> not been popular.
>
> With language support, it need have no cost. For example, suppose that
> array A did carry its bounds with it (or are statically known), then in
> code like this:
>
>     for i in A do       # (iterate over bounds not values)
>         A[i] := 0
>     end
>
> the compiler knows it doesn't need to bounds-check each access. Or here:

That works in C++, but not C.  (Yes, it would be nice to have it in C.
Certainly for any new programming language I would hope to see something
like that.)

A key problem with languages that have "proper" arrays is that it can be
very difficult to handle array parameters, and functions that will
handle arrays of different sizes.  Array types have their size as part
of their type, so a function that takes an array of 10 ints has a
different type signature from one that takes an array of 20 ints.  C
requires you to pass this size information in another way, such as via
an extra parameter - you get the flexibility to choose your solution,
but the responsibility to implement it yourself.  Some languages can
provide "hidden parameters" to handle the situation, which would make
certain kinds of checks easier, and is not a bad idea.

>
>     forall x in A do    # (iterate over values)
>         print x
>     end
>
> [Every pointer in C points either to a static object or to one that
> comes from a handful of routines like malloc() or mmap() so it
> shouldn't be too hard to do bounds checking within the allocated
> object, give or take unsafe typecasts which I don't think are common
> in application code.  But you are certainly correct that C isn't
> expressive enough to do slices or the like. -John]

Pointers can also point to non-static local data ("stack" data, in most
implementations).
[Oh, of course.  But the bounds of those are known, too. -John]