Groups > comp.compilers > #136 > unrolled thread

Dealing with load/store instructions on static tainted flow analysis

Started by	Gabriel Quadros <gabrielquadros@hotmail.com>
First post	2011-06-06 21:00 -0700
Last post	2011-06-12 12:11 +0100
Articles	5 — 5 participants

Back to article view | Back to comp.compilers

  Dealing with load/store instructions on static tainted flow analysis Gabriel Quadros <gabrielquadros@hotmail.com> - 2011-06-06 21:00 -0700
    Re: Dealing with load/store instructions on static tainted flow analysis glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-07 09:04 +0000
    Re: Dealing with load/store instructions on static tainted flow analysis kym@kymhorsell.com - 2011-06-08 07:53 +0000
    Re: Dealing with load/store instructions on static tainted flow analysis George Neuner <gneuner2@comcast.net> - 2011-06-09 18:51 -0400
    Re: Dealing with load/store instructions on static tainted flow analysis Martin Ward <martin@gkc.org.uk> - 2011-06-12 12:11 +0100

#136 — Dealing with load/store instructions on static tainted flow analysis

From	Gabriel Quadros <gabrielquadros@hotmail.com>
Date	2011-06-06 21:00 -0700
Subject	Dealing with load/store instructions on static tainted flow analysis
Message-ID	<11-06-010@comp.compilers>

Dear guys,

    I am trying to implement a pass to detect information leak in
programs.  The problem is a variation of static tainted-flow analysis:
I have some source functions, sink functions and sanitizers. I want to
know if it is possible for data to flow from source to sink without
going across a sanitizer.

    I am using LLVM, and I am analyzing the LLVM bitcodes. My pass is
working well, but I am having some issues with memory. Once
information flows to the heap, it is hard to know how it propagates to
the rest of the program.  Example:

a = SOURCE
b = malloc(100)
...
b[i] = a
...
SINK = c[j]
...

So, the problem is that it is hard to know that c != b and i != j.
Once information flows into memory, the safest thing to do is to flag
the whole memory as a SOURCE. Of course, that is very conservative. I
was wondering if you guys could recommend me some strategies and
techniques to be more precise. In particular, if you could point me
some paper that does it, that would be great.

My best regards,
Gabriel.

[toc] | [next] | [standalone]

#139

From	glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date	2011-06-07 09:04 +0000
Message-ID	<11-06-013@comp.compilers>
In reply to	#136

Gabriel Quadros <gabrielquadros@hotmail.com> wrote:

>    I am trying to implement a pass to detect information leak in
> programs.  The problem is a variation of static tainted-flow analysis:
> I have some source functions, sink functions and sanitizers. I want to
> know if it is possible for data to flow from source to sink without
> going across a sanitizer.
(snip)
> In particular, if you could point me some paper that does it,
> that would be great.

It isn't exactly the same, but I would start looking at the Java
class verifier.

Well, for one Java requires bounds checking, so you can be sure
that only references to the same array would leak.  Java also
requires the verifier to detect references that load/store the
wrong data type, such as treating a double as two ints.
(I believe it detects both stack and heap accesses.)

-- glen

[toc] | [prev] | [next] | [standalone]

#140

From	kym@kymhorsell.com
Date	2011-06-08 07:53 +0000
Message-ID	<11-06-014@comp.compilers>
In reply to	#136

Gabriel Quadros <gabrielquadros@hotmail.com> wrote:
...
> So, the problem is that it is hard to know that c != b and i != j.
> Once information flows into memory, the safest thing to do is to flag
> the whole memory as a SOURCE. Of course, that is very conservative. I
> was wondering if you guys could recommend me some strategies and
> techniques to be more precise. In particular, if you could point me
> some paper that does it, that would be great.
...

Something that may be of use.

You can create a hash value for address expressions and memory
accesses using a simple hash scheme.

Each constant in an expression is represented by itself.
Each variable is assigned a random number.
All operations + * / and performed modulo p (a nice big prime; there are
better choices if you're working with mixed integer and real expressions
e.g. if you want i**2 and exp(i*pi) to hash to -1 (i.e. p-1)).

Assignments copy the hash from one variable to another.

A pointer access can be modeled by generating a PRN using the hash
of the address as the seed.

If 2 hash values are different there is a strong probability the 2
expressions are not identical.

Of course, put this under an optimisation and give warnings in the manual. :)

[toc] | [prev] | [next] | [standalone]

#142

From	George Neuner <gneuner2@comcast.net>
Date	2011-06-09 18:51 -0400
Message-ID	<11-06-016@comp.compilers>
In reply to	#136

On Mon, 6 Jun 2011 21:00:41 -0700 (PDT), Gabriel Quadros
<gabrielquadros@hotmail.com> wrote:

>    I am trying to implement a pass to detect information leak in
>programs.  The problem is a variation of static tainted-flow analysis:
>I have some source functions, sink functions and sanitizers. I want to
>know if it is possible for data to flow from source to sink without
>going across a sanitizer.
>  :
>a = SOURCE
>b = malloc(100)
>...
>b[i] = a
>...
>SINK = c[j]
>...
>
>So, the problem is that it is hard to know that c != b and i != j.
>Once information flows into memory, the safest thing to do is to flag
>the whole memory as a SOURCE. Of course, that is very conservative. I
>was wondering if you guys could recommend me some strategies and
>techniques to be more precise. In particular, if you could point me
>some paper that does it, that would be great.

Hi Gabriel,

A general solution, I believe, is intractable.

Using SSA or some kind of value numbering, it theoretically would be
possible - though not easy - to keep track of all generated non-local
addresses and compare the addresses accessed via disjoint program
paths to see if there is overlap.

Unfortunately, there are (at least) 4 problems with that.  The first
is that the addresses the compiler works with are symbolic: a name
plus offset ... and they may be more involved to store and compare
than a simple numerical address.

The second, related, problem is that global and heap addresses often
will be of a form like "$DATA+0xFA3892", having no reference to any
recognizable program data ... you'll need to track or to figure out
the boundaries of your globals and heap allocations in terms of the
compiler's symbolic addresses in order to give meaningful error
messages.

The third problem is you'll need to process the entire program before
looking for leaks.  The storage needed to retain all the address lists
may be extremely large and if the language permits separate
compilation, you'll need to ensure that all of the program has been
processed.

And the fourth problem is identifying the disjoint program paths. This
can be done using any of the number of flow and dependency graphs that
the compiler generates ... but the compiler is more interested in
following individual serial execution paths and generally isn't much
interested in the extent of multiple path disjointness.

WRT literature, I'm not familiar with any dealing precisely with the
kind of *internal* "hidden channel" security issues you are addressing
... there is quite a bit about leakage outside of a program.

The best I can suggest is that you investigate array optimizations, in
particular looking at the extensive work that has been done toward
identifying array index aliasing.  Your particular problem, though
more expansive, is directly analogous to the array aliasing problem
(treating all of process memory as an array).

George

[toc] | [prev] | [next] | [standalone]

#143

From	Martin Ward <martin@gkc.org.uk>
Date	2011-06-12 12:11 +0100
Message-ID	<11-06-017@comp.compilers>
In reply to	#136

On Tuesday 07 Jun 2011 at 05:00, Gabriel Quadros <gabrielquadros@hotmail.com>
wrote:
>     I am using LLVM, and I am analyzing the LLVM bitcodes. My pass is
> working well, but I am having some issues with memory. Once
> information flows to the heap, it is hard to know how it propagates to
> the rest of the program.  Example:
>
> a = SOURCE
> b = malloc(100)
> ...
> b[i] = a
> ...
> SINK = c[j]
> ...
>
> So, the problem is that it is hard to know that c != b and i != j.
> Once information flows into memory, the safest thing to do is to flag
> the whole memory as a SOURCE. Of course, that is very conservative. I
> was wondering if you guys could recommend me some strategies and
> techniques to be more precise. In particular, if you could point me
> some paper that does it, that would be great.

I suggest that you look at the research on "points-to analysis" with regions:
this attempts to divide the memory of the machine into disjoint regions
where each pointer can be guaranteed to address one or more of these regions.
As long as the SOURCE pointers are in different regions to the SINK pointers,
you should be OK.

A possible starting point:
"Putting Pointer Analysis to Work" (1998)
by Rakesh Ghiya ,  Laurie J. Hendren
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.40.9725

--
			Martin

STRL Reader in Software Engineering and Royal Society Industry Fellow
martin@gkc.org.uk  http://www.cse.dmu.ac.uk/~mward/  Erdos number: 4
G.K.Chesterton web site: http://www.cse.dmu.ac.uk/~mward/gkc/
Mirrors:  http://www.gkc.org.uk  and  http://www.gkc.org.uk/gkc

[toc] | [prev] | [standalone]

csiph-web

Dealing with load/store instructions on static tainted flow analysis

Contents

#136 — Dealing with load/store instructions on static tainted flow analysis

#139

#140

#142

#143