Groups > comp.compilers > #141 > unrolled thread

Inverse grep

Started by	glen herrmannsfeldt <gah@ugcs.caltech.edu>
First post	2011-06-08 23:01 +0000
Last post	2012-05-28 07:50 -0700
Articles	8 — 6 participants

Back to article view | Back to comp.compilers

  Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-08 23:01 +0000
    Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-12 14:16 -0400
      Re: Inverse grep Tony Finch <dot@dotat.at> - 2011-06-13 17:48 +0100
    Re: Inverse grep torbenm@diku.dk (Torben Ægidius Mogensen) - 2011-06-14 11:16 +0200
      Re: Inverse grep anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2011-06-15 12:28 +0000
      Matching very large patterns, was Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-19 21:45 -0400
        Re: Matching very large patterns, was Re: Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-20 04:52 +0000
    Re: Inverse grep Paul Wankadia <junyer@gmail.com> - 2012-05-28 07:50 -0700

#141 — Inverse grep

From	glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date	2011-06-08 23:01 +0000
Subject	Inverse grep
Message-ID	<11-06-015@comp.compilers>

I suppose this is a strange question, but I was wondering if
there was ever something like an inverse grep.  That is,
match a string against a file full of regular expressions.

Now, one could just read the file, compile the regex one at
a time, and do the match, but maybe there is another way.

-- glen
[If you want to know which pattern it was, there's flex which turns all
the patterns into one DFA with tags to know which one it was, or else
there's the perl "study" operator which pre-scans a string to make its
NFA matcher faster on subsequent runs against the same string. -John]

[toc] | [next] | [standalone]

#144

From	Chris F Clark <cfc@shell01.TheWorld.com>
Date	2011-06-12 14:16 -0400
Message-ID	<11-06-018@comp.compilers>
In reply to	#141

Actually, most "intrusion detection systems" (IDSes, e.g. Snort) and
virus scanners (e.g.  ClamAV) are essentially just that.  Given a
packet, the contents of the packet are inspected to see which if any
patterns are matched and if so, the relevant pattern [numbers] are
reported.  Generally, the technology used does not actually build a
single pattern from the | of the patterns and run that FA, because
there aren't known efficient solutions to that problem for large
pattern sets.  One either suffers space or time explosion to do so.

Hope this helps,
-Chris

******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
[On machines with gigabyte memories, does the space explosion for
large patterns still matter? -John]

[toc] | [prev] | [next] | [standalone]

#147

From	Tony Finch <dot@dotat.at>
Date	2011-06-13 17:48 +0100
Message-ID	<11-06-021@comp.compilers>
In reply to	#144

Chris F Clark <cfc@shell01.TheWorld.com> wrote:
>
>Actually, most "intrusion detection systems" (IDSes, e.g. Snort) and
>virus scanners (e.g.  ClamAV) are essentially just that.  Given a
>packet, the contents of the packet are inspected to see which if any
>patterns are matched and if so, the relevant pattern [numbers] are
>reported.

Another tool to look at is re2c which can be used with SpamAssassin to
compile its rule regexes.

Tony.
--
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
Faeroes, South-east Iceland: Variable 3 or 4 becoming north or northwest 4 or
5. Moderate. Occasional rain. Moderate or good.

[toc] | [prev] | [next] | [standalone]

#148

From	torbenm@diku.dk (Torben Ægidius Mogensen)
Date	2011-06-14 11:16 +0200
Message-ID	<11-06-022@comp.compilers>
In reply to	#141

glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:

> I suppose this is a strange question, but I was wondering if
> there was ever something like an inverse grep.  That is,
> match a string against a file full of regular expressions.
>
> Now, one could just read the file, compile the regex one at
> a time, and do the match, but maybe there is another way.
>
> [If you want to know which pattern it was, there's flex which turns all
> the patterns into one DFA with tags to know which one it was, or else
> there's the perl "study" operator which pre-scans a string to make its
> NFA matcher faster on subsequent runs against the same string. -John]

The best method depends on how many times you match the same set of
regexps against different strings.  If you use the same set with many
strings, it can pay to preprocess the regexps to, e.g., produce
automata (similar to what flex does), but if you match only a few
times, it is probably better to keep the regexps unchanged and use an
algorithm that matches a string to a regexp without first converting
the regexp to a DFA.  Converting each regexp to an NFA is fairly fast,
so this is a possibility, but if only a small part of each NFA is ever
used, it might be better to use the regexps directly using regular
expression derivatives.  Matching using regular expression derivatives
is generally a bit slower than matching with NFAs, but you don't have
to convert the regexps first.  So if you don't expect to match the set
of regexps to more than a handful of strings, this might be a better
approach.

As Chris said, the DFA for a union of a large number of regexps can
explode, so it is a risky business to use this approach, even if you
are going to match the regexps against many strings.  You can convert
the combined regexps to an NFA and do local reductions to keep the
size down, but the matching time is still nearly the same as matching
the NFAs for the individual regexps in sequence.  A compromise could
be to combine the NFAs from one end and stop when the resulting DFA
becomes too big and then start combining the remaining NFAs from
there.  In the worst case, you don't gain anything over keeping the
NFAs separate, though.

	Torben
[How serious is the explosion problem in practice, now that compilers
no longer have to fit into 64k? We use lex scanners with a thousand
tokens, and spamassassin feeds about 2000 patterns to re2c if you use
the compiled patterns feature. None of them seem to strain our storage
systems. -John]

[toc] | [prev] | [next] | [standalone]

#150

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2011-06-15 12:28 +0000
Message-ID	<11-06-024@comp.compilers>
In reply to	#148

torbenm@diku.dk (Torben Fgidius Mogensen) writes:
>glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:
>
>> I suppose this is a strange question, but I was wondering if
>> there was ever something like an inverse grep.  That is,
>> match a string against a file full of regular expressions. ...

>[How serious is the explosion problem in practice, now that compilers
>no longer have to fit into 64k? We use lex scanners with a thousand
>tokens, and spamassassin feeds about 2000 patterns to re2c if you use
>the compiled patterns feature. None of them seem to strain our storage
>systems. -John]

Even if it was a problem, the solution has been around a long time
(although I am not aware of an academic paper about it): Generate the
DFA on-demand, only for those transitions that are actually used.
Then the worst-case memory demands are O(n) where n is the length of
the input string (and in the usual case the memory usage grows much
slower).

In tree-parsing automata (i.e., not for regexps) I have had cases
where the size of the statically generated automaton was a problem
<https://www.complang.tuwien.ac.at/anton/lazyburg/gforth9.burg>: We
could not generate the automata for the largest tree grammars in
practical time and even had problems compiling some of the code that
we could generate.  OTOH, on-demand generation of automaton states
worked even for the largest tree grammars, and memory consumption was
on the order of 2MB IIRC.  The numbers will be different for regexps,
but I expect the effect to be the same: on-demand automata will work
in many cases where static automata take too long to generate or are
too big to run.  The paper about this work is
<http://www.complang.tuwien.ac.at/papers/ertl+06pldi.ps.gz>

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

[toc] | [prev] | [next] | [standalone]

#159 — Matching very large patterns, was Re: Inverse grep

From	Chris F Clark <cfc@shell01.TheWorld.com>
Date	2011-06-19 21:45 -0400
Subject	Matching very large patterns, was Re: Inverse grep
Message-ID	<11-06-033@comp.compilers>
In reply to	#148

Our moderator asked:
> How serious is the explosion problem in practice, now that compilers
> no longer have to fit into 64k? We use lex scanners with a thousand
> tokens, and spamassassin feeds about 2000 patterns to re2c if you use
> the compiled patterns feature. None of them seem to strain our storage
> systems. -John

For compilers on mainline chips this is generally not a problem.
However, this is such a useful tool, it gets used in areas where the
problems are much-much larger (or the device is smaller).  The
Aho-Corasick algorithm is one of the key algorithms in matching a set
of strings to a large set of incoming data. The two places where I've
seen it used and where size is a problem is networking and
computational biology.

I know less about the computational biology use-cases so I'll dispense
with that first.  As I understand it, the basic idea in DNA and
protein databases is that you have long lists of strings that
represent know genomes or protein foldings, and you match your unknown
DNA or protein against that and see what matches you get.  The number
of known genomes and protein foldings is very large, on the order of
millions.  It is probable that generating the DFA on the fly might
work in that application as Anton suggests, since one can generally
apply a mainstream processor to the problem and one isn't resource
constrained.

On the networking side, there are numerous applications of this
problem. It's the case where one has a hammer, every problem looks
like a nail.  The unix world loves writing tools that match regexes or
sets of strings or some nearly equivalent notation (e.g. ACLs, globs,
URL fragments) to the traffic.

The first case is anti-virus (AV).  The number of virus signatures is
again in the millions.  This is not enough to make the database "too
big" in the sense that it won't fit in memory, but it does make it too
big to fit in cache.  So, although we can construct a DFA that fits in
memory to represent the problem, the resulting DFA doesn't fit in
cache and ends up thrashing the cache.  This adversely effects the
processor performance, as modern processors run many times fater than
main memory speeds and depend on caching to keep that performance
balance.  These AV patterns are effective cache-walkers and any size
of cache that the processor is provisioned with gets exceeded.

In addition, the databasse is growing at a phenomenal rate.  New
viruses are added at the rate of several per hour and that rate is
increaing.  (I believe I've seen that a new virus signature is added
every 9 minutes on average.)  In fact, the rate of new virus detecting
is so fast, that it isn't done by humans any more.  Machine learning
of the viruses from honeypots is becoming the only way to keep up with
the rate of virus growth.  The fact that the hackers are now using
generators to create new unique polymorphic viruses for each attack
makes this fact unsurprising--automation is balanced by automation.

To lessen the dependence on AV, black-listing and white-listing of web
sites has become popular.  However, the numbers of URLs is still
sufficiently large that it presents a similar performance issue.  The
problem in this case is exacerbated by the fact that the list of URLs
is not a static database kept on the client, but instead is gotten
from a web server.  In this case, building the DFA on the fly might be
pratical because the cost is driven by the downloading of the URLs off
the web.

Intrusion detection and prevention systems (IDS/IPS) are a smaller but
similar problem, with a database on the order of 100k patterns with
many of the patterns requiring back-references and other non-DFA
functionality.  Although cache thrashing is still somewhat of a
problem if we solve the problem on the CPU, the bigger issue is that
this problem would like to me moved out of the central processor to
the network edge (i.e. the NIC) so that the traffic is never brought
into the main memory system at all, as even the cost of moving the
traffic onto the memory is a signficant load.

At the edge we experience must harder contraints, 100kB is still a lot
of memory to put into a peripheral device and typical DFA sizes are on
the order of MBs.  For example, when I first investigated Snort a few
years ago, the DFA compiled to 160MB, but the target device we wanted
to build would have had a memory size of 8-32kB.  Even a pure NFA
representation of the Snort DB wouldn't fit in that target.

The current favorite algorithm that every network device is trying to
solve is applicaton identification (often called Deep Packet
Inspection or DPI).  The easiest way to envision it, is presume you
have written a suite of compilers, C, C++, C#, Java, Pascal, PL/I,
Haskell, ML, F#, Lisp, Scheme, etc. and for any given program, you run
all the compilers in parallel and if one of them compiles the program
successfully, you say that is the language the program is written in.

Now, that method is untenable as the sum of the grammars is simply too
large and instead one has regular expressions that look for typical
distinctions in the protocols, just as virus signatures look for
specific patterns that are part of the virus or IDS systems look
typical for attack vectors.  The resulting set of regular expressions
is similar in size to the IDS problem and again the target is the
network edge.

As a result of these memory limitations, there is a fairly active
community trying to find new and novel ways of representing regular
expressions that have the best characteristics of NFAs (linear size
growth) and DFAs (linear processing time).  Some clever ways of
avoiding DFA explosion have been found for important common cases.

If you find this area potentially interesting, I would recommend
looking up the papers by Michela Becchi and the work at IBM on
dot-star as two good places to start further investigation.

It's an interesting area to work because the desire to push this into
smaller, cheaper, less power hungry devices has just begun.  Today we
want these algorithms to run in routers (including SOHO routers bought
off-the-shelf). The desire to run them in smartphones and tablets is
appearing. You want your cars' network to be scure, so they will be
needed in the SOCs for that.  Beyond that, there are these
micro-devices called "motes" which are basically a radio and some
sensors that want to be secure. Eventually, it is possible that a
processor for this area could be embedded in your "smart" credit-card.

Hope this helps,
-Chris

******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris

[toc] | [prev] | [next] | [standalone]

#162 — Re: Matching very large patterns, was Re: Inverse grep

From	glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date	2011-06-20 04:52 +0000
Subject	Re: Matching very large patterns, was Re: Inverse grep
Message-ID	<11-06-036@comp.compilers>
In reply to	#159

Chris F Clark <cfc@shell01.theworld.com> wrote:
> Our moderator asked:
>> How serious is the explosion problem in practice, now that compilers
>> no longer have to fit into 64k? We use lex scanners with a thousand
>> tokens, and spamassassin feeds about 2000 patterns to re2c if you use
>> the compiled patterns feature. None of them seem to strain our storage
>> systems. -John

> For compilers on mainline chips this is generally not a problem.
> However, this is such a useful tool, it gets used in areas where the
> problems are much-much larger (or the device is smaller).  The
> Aho-Corasick algorithm is one of the key algorithms in matching a set
> of strings to a large set of incoming data. The two places where I've
> seen it used and where size is a problem is networking and
> computational biology.

The favorite for computational biology for many years was dynamic
programming, though often it is too slow.  The dynamic programming
algorithm used for diff originated in the one Needleman-Wunsch used
for protein sequences.  (I believe that it is referenced in the diff
documentation.)

More recently, the Burrows-Wheeler transform, commonly used in
compression, has become popular for computational biology.

BLAST is based on a DFA similar to the first part of Aho-Corasick, but
dynamic programming is used instead of the usual second part.

-- glen

[toc] | [prev] | [next] | [standalone]

#655

From	Paul Wankadia <junyer@gmail.com>
Date	2012-05-28 07:50 -0700
Message-ID	<12-05-026@comp.compilers>
In reply to	#141

On Jun 9 2011, 9:01 am, glen herrmannsfeldt <gah@ugcs.caltech.edu>
wrote:

> I suppose this is a strange question, but I was wondering if
> there was ever something like an inverse grep.  That is,
> match a string against a file full of regular expressions.
>
> Now, one could just read the file, compile the regex one at
> a time, and do the match, but maybe there is another way.

The best way to match lots of regular expressions is not to match lots
of regular expressions. :)

In a nutshell, extract the fixed substrings of the regular
expressions, identify which of the fixed substrings (if any) are
present in the input string and determine which of the regular
expressions (if any) are worth trying to match.

The RE2 library includes this functionality, but you will need to BYO
Aho-Corasick implementation. Refer to http://code.google.com/p/re2/source/browse/re2/filtered_re2.h.

[toc] | [prev] | [standalone]

csiph-web

Inverse grep

Contents

#141 — Inverse grep

#144

#147

#148

#150

#159 — Matching very large patterns, was Re: Inverse grep

#162 — Re: Matching very large patterns, was Re: Inverse grep

#655