
Re: Inverse grep

From: torbenm@diku.dk (Torben Ægidius Mogensen)
Newsgroups: comp.compilers
Subject: Re: Inverse grep
Date: 2011-06-14 11:16 +0200
Organization: SunSITE.dk - Supporting Open source
Message-ID: <11-06-022@comp.compilers>
References: <11-06-015@comp.compilers>

glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:

> I suppose this is a strange question, but I was wondering if
> there was ever something like an inverse grep.  That is,
> match a string against a file full of regular expressions.
>
> Now, one could just read the file, compile the regex one at
> a time, and do the match, but maybe there is another way.
>
> [If you want to know which pattern it was, there's flex which turns all
> the patterns into one DFA with tags to know which one it was, or else
> there's the perl "study" operator which pre-scans a string to make its
> NFA matcher faster on subsequent runs against the same string. -John]
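[For concreteness, the straightforward approach glen describes — read the file, compile the regexps one at a time, and try each against the string — can be sketched in a few lines of Python; the function name and the example patterns below are invented purely for illustration. -ed]

```python
import re

def inverse_grep(string, pattern_lines):
    """Match one string against a list of regexps, one at a time.

    Returns (line_number, pattern) for every pattern that matches,
    so you know which pattern it was.  Each pattern is compiled
    fresh; nothing is precomputed or shared between patterns.
    """
    hits = []
    for lineno, pat in enumerate(pattern_lines, 1):
        if re.search(pat, string):
            hits.append((lineno, pat))
    return hits

# Example: as if the pattern file held these three lines.
print(inverse_grep("abc123", ["[0-9]+", "^xyz", "abc"]))
```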

The best method depends on how many times you match the same set of
regexps against different strings.  If you use the same set with many
strings, it can pay to preprocess the regexps to, e.g., produce
automata (similar to what flex does), but if you match only a few
times, it is probably better to keep the regexps unchanged and use an
algorithm that matches a string to a regexp without first converting
the regexp to a DFA.  Converting each regexp to an NFA is fairly fast,
so this is a possibility, but if only a small part of each NFA is
ever used, it might be better to work with the regexps directly,
using regular expression derivatives.  Matching with derivatives is
generally a bit slower than matching with NFAs, but you don't have
to convert the regexps first.  So if you don't expect to match the
set of regexps against more than a handful of strings, this might be
the better approach.
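[A minimal sketch of Brzozowski-style derivative matching in Python; the tiny AST and the names below are mine, not from any library. The derivative of a regexp r with respect to a character c matches exactly the remainders of strings in L(r) that begin with c, so a string matches r iff the regexp left after taking derivatives for all its characters is nullable. -ed]

```python
class Empty: pass                 # matches nothing
class Eps: pass                   # matches only the empty string
class Lit:
    def __init__(self, c): self.c = c
class Alt:
    def __init__(self, a, b): self.a, self.b = a, b
class Cat:
    def __init__(self, a, b): self.a, self.b = a, b
class Star:
    def __init__(self, a): self.a = a

def nullable(r):
    """Does r accept the empty string?"""
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, Alt): return nullable(r.a) or nullable(r.b)
    if isinstance(r, Cat): return nullable(r.a) and nullable(r.b)
    return False                  # Empty, Lit

def deriv(r, c):
    """Brzozowski derivative of r with respect to character c."""
    if isinstance(r, Lit):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, Cat):
        d = Cat(deriv(r.a, c), r.b)
        return Alt(d, deriv(r.b, c)) if nullable(r.a) else d
    if isinstance(r, Star):
        return Cat(deriv(r.a, c), Star(r.a))
    return Empty()                # Empty, Eps

def matches(r, s):
    for c in s:                   # no automaton is ever built
        r = deriv(r, c)
    return nullable(r)

# Example: the regexp (ab)*a.
r = Cat(Star(Cat(Lit('a'), Lit('b'))), Lit('a'))
```

(A production version would simplify the regexps between steps — e.g. Alt(Empty, x) = x — to keep them from growing; this sketch omits that.)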

As Chris said, the DFA for a union of a large number of regexps can
explode, so it is a risky business to use this approach, even if you
are going to match the regexps against many strings.  You can convert
the combined regexps to an NFA and do local reductions to keep the
size down, but the matching time is still nearly the same as matching
the NFAs for the individual regexps in sequence.  A compromise could
be to combine the NFAs from one end, stop when the resulting DFA
becomes too big, and then start combining the remaining NFAs into a
new automaton.  In the worst case, you don't gain anything over
keeping the NFAs separate, though.
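[The combine-into-one-alternation idea, including flex-style tagging to report which pattern matched, can be sketched with Python's stock re module (a backtracker, not a DFA, but the tagging works the same way). The pattern list is invented; note this trick only works as written if the individual patterns contain no groups of their own. -ed]

```python
import re

# Wrap each pattern in its own group and join with '|', so the
# group index of the match tells you which pattern fired -- the
# same idea as flex tagging its combined DFA.
patterns = [r"[0-9]+", r"[a-z]+", r"ERROR:.*"]
combined = re.compile("|".join("(%s)" % p for p in patterns))

def which_pattern(s):
    """Return the index of the matching pattern, or None."""
    m = combined.fullmatch(s)
    if m is None:
        return None
    return m.lastindex - 1        # groups are numbered from 1
```

(If the patterns may contain their own groups, named groups or re's per-pattern group bookkeeping would be needed instead of bare lastindex.)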

	Torben
[How serious is the explosion problem in practice, now that compilers
no longer have to fit into 64k? We use lex scanners with a thousand
tokens, and spamassassin feeds about 2000 patterns to re2c if you use
the compiled patterns feature. None of them seem to strain our storage
systems. -John]

Thread

Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-08 23:01 +0000
  Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-12 14:16 -0400
    Re: Inverse grep Tony Finch <dot@dotat.at> - 2011-06-13 17:48 +0100
  Re: Inverse grep torbenm@diku.dk (Torben Ægidius Mogensen) - 2011-06-14 11:16 +0200
    Re: Inverse grep anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2011-06-15 12:28 +0000
    Matching very large patterns, was Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-19 21:45 -0400
      Re: Matching very large patterns, was Re: Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-20 04:52 +0000
