Groups | Search | Server Info | Keyboard shortcuts | Login | Register


Groups > comp.compilers > #148

Re: Inverse grep

Path csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!selfless.tophat.at!news.glorb.com!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!nntp.giganews.com!novia!news-out.readnews.com!news-xxxfer.readnews.com!news.misty.com!news.iecc.com!nerds-end
From torbenm@diku.dk (Torben Ægidius Mogensen)
Newsgroups comp.compilers
Subject Re: Inverse grep
Date Tue, 14 Jun 2011 11:16:52 +0200
Organization SunSITE.dk - Supporting Open source
Lines 46
Sender news@iecc.com
Approved comp.compilers@iecc.com
Message-ID <11-06-022@comp.compilers> (permalink)
References <11-06-015@comp.compilers>
NNTP-Posting-Host news.iecc.com
X-Trace gal.iecc.com 1308096910 24846 64.57.183.58 (15 Jun 2011 00:15:10 GMT)
X-Complaints-To abuse@iecc.com
NNTP-Posting-Date Wed, 15 Jun 2011 00:15:10 +0000 (UTC)
Keywords lex, performance, comment
Posted-Date 14 Jun 2011 20:15:10 EDT
X-submission-address compilers@iecc.com
X-moderator-address compilers-request@iecc.com
X-FAQ-and-archives http://compilers.iecc.com
Xref x330-a1.tempe.blueboxinc.net comp.compilers:148

Show key headers only | View raw


glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:

> I suppose this is a strange question, but I was wondering if
> there was ever something like an inverse grep.  That is,
> match a string against a file full of regular expressions.
>
> Now, one could just read the file, compile the regex one at
> a time, and do the match, but maybe there is another way.
>
> [If you want to know which pattern it was, there's flex which turns all
> the patterns into one DFA with tags to know which one it was, or else
> there's the perl "study" operator which pre-scans a string to make its
> NFA matcher faster on subsequent runs against the same string. -John]

The best method depends on how many times you match the same set of
regexps against different strings.  If you use the same set with many
strings, it can pay to preprocess the regexps to, e.g., produce
automata (similar to what flex does), but if you match only a few
times, it is probably better to keep the regexps unchanged and use an
algorithm that matches a string to a regexp without first converting
the regexp to a DFA.  Converting each regexp to an NFA is fairly fast,
so this is a possibility, but if only a small part of each NFA is ever
used, it might be better to use the regexps directly using regular
expression derivatives.  Matching using regular expression derivatives
is generally a bit slower than matching with NFAs, but you don't have
to convert the regexps first.  So if you don't expect to match the set
of regexps to more than a handful of strings, this might be a better
approach.

As Chris said, the DFA for a union of a large number of regexps can
explode, so it is a risky business to use this approach, even if you
are going to match the regexps against many strings.  You can convert
the combined regexps to an NFA and do local reductions to keep the
size down, but the matching time is still nearly the same as matching
the NFAs for the individual regexps in sequence.  A compromise could
be to combine the NFAs from one end and stop when the resulting DFA
becomes too big and then start combining the remaining NFAs from
there.  In the worst case, you don't gain anything over keeping the
NFAs separate, though.

	Torben
[How serious is the explosion problem in practice, now that compilers
no longer have to fit into 64k? We use lex scanners with a thousand
tokens, and spamassassin feeds about 2000 patterns to re2c if you use
the compiled patterns feature. None of them seem to strain our storage
systems. -John]

Back to comp.compilers | Previous | NextPrevious in thread | Next in thread | Find similar


Thread

Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-08 23:01 +0000
  Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-12 14:16 -0400
    Re: Inverse grep Tony Finch <dot@dotat.at> - 2011-06-13 17:48 +0100
  Re: Inverse grep torbenm@diku.dk (Torben Ægidius Mogensen) - 2011-06-14 11:16 +0200
    Re: Inverse grep anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2011-06-15 12:28 +0000
    Matching very large patterns, was Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-19 21:45 -0400
      Re: Matching very large patterns, was Re: Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-20 04:52 +0000

csiph-web