| From | torbenm@diku.dk (Torben Ægidius Mogensen) |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Re: Inverse grep |
| Date | 2011-06-14 11:16 +0200 |
| Organization | SunSITE.dk - Supporting Open source |
| Message-ID | <11-06-022@comp.compilers> (permalink) |
| References | <11-06-015@comp.compilers> |
glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:

> I suppose this is a strange question, but I was wondering if
> there was ever something like an inverse grep. That is,
> match a string against a file full of regular expressions.
>
> Now, one could just read the file, compile the regex one at
> a time, and do the match, but maybe there is another way.
>
> [If you want to know which pattern it was, there's flex which turns all
> the patterns into one DFA with tags to know which one it was, or else
> there's the perl "study" operator which pre-scans a string to make its
> NFA matcher faster on subsequent runs against the same string. -John]

The best method depends on how many times you match the same set of regexps against different strings. If you use the same set with many strings, it can pay to preprocess the regexps to, e.g., produce automata (similar to what flex does), but if you match only a few times, it is probably better to keep the regexps unchanged and use an algorithm that matches a string against a regexp without first converting the regexp to a DFA.

Converting each regexp to an NFA is fairly fast, so this is one possibility, but if only a small part of each NFA is ever used, it may be better to work on the regexps directly using regular expression derivatives. Matching with derivatives is generally a bit slower than matching with NFAs, but you don't have to convert the regexps first. So if you don't expect to match the set of regexps against more than a handful of strings, this may be the better approach.

As Chris said, the DFA for a union of a large number of regexps can explode, so that approach is risky even if you are going to match the regexps against many strings. You can convert the combined regexps to an NFA and do local reductions to keep the size down, but the matching time is still nearly the same as matching the NFAs for the individual regexps in sequence.
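The derivative-based matching Torben mentions can be sketched in a few lines. This is a minimal illustration of Brzozowski derivatives, not code from the thread; the AST constructors and function names are my own. A derivative of a regexp r with respect to a character c is a regexp matching exactly the suffixes s such that r matches c·s, so matching reduces to taking derivatives character by character and testing nullability at the end:

```python
# Minimal sketch of matching by regular expression derivatives
# (Brzozowski); the AST and names here are illustrative, not standard.
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:          # matches nothing
    pass

@dataclass(frozen=True)
class Eps:            # matches the empty string
    pass

@dataclass(frozen=True)
class Chr:            # matches a single character
    c: str

@dataclass(frozen=True)
class Cat:            # concatenation
    a: object
    b: object

@dataclass(frozen=True)
class Alt:            # union
    a: object
    b: object

@dataclass(frozen=True)
class Star:           # Kleene star
    a: object

def nullable(r):
    """Does r match the empty string?"""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Chr)):
        return False
    if isinstance(r, Cat):
        return nullable(r.a) and nullable(r.b)
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)

def deriv(r, c):
    """Derivative of r with respect to character c."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, Star):
        return Cat(deriv(r.a, c), r)
    if isinstance(r, Cat):
        d = Cat(deriv(r.a, c), r.b)
        # If r.a can match empty, c may also start a match of r.b.
        return Alt(d, deriv(r.b, c)) if nullable(r.a) else d

def matches(r, s):
    """Take a derivative per character, then test nullability."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (ab)*a
r = Cat(Star(Cat(Chr('a'), Chr('b'))), Chr('a'))
print(matches(r, "aba"))   # True
print(matches(r, "abab"))  # False
```

No automaton is ever built, which is why this pays off when each regexp is matched against only a handful of strings; the cost is that the derivative terms can grow unless you simplify them (e.g., collapse Alt(Empty, x) to x) as you go.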
A compromise could be to combine the NFAs from one end and stop when the resulting DFA becomes too big, then start combining the remaining NFAs from there. In the worst case, you don't gain anything over keeping the NFAs separate, though.

Torben

[How serious is the explosion problem in practice, now that compilers no longer have to fit into 64K? We use lex scanners with a thousand tokens, and spamassassin feeds about 2000 patterns to re2c if you use the compiled patterns feature. None of them seem to strain our storage systems. -John]
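The batching compromise can be illustrated with ordinary library regexps. This is a rough sketch under my own assumptions, not code from the thread: Python's re is a backtracking engine rather than a DFA, so the bound here is simply a branch count per combined pattern rather than a DFA state count, but the structure of the idea is the same, and named groups recover which pattern matched, as flex's tags would:

```python
# Sketch of matching one string against many patterns by OR-ing them
# into bounded batches; names and the batch bound are illustrative.
import re

def make_batches(patterns, max_batch=50):
    """Combine patterns into alternations of at most max_batch branches."""
    batches = []
    for i in range(0, len(patterns), max_batch):
        chunk = patterns[i:i + max_batch]
        # (?P<p7>...) tags branch 7 so we can tell which pattern fired.
        combined = "|".join(
            "(?P<p%d>%s)" % (i + j, p) for j, p in enumerate(chunk))
        batches.append(re.compile(combined))
    return batches

def first_match(batches, s):
    """Index of the first pattern matching s, or None."""
    for rx in batches:
        m = rx.match(s)
        if m:
            # lastgroup names the alternation branch that matched.
            return int(m.lastgroup[1:])
    return None

pats = ["ab+", "a+b", "c.*d"]
batches = make_batches(pats, max_batch=2)
print(first_match(batches, "aab"))   # 1
print(first_match(batches, "cxxd"))  # 2
print(first_match(batches, "zzz"))   # None
```

With a DFA-based engine the stopping criterion would instead be the state count of the combined automaton, which is where the worst case Torben mentions appears: if every added pattern blows up the product, each batch ends up holding one pattern and you are back to matching them in sequence.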
Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-08 23:01 +0000
Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-12 14:16 -0400
Re: Inverse grep Tony Finch <dot@dotat.at> - 2011-06-13 17:48 +0100
Re: Inverse grep torbenm@diku.dk (Torben Ægidius Mogensen) - 2011-06-14 11:16 +0200
Re: Inverse grep anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2011-06-15 12:28 +0000
Matching very large patterns, was Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-19 21:45 -0400
Re: Matching very large patterns, was Re: Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-20 04:52 +0000