Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #141 > unrolled thread
| Started by | glen herrmannsfeldt <gah@ugcs.caltech.edu> |
|---|---|
| First post | 2011-06-08 23:01 +0000 |
| Last post | 2012-05-28 07:50 -0700 |
| Articles | 8 — 6 participants |
Back to article view | Back to comp.compilers
Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-08 23:01 +0000
Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-12 14:16 -0400
Re: Inverse grep Tony Finch <dot@dotat.at> - 2011-06-13 17:48 +0100
Re: Inverse grep torbenm@diku.dk (Torben Ægidius Mogensen) - 2011-06-14 11:16 +0200
Re: Inverse grep anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2011-06-15 12:28 +0000
Matching very large patterns, was Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-19 21:45 -0400
Re: Matching very large patterns, was Re: Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-20 04:52 +0000
Re: Inverse grep Paul Wankadia <junyer@gmail.com> - 2012-05-28 07:50 -0700
| From | glen herrmannsfeldt <gah@ugcs.caltech.edu> |
|---|---|
| Date | 2011-06-08 23:01 +0000 |
| Subject | Inverse grep |
| Message-ID | <11-06-015@comp.compilers> |
I suppose this is a strange question, but I was wondering if there was ever something like an inverse grep. That is, match a string against a file full of regular expressions. Now, one could just read the file, compile the regex one at a time, and do the match, but maybe there is another way. -- glen [If you want to know which pattern it was, there's flex which turns all the patterns into one DFA with tags to know which one it was, or else there's the perl "study" operator which pre-scans a string to make its NFA matcher faster on subsequent runs against the same string. -John]
[toc] | [next] | [standalone]
| From | Chris F Clark <cfc@shell01.TheWorld.com> |
|---|---|
| Date | 2011-06-12 14:16 -0400 |
| Message-ID | <11-06-018@comp.compilers> |
| In reply to | #141 |
Actually, most "intrusion detection systems" (IDSes, e.g. Snort) and virus scanners (e.g. ClamAV) are essentially just that. Given a packet, the contents of the packet are inspected to see which if any patterns are matched and if so, the relevant pattern [numbers] are reported. Generally, the technology used does not actually build a single pattern from the | of the patterns and run that FA, because there aren't known efficient solutions to that problem for large pattern sets. One either suffers space or time explosion to do so. Hope this helps, -Chris ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris [On machines with gigabyte memories, does the space explosion for large patterns still matter? -John]
[toc] | [prev] | [next] | [standalone]
| From | Tony Finch <dot@dotat.at> |
|---|---|
| Date | 2011-06-13 17:48 +0100 |
| Message-ID | <11-06-021@comp.compilers> |
| In reply to | #144 |
Chris F Clark <cfc@shell01.TheWorld.com> wrote: > >Actually, most "intrusion detection systems" (IDSes, e.g. Snort) and >virus scanners (e.g. ClamAV) are essentially just that. Given a >packet, the contents of the packet are inspected to see which if any >patterns are matched and if so, the relevant pattern [numbers] are >reported. Another tool to look at is re2c which can be used with SpamAssassin to compile its rule regexes. Tony. -- f.anthony.n.finch <dot@dotat.at> http://dotat.at/ Faeroes, South-east Iceland: Variable 3 or 4 becoming north or northwest 4 or 5. Moderate. Occasional rain. Moderate or good.
[toc] | [prev] | [next] | [standalone]
| From | torbenm@diku.dk (Torben Ægidius Mogensen) |
|---|---|
| Date | 2011-06-14 11:16 +0200 |
| Message-ID | <11-06-022@comp.compilers> |
| In reply to | #141 |
glen herrmannsfeldt <gah@ugcs.caltech.edu> writes: > I suppose this is a strange question, but I was wondering if > there was ever something like an inverse grep. That is, > match a string against a file full of regular expressions. > > Now, one could just read the file, compile the regex one at > a time, and do the match, but maybe there is another way. > > [If you want to know which pattern it was, there's flex which turns all > the patterns into one DFA with tags to know which one it was, or else > there's the perl "study" operator which pre-scans a string to make its > NFA matcher faster on subsequent runs against the same string. -John] The best method depends on how many times you match the same set of regexps against different strings. If you use the same set with many strings, it can pay to preprocess the regexps to, e.g., produce automata (similar to what flex does), but if you match only a few times, it is probably better to keep the regexps unchanged and use an algorithm that matches a string to a regexp without first converting the regexp to a DFA. Converting each regexp to an NFA is fairly fast, so this is a possibility, but if only a small part of each NFA is ever used, it might be better to use the regexps directly using regular expression derivatives. Matching using regular expression derivatives is generally a bit slower than matching with NFAs, but you don't have to convert the regexps first. So if you don't expect to match the set of regexps to more than a handful of strings, this might be a better approach. As Chris said, the DFA for a union of a large number of regexps can explode, so it is a risky business to use this approach, even if you are going to match the regexps against many strings. You can convert the combined regexps to an NFA and do local reductions to keep the size down, but the matching time is still nearly the same as matching the NFAs for the individual regexps in sequence. A compromise could be to combine the NFAs from one end and stop when the resulting DFA becomes too big and then start combining the remaining NFAs from there. In the worst case, you don't gain anything over keeping the NFAs separate, though. Torben [How serious is the explosion problem in practice, now that compilers no longer have to fit into 64k? We use lex scanners with a thousand tokens, and spamassassin feeds about 2000 patterns to re2c if you use the compiled patterns feature. None of them seem to strain our storage systems. -John]
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2011-06-15 12:28 +0000 |
| Message-ID | <11-06-024@comp.compilers> |
| In reply to | #148 |
torbenm@diku.dk (Torben Fgidius Mogensen) writes: >glen herrmannsfeldt <gah@ugcs.caltech.edu> writes: > >> I suppose this is a strange question, but I was wondering if >> there was ever something like an inverse grep. That is, >> match a string against a file full of regular expressions. ... >[How serious is the explosion problem in practice, now that compilers >no longer have to fit into 64k? We use lex scanners with a thousand >tokens, and spamassassin feeds about 2000 patterns to re2c if you use >the compiled patterns feature. None of them seem to strain our storage >systems. -John] Even if it was a problem, the solution has been around a long time (although I am not aware of an academic paper about it): Generate the DFA on-demand, only for those transitions that are actually used. Then the worst-case memory demands are O(n) where n is the length of the input string (and in the usual case the memory usage grows much slower). In tree-parsing automata (i.e., not for regexps) I have had cases where the size of the statically generated automaton was a problem <https://www.complang.tuwien.ac.at/anton/lazyburg/gforth9.burg>: We could not generate the automata for the largest tree grammars in practical time and even had problems compiling some of the code that we could generate. OTOH, on-demand generation of automaton states worked even for the largest tree grammars, and memory consumption was on the order of 2MB IIRC. The numbers will be different for regexps, but I expect the effect to be the same: on-demand automata will work in many cases where static automata take too long to generate or are too big to run. The paper about this work is <http://www.complang.tuwien.ac.at/papers/ertl+06pldi.ps.gz> - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/
[toc] | [prev] | [next] | [standalone]
| From | Chris F Clark <cfc@shell01.TheWorld.com> |
|---|---|
| Date | 2011-06-19 21:45 -0400 |
| Subject | Matching very large patterns, was Re: Inverse grep |
| Message-ID | <11-06-033@comp.compilers> |
| In reply to | #148 |
Our moderator asked: > How serious is the explosion problem in practice, now that compilers > no longer have to fit into 64k? We use lex scanners with a thousand > tokens, and spamassassin feeds about 2000 patterns to re2c if you use > the compiled patterns feature. None of them seem to strain our storage > systems. -John For compilers on mainline chips this is generally not a problem. However, this is such a useful tool, it gets used in areas where the problems are much-much larger (or the device is smaller). The Aho-Corasick algorithm is one of the key algorithms in matching a set of strings to a large set of incoming data. The two places where I've seen it used and where size is a problem is networking and computational biology. I know less about the computational biology use-cases so I'll dispense with that first. As I understand it, the basic idea in DNA and protein databases is that you have long lists of strings that represent know genomes or protein foldings, and you match your unknown DNA or protein against that and see what matches you get. The number of known genomes and protein foldings is very large, on the order of millions. It is probable that generating the DFA on the fly might work in that application as Anton suggests, since one can generally apply a mainstream processor to the problem and one isn't resource constrained. On the networking side, there are numerous applications of this problem. It's the case where one has a hammer, every problem looks like a nail. The unix world loves writing tools that match regexes or sets of strings or some nearly equivalent notation (e.g. ACLs, globs, URL fragments) to the traffic. The first case is anti-virus (AV). The number of virus signatures is again in the millions. This is not enough to make the database "too big" in the sense that it won't fit in memory, but it does make it too big to fit in cache. So, although we can construct a DFA that fits in memory to represent the problem, the resulting DFA doesn't fit in cache and ends up thrashing the cache. This adversely effects the processor performance, as modern processors run many times fater than main memory speeds and depend on caching to keep that performance balance. These AV patterns are effective cache-walkers and any size of cache that the processor is provisioned with gets exceeded. In addition, the databasse is growing at a phenomenal rate. New viruses are added at the rate of several per hour and that rate is increaing. (I believe I've seen that a new virus signature is added every 9 minutes on average.) In fact, the rate of new virus detecting is so fast, that it isn't done by humans any more. Machine learning of the viruses from honeypots is becoming the only way to keep up with the rate of virus growth. The fact that the hackers are now using generators to create new unique polymorphic viruses for each attack makes this fact unsurprising--automation is balanced by automation. To lessen the dependence on AV, black-listing and white-listing of web sites has become popular. However, the numbers of URLs is still sufficiently large that it presents a similar performance issue. The problem in this case is exacerbated by the fact that the list of URLs is not a static database kept on the client, but instead is gotten from a web server. In this case, building the DFA on the fly might be pratical because the cost is driven by the downloading of the URLs off the web. Intrusion detection and prevention systems (IDS/IPS) are a smaller but similar problem, with a database on the order of 100k patterns with many of the patterns requiring back-references and other non-DFA functionality. Although cache thrashing is still somewhat of a problem if we solve the problem on the CPU, the bigger issue is that this problem would like to me moved out of the central processor to the network edge (i.e. the NIC) so that the traffic is never brought into the main memory system at all, as even the cost of moving the traffic onto the memory is a signficant load. At the edge we experience must harder contraints, 100kB is still a lot of memory to put into a peripheral device and typical DFA sizes are on the order of MBs. For example, when I first investigated Snort a few years ago, the DFA compiled to 160MB, but the target device we wanted to build would have had a memory size of 8-32kB. Even a pure NFA representation of the Snort DB wouldn't fit in that target. The current favorite algorithm that every network device is trying to solve is applicaton identification (often called Deep Packet Inspection or DPI). The easiest way to envision it, is presume you have written a suite of compilers, C, C++, C#, Java, Pascal, PL/I, Haskell, ML, F#, Lisp, Scheme, etc. and for any given program, you run all the compilers in parallel and if one of them compiles the program successfully, you say that is the language the program is written in. Now, that method is untenable as the sum of the grammars is simply too large and instead one has regular expressions that look for typical distinctions in the protocols, just as virus signatures look for specific patterns that are part of the virus or IDS systems look typical for attack vectors. The resulting set of regular expressions is similar in size to the IDS problem and again the target is the network edge. As a result of these memory limitations, there is a fairly active community trying to find new and novel ways of representing regular expressions that have the best characteristics of NFAs (linear size growth) and DFAs (linear processing time). Some clever ways of avoiding DFA explosion have been found for important common cases. If you find this area potentially interesting, I would recommend looking up the papers by Michela Becchi and the work at IBM on dot-star as two good places to start further investigation. It's an interesting area to work because the desire to push this into smaller, cheaper, less power hungry devices has just begun. Today we want these algorithms to run in routers (including SOHO routers bought off-the-shelf). The desire to run them in smartphones and tablets is appearing. You want your cars' network to be scure, so they will be needed in the SOCs for that. Beyond that, there are these micro-devices called "motes" which are basically a radio and some sensors that want to be secure. Eventually, it is possible that a processor for this area could be embedded in your "smart" credit-card. Hope this helps, -Chris ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris
[toc] | [prev] | [next] | [standalone]
| From | glen herrmannsfeldt <gah@ugcs.caltech.edu> |
|---|---|
| Date | 2011-06-20 04:52 +0000 |
| Subject | Re: Matching very large patterns, was Re: Inverse grep |
| Message-ID | <11-06-036@comp.compilers> |
| In reply to | #159 |
Chris F Clark <cfc@shell01.theworld.com> wrote: > Our moderator asked: >> How serious is the explosion problem in practice, now that compilers >> no longer have to fit into 64k? We use lex scanners with a thousand >> tokens, and spamassassin feeds about 2000 patterns to re2c if you use >> the compiled patterns feature. None of them seem to strain our storage >> systems. -John > For compilers on mainline chips this is generally not a problem. > However, this is such a useful tool, it gets used in areas where the > problems are much-much larger (or the device is smaller). The > Aho-Corasick algorithm is one of the key algorithms in matching a set > of strings to a large set of incoming data. The two places where I've > seen it used and where size is a problem is networking and > computational biology. The favorite for computational biology for many years was dynamic programming, though often it is too slow. The dynamic programming algorithm used for diff originated in the one Needleman-Wunsch used for protein sequences. (I believe that it is referenced in the diff documentation.) More recently, the Burrows-Wheeler transform, commonly used in compression, has become popular for computational biology. BLAST is based on a DFA similar to the first part of Aho-Corasick, but dynamic programming is used instead of the usual second part. -- glen
[toc] | [prev] | [next] | [standalone]
| From | Paul Wankadia <junyer@gmail.com> |
|---|---|
| Date | 2012-05-28 07:50 -0700 |
| Message-ID | <12-05-026@comp.compilers> |
| In reply to | #141 |
On Jun 9 2011, 9:01 am, glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote: > I suppose this is a strange question, but I was wondering if > there was ever something like an inverse grep. That is, > match a string against a file full of regular expressions. > > Now, one could just read the file, compile the regex one at > a time, and do the match, but maybe there is another way. The best way to match lots of regular expressions is not to match lots of regular expressions. :) In a nutshell, extract the fixed substrings of the regular expressions, identify which of the fixed substrings (if any) are present in the input string and determine which of the regular expressions (if any) are worth trying to match. The RE2 library includes this functionality, but you will need to BYO Aho-Corasick implementation. Refer to http://code.google.com/p/re2/source/browse/re2/filtered_re2.h.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web