| From | Chris F Clark <cfc@shell01.TheWorld.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Matching very large patterns, was Re: Inverse grep |
| Date | 2011-06-19 21:45 -0400 |
| Organization | The World Public Access UNIX, Brookline, MA |
| Message-ID | <11-06-033@comp.compilers> |
| References | <11-06-015@comp.compilers> <11-06-022@comp.compilers> |
Our moderator asked:

> How serious is the explosion problem in practice, now that compilers
> no longer have to fit into 64k? We use lex scanners with a thousand
> tokens, and spamassassin feeds about 2000 patterns to re2c if you use
> the compiled patterns feature. None of them seem to strain our storage
> systems. -John

For compilers on mainline chips this is generally not a problem. However, this is such a useful tool that it gets used in areas where the problems are much, much larger (or the device is smaller). The Aho-Corasick algorithm is one of the key algorithms for matching a set of strings against a large stream of incoming data (a sketch of it appears below). The two places where I've seen it used and where size is a problem are networking and computational biology.

I know less about the computational biology use cases, so I'll dispense with them first. As I understand it, the basic idea in DNA and protein databases is that you have long lists of strings that represent known genomes or protein foldings, and you match your unknown DNA or protein against them to see what matches you get. The number of known genomes and protein foldings is very large, on the order of millions. Generating the DFA on the fly, as Anton suggests, would probably work in that application, since one can generally apply a mainstream processor to the problem and one isn't resource constrained.

On the networking side, there are numerous applications of this problem. It's a case of "when one has a hammer, every problem looks like a nail": the Unix world loves writing tools that match regexes, sets of strings, or some nearly equivalent notation (e.g. ACLs, globs, URL fragments) against the traffic.

The first case is anti-virus (AV). The number of virus signatures is again in the millions. This is not enough to make the database "too big" in the sense that it won't fit in memory, but it does make it too big to fit in cache. So, although we can construct a DFA that fits in memory to represent the problem, the resulting DFA doesn't fit in cache and ends up thrashing it. This adversely affects processor performance, as modern processors run many times faster than main memory and depend on caching to keep that performance balance. These AV patterns are effective cache-walkers, and whatever size of cache the processor is provisioned with gets exceeded.

In addition, the database is growing at a phenomenal rate. New viruses are added at a rate of several per hour, and that rate is increasing. (I believe I've seen that a new virus signature is added every 9 minutes on average.) In fact, new viruses are detected so fast that the work isn't done by humans any more; machine learning of the viruses from honeypots is becoming the only way to keep up with the rate of virus growth. The fact that attackers now use generators to create a new, unique polymorphic virus for each attack makes this unsurprising--automation is balanced by automation.

To lessen the dependence on AV, black-listing and white-listing of web sites has become popular. However, the number of URLs is still large enough to present a similar performance issue. The problem in this case is exacerbated by the fact that the list of URLs is not a static database kept on the client, but instead is fetched from a web server. In this case, building the DFA on the fly might be practical, because the cost is dominated by downloading the URLs over the web.
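For concreteness, here is a minimal sketch (in Python, not drawn from any production scanner) of the Aho-Corasick idea referred to above: build a trie of the pattern set, add failure links with a breadth-first pass, then scan the input once, reporting every occurrence of every pattern. Real AV/IDS engines use compressed, cache-aware state tables rather than hash maps; this only illustrates the algorithm.

```python
from collections import deque

def build_automaton(patterns):
    # One node per state: goto transitions, failure link, output set.
    goto = [{}]     # goto[state][symbol] -> next state
    fail = [0]      # failure links
    out  = [set()]  # patterns recognized on entering the state

    # Phase 1: build the trie of all patterns.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)

    # Phase 2: breadth-first pass to compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def scan(text, goto, fail, out):
    # Walk the automaton over the input, following failure links on a mismatch,
    # so the whole pattern set is matched in a single pass over the text.
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits

if __name__ == "__main__":
    g, f, o = build_automaton(["he", "she", "his", "hers"])
    print(sorted(scan("ushers", g, f, o)))   # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Note that the automaton's size grows only with the total length of the patterns; the cache problems described above come from the sheer number of patterns (and hence states), not from any per-pattern blowup.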
Intrusion detection and prevention systems (IDS/IPS) are a smaller but similar problem, with a database on the order of 100k patterns, many of which require back-references and other non-DFA functionality. Although cache thrashing is still somewhat of a problem if we solve this on the CPU, the bigger issue is that one would like to move this work out of the central processor to the network edge (i.e. the NIC), so that the traffic is never brought into the main memory system at all; even the cost of moving the traffic into memory is a significant load. At the edge we face much harder constraints: 100kB is still a lot of memory to put into a peripheral device, and typical DFA sizes are on the order of MBs. For example, when I first investigated Snort a few years ago, the DFA compiled to 160MB, but the target device we wanted to build would have had a memory size of 8-32kB. Even a pure NFA representation of the Snort DB wouldn't fit in that target.

The current favorite problem that every network device vendor is trying to solve is application identification (often called Deep Packet Inspection or DPI). The easiest way to envision it is to presume you have written a suite of compilers (C, C++, C#, Java, Pascal, PL/I, Haskell, ML, F#, Lisp, Scheme, etc.) and, for any given program, you run all the compilers in parallel; if one of them compiles the program successfully, you say that is the language the program is written in. That method is untenable, as the sum of the grammars is simply too large, so instead one has regular expressions that look for typical distinctions between the protocols, just as virus signatures look for specific patterns that are part of the virus, or IDS systems look for typical attack vectors. The resulting set of regular expressions is similar in size to the IDS problem, and again the target is the network edge.

As a result of these memory limitations, there is a fairly active community trying to find new and novel ways of representing regular expressions that have the best characteristics of NFAs (linear size growth) and DFAs (linear processing time). Some clever ways of avoiding DFA explosion have been found for important common cases (the sketch at the end of this message shows how quickly that explosion sets in). If you find this area potentially interesting, I would recommend looking up the papers by Michela Becchi and the work at IBM on dot-star as two good places to start further investigation.

It's an interesting area to work in, because the push to run this on smaller, cheaper, less power-hungry devices has only just begun. Today we want these algorithms to run in routers (including SOHO routers bought off the shelf). The desire to run them in smartphones and tablets is appearing. You want your car's network to be secure, so they will be needed in the SOCs for that. Beyond that, there are micro-devices called "motes", basically a radio and some sensors, that want to be secure. Eventually, it is possible that a processor for this area could be embedded in your "smart" credit card.

Hope this helps,
-Chris

******************************************************************************
Chris Clark -- email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. -- Web Site: http://world.std.com/~compres
23 Bailey Rd -- voice: (508) 435-5016
Berlin, MA 01503 USA -- twitter: @intel_chris
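As a concrete illustration of the NFA/DFA size trade-off mentioned above (a sketch using a made-up pattern family, not real signatures): determinizing the classic pattern (a|b)*a(a|b)^(k-1), i.e. "the k-th symbol from the end is an a", turns an NFA of k+1 states into a DFA of 2^k states. This "dot-star followed by a bounded gap" shape is the kind of construct that appears in signature sets and that the work cited above tries to handle without paying the exponential cost.

```python
# Count DFA states produced by subset construction for (a|b)* a (a|b)^(k-1).
# The NFA has k+1 states; the reachable DFA has 2^k.

def nfa_for(k):
    # NFA states 0..k; 0 is the start (loops on a and b), k is accepting.
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for s in range(1, k):
        delta[(s, 'a')] = {s + 1}
        delta[(s, 'b')] = {s + 1}
    return delta

def dfa_state_count(delta):
    # Standard subset construction, counting only the reachable DFA states.
    start = frozenset({0})
    seen, worklist = {start}, [start]
    while worklist:
        S = worklist.pop()
        for ch in 'ab':
            T = frozenset(q for s in S for q in delta.get((s, ch), ()))
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    return len(seen)

for k in range(1, 13):
    print(f"k={k:2d}   NFA states: {k + 1:3d}   DFA states: {dfa_state_count(nfa_for(k)):5d}")
```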
Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-08 23:01 +0000
Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-12 14:16 -0400
Re: Inverse grep Tony Finch <dot@dotat.at> - 2011-06-13 17:48 +0100
Re: Inverse grep torbenm@diku.dk (Torben Ægidius Mogensen) - 2011-06-14 11:16 +0200
Re: Inverse grep anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2011-06-15 12:28 +0000
Matching very large patterns, was Re: Inverse grep Chris F Clark <cfc@shell01.TheWorld.com> - 2011-06-19 21:45 -0400
Re: Matching very large patterns, was Re: Inverse grep glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-06-20 04:52 +0000
Re: Inverse grep Paul Wankadia <junyer@gmail.com> - 2012-05-28 07:50 -0700