Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.alt.net!news.kjsl.com!usenet.stanford.edu!news.iecc.com!nerds-end From: glen herrmannsfeldt Newsgroups: comp.compilers Subject: Re: Matching very large patterns, was Re: Inverse grep Date: Mon, 20 Jun 2011 04:52:28 +0000 (UTC) Organization: A noiseless patient Spider Lines: 30 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <11-06-036@comp.compilers> References: <11-06-015@comp.compilers> <11-06-022@comp.compilers> <11-06-033@comp.compilers> NNTP-Posting-Host: news.iecc.com X-Trace: gal.iecc.com 1308717545 44591 64.57.183.58 (22 Jun 2011 04:39:05 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Wed, 22 Jun 2011 04:39:05 +0000 (UTC) Keywords: lex, storage Posted-Date: 22 Jun 2011 00:39:05 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: x330-a1.tempe.blueboxinc.net comp.compilers:162 Chris F Clark wrote: > Our moderator asked: >> How serious is the explosion problem in practice, now that compilers >> no longer have to fit into 64k? We use lex scanners with a thousand >> tokens, and spamassassin feeds about 2000 patterns to re2c if you use >> the compiled patterns feature. None of them seem to strain our storage >> systems. -John > For compilers on mainline chips this is generally not a problem. > However, this is such a useful tool, it gets used in areas where the > problems are much-much larger (or the device is smaller). The > Aho-Corasick algorithm is one of the key algorithms in matching a set > of strings to a large set of incoming data. The two places where I've > seen it used and where size is a problem is networking and > computational biology. The favorite for computational biology for many years was dynamic programming, though often it is too slow. The dynamic programming algorithm used for diff originated in the one Needleman-Wunsch used for protein sequences. (I believe that it is referenced in the diff documentation.) More recently, the Burrows-Wheeler transform, commonly used in compression, has become popular for computational biology. BLAST is based on a DFA similar to the first part of Aho-Corasick, but dynamic programming is used instead of the usual second part. -- glen