Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.linkpendium.com!news.linkpendium.com!news.iecc.com!nerds-end From: anton@mips.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.compilers Subject: Re: Inverse grep Date: Wed, 15 Jun 2011 12:28:07 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Lines: 39 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <11-06-024@comp.compilers> References: <11-06-015@comp.compilers> <11-06-022@comp.compilers> NNTP-Posting-Host: news.iecc.com X-Trace: gal.iecc.com 1308269042 83949 64.57.183.58 (17 Jun 2011 00:04:02 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Fri, 17 Jun 2011 00:04:02 +0000 (UTC) Keywords: lex, performance Posted-Date: 16 Jun 2011 20:04:01 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: x330-a1.tempe.blueboxinc.net comp.compilers:150 torbenm@diku.dk (Torben Fgidius Mogensen) writes: >glen herrmannsfeldt writes: > >> I suppose this is a strange question, but I was wondering if >> there was ever something like an inverse grep. That is, >> match a string against a file full of regular expressions. ... >[How serious is the explosion problem in practice, now that compilers >no longer have to fit into 64k? We use lex scanners with a thousand >tokens, and spamassassin feeds about 2000 patterns to re2c if you use >the compiled patterns feature. None of them seem to strain our storage >systems. -John] Even if it was a problem, the solution has been around a long time (although I am not aware of an academic paper about it): Generate the DFA on-demand, only for those transitions that are actually used. Then the worst-case memory demands are O(n) where n is the length of the input string (and in the usual case the memory usage grows much slower). In tree-parsing automata (i.e., not for regexps) I have had cases where the size of the statically generated automaton was a problem : We could not generate the automata for the largest tree grammars in practical time and even had problems compiling some of the code that we could generate. OTOH, on-demand generation of automaton states worked even for the largest tree grammars, and memory consumption was on the order of 2MB IIRC. The numbers will be different for regexps, but I expect the effect to be the same: on-demand automata will work in many cases where static automata take too long to generate or are too big to run. The paper about this work is - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/