Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.linkpendium.com!news.linkpendium.com!news.iecc.com!nerds-end
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Subject: Re: Inverse grep
Date: Wed, 15 Jun 2011 12:28:07 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 39
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <11-06-024@comp.compilers>
References: <11-06-015@comp.compilers> <11-06-022@comp.compilers>
NNTP-Posting-Host: news.iecc.com
X-Trace: gal.iecc.com 1308269042 83949 64.57.183.58 (17 Jun 2011 00:04:02 GMT)
X-Complaints-To: abuse@iecc.com
NNTP-Posting-Date: Fri, 17 Jun 2011 00:04:02 +0000 (UTC)
Keywords: lex, performance
Posted-Date: 16 Jun 2011 20:04:01 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: x330-a1.tempe.blueboxinc.net comp.compilers:150

torbenm@diku.dk (Torben Fgidius Mogensen) writes:
>glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:
>
>> I suppose this is a strange question, but I was wondering if
>> there was ever something like an inverse grep.  That is,
>> match a string against a file full of regular expressions. ...

>[How serious is the explosion problem in practice, now that compilers
>no longer have to fit into 64k? We use lex scanners with a thousand
>tokens, and spamassassin feeds about 2000 patterns to re2c if you use
>the compiled patterns feature. None of them seem to strain our storage
>systems. -John]

Even if it was a problem, the solution has been around a long time
(although I am not aware of an academic paper about it): Generate the
DFA on-demand, only for those transitions that are actually used.
Then the worst-case memory demands are O(n) where n is the length of
the input string (and in the usual case the memory usage grows much
slower).

In tree-parsing automata (i.e., not for regexps) I have had cases
where the size of the statically generated automaton was a problem
<https://www.complang.tuwien.ac.at/anton/lazyburg/gforth9.burg>: We
could not generate the automata for the largest tree grammars in
practical time and even had problems compiling some of the code that
we could generate.  OTOH, on-demand generation of automaton states
worked even for the largest tree grammars, and memory consumption was
on the order of 2MB IIRC.  The numbers will be different for regexps,
but I expect the effect to be the same: on-demand automata will work
in many cases where static automata take too long to generate or are
too big to run.  The paper about this work is
<http://www.complang.tuwien.ac.at/papers/ertl+06pldi.ps.gz>

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/