Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Kaz Kylheku <433-929-6894@kylheku.com> Newsgroups: comp.compilers Subject: Re: Interesting paper on regex NFA matching Date: Fri, 26 Jan 2024 04:16:38 -0000 Organization: Compilers Central Sender: johnl%iecc.com Approved: comp.compilers@iecc.com Message-ID: <24-01-005@comp.compilers> References: <24-01-003@comp.compilers> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="464"; mail-complaints-to="abuse@iecc.com" Keywords: lex, NFA, performance, comment Posted-Date: 26 Jan 2024 10:22:41 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3547 On 2024-01-26, John R Levine wrote: > The easiest way to implement regular expressions is to turn them into > NFAs, but that has well known problems that if you feed hostile input > to the NFAs, they can take exponential time and it's a way to DDoS the > server. Do you have a reference for this? I seem to recall that NFA simulation is quadratic or cubic at worst. The exponential problem that immediately comes to mind in connection with regexes is that the NFA to DFA subset construction can experience an exponential state explosion. The NFA simulator won't experience that expoential explosion because it only generates the subsets that occur from the specific input that it's given; it's not reasoning about all possible inputs. > if the regex is ambiguous, as many are, the DFA may match something > different from the NFA. Is that also documented somewhere? DFA matching something different from the NFA has to be a bug. The algorithms we have in this area are correct. What does it mean for a regex to be ambiguous? > This paper on the Arxiv preprint server proposes some rather complex > tweaks to NFA regex matching to fix many performance issues while giving > the same answers as before. > > https://arxiv.org/abs/2401.12639 I'm going to look at this in more detail, but I can tell you from the abstract that they are taking on backtracking regex implementations that suffer from "catastrophic backtracking", which informs me of the general direction. Backtracking regexes are used to implement the hacky features found in PCRE (perl compatible regex). They have almost nothing to do with building NFA graphs and simulating them, or going to DFA from that. Backtracking follows and interprets the abstract syntax tree of a regex. One known problem with backtracking regex implementations is that they don't treat branches; R1|R2|R3|..|R4 correctly. Backtrackers iterate over the branches left to right and try them. They stop when they encounter a match for the first one, and declare that the entire disjunction matched. That would be fine if we are just asking whether there is a match, but it's incorrect if we are matching a regex prefix (or looking for a substring) and need the longest match. For instance "a|aa" matches "aardvark", but only the first "a"! If we simulate the a|aa NFA graph, what we do is feed every character of aardvark until we hit an impasse (the machine informs us that no more input can be accepted). At that point, we consider the most recent character position where we had been in an accepted state. That position will be the second "a" of "aardvark". After we feed the first "a" to the simulator, it will report that it's in an acceptance state. But we keep going; we feed the second "a", and for that it also reports acceptance. When we feed the "r" the machine can indicate that it has hit a dead end: that character cannot be matched. It is at this point that we can take a shortcut and stop feeding more characters. The last time we had an acceptance state was just after we fed the second "a", and so we know that the regex matched the "aa" part of the word. The NFA doesn't care whether the original syntax is a|aa or aa|a: the operator is commutative. A DFA implementation will not exhibit the behavior of searching through the branches of a|aa left to right for a match, because it follows from the NFA. So because of this kind of situation, the backtracking regex faces difficulties in translation to DFA. It is caused by the incorrect, loose, hacky interpretation of the backtracking implementation which pull stunts like not maing A|B commutative. NFA simulation isn't backtracking; it doesn't hit those kinds of explosions inherent in backtracking. Backtracking is not required for implementing classic regexes, like POSIX EREs and BREs and whatnot, only Perl-like cruft. Backtracking gets into trouble due to searches occuring at multiple levels of recursion. It's like any recursive search with an exploding space. It's clear memoization can help under the right conditions, whereby it can recognize redundancy in the space, to avoid processing subspaces that are identical to ones already processed ("overlapping subproblems"). (The NFA to DFA construction involves a kind of memoization. How so? The algorithm generates subsets of NFA states for imaginary inputs, and turns those sets of NFA states into single DFA states. As it probes into that space, it has to recognize subsets it has already generated before! Each such a seen-before subset represents an existing DFA state that has already been created. The transitions should go to that state rather than making a new one which will turn out identical.) (Also, caching approaches are possible in simulating an NFA: instead of computing a DFA ahead of time, we lazily generate some of it as we go through the input, and cache some of it. Say up to 32 states, with some LRU replacement or whtaever. It's like "JIT" for NFAs. We get some of the speed benefit of DFA, but the state explosion of the DFA is thwarted.) (I personaly am not so interested in backtracking regexes; and they are not directly applicable to compilers. You'd never want to design a lexical analyzer that requires "lookaround" or whatnot in order to recognize tokens. It's nice they are generating Arxiv papers for someone; no genuine intellectual inquiry is invalid; good luck! It's not applicable to the "Programming Languages" category it has been filed under. I will skim through it, though.) -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal Mastodon: @Kazinator@mstdn.ca [You're right, it's polynomial, but that's bad enough. Ambiguous is if there are multiple subpatterns that can match the same string, e.g. (foo|[a-z]+) matching "foo". Often your code cares which one it matches even though that's arguably a bug. -John]