Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Kaz Kylheku <433-929-6894@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: Interesting paper on regex NFA matching
Date: Fri, 26 Jan 2024 04:16:38 -0000
Organization: Compilers Central
Sender: johnl%iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <24-01-005@comp.compilers>
References: <24-01-003@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="464"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, NFA, performance, comment
Posted-Date: 26 Jan 2024 10:22:41 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3547

On 2024-01-26, John R Levine <johnl@taugh.com> wrote:
> The easiest way to implement regular expressions is to turn them into
> NFAs, but that has well known problems that if you feed hostile input
> to the NFAs, they can take exponential time and it's a way to DDoS the
> server.

Do you have a reference for this? I seem to recall that NFA simulation
is quadratic or cubic at worst.

The exponential problem that immediately comes to mind in connection
with regexes is that the NFA to DFA subset construction can experience
an exponential state explosion.

The NFA simulator won't experience that expoential explosion because
it only generates the subsets that occur from the specific input
that it's given; it's not reasoning about all possible inputs.

> if the regex is ambiguous, as many are, the DFA may match something
> different from the NFA.

Is that also documented somewhere? DFA matching something different
from the NFA has to be a bug. The algorithms we have in this area
are correct.

What does it mean for a regex to be ambiguous?

> This paper on the Arxiv preprint server proposes some rather complex
> tweaks to NFA regex matching to fix many performance issues while giving
> the same answers as before.
>
> https://arxiv.org/abs/2401.12639

I'm going to look at this in more detail, but I can tell you from the
abstract that they are taking on backtracking regex implementations that
suffer from "catastrophic backtracking", which informs me of the general
direction.

Backtracking regexes are used to implement the hacky features found
in PCRE (perl compatible regex).

They have almost nothing to do with building NFA graphs and simulating
them, or going to DFA from that.

Backtracking follows and interprets the abstract syntax tree of a regex.

One known problem with backtracking regex implementations is
that they don't treat branches;  R1|R2|R3|..|R4 correctly.

Backtrackers iterate over the branches left to right and try them.
They stop when they encounter a match for the first one, and declare
that the entire disjunction matched. That would be fine if we are just
asking whether there is a match, but it's incorrect if we are matching a
regex prefix (or looking for a substring) and need the longest match.
For instance "a|aa" matches "aardvark", but only the first "a"!

If we simulate the a|aa NFA graph, what we do is feed every character of
aardvark until we hit an impasse (the machine informs us that no more
input can be accepted). At that point, we consider the most recent
character position where we had been in an accepted state. That position
will be the second "a" of "aardvark". After we feed the first "a"
to the simulator, it will report that it's in an acceptance state. But
we keep going; we feed the second "a", and for that it also reports
acceptance. When we feed the "r" the machine can indicate that it has
hit a dead end: that character cannot be matched. It is at this point
that we can take a shortcut and stop feeding more characters. The last
time we had an acceptance state was just after we fed the second "a",
and so we know that the regex matched the "aa" part of the word.
The NFA doesn't care whether the original syntax is a|aa or aa|a: the
operator is commutative.

A DFA implementation will not exhibit the behavior of searching through
the branches of a|aa left to right for a match, because it follows from
the NFA. So because of this kind of situation, the backtracking regex
faces difficulties in translation to DFA. It is caused by the incorrect,
loose, hacky interpretation of the backtracking implementation which
pull stunts like not maing A|B commutative.

NFA simulation isn't backtracking; it doesn't hit those kinds of
explosions inherent in backtracking.

Backtracking is not required for implementing classic regexes, like
POSIX EREs and BREs and whatnot, only Perl-like cruft.

Backtracking gets into trouble due to searches occuring at multiple
levels of recursion. It's like any recursive search with an exploding
space.  It's clear memoization can help under the right conditions,
whereby it can recognize redundancy in the space, to avoid processing
subspaces that are identical to ones already processed ("overlapping
subproblems").

(The NFA to DFA construction involves a kind of memoization. How so?
The algorithm generates subsets of NFA states for imaginary inputs,
and turns those sets of NFA states into single DFA states. As it
probes into that space, it has to recognize subsets it has already
generated before! Each such a seen-before subset represents an
existing DFA state that has already been created. The transitions
should go to that state rather than making a new one which will
turn out identical.)

(Also, caching approaches are possible in simulating an NFA:
instead of computing a DFA ahead of time, we lazily generate some of
it as we go through the input, and cache some of it. Say up to 32
states, with some LRU replacement or whtaever. It's like "JIT" for
NFAs. We get some of the speed benefit of DFA, but the state
explosion of the DFA is thwarted.)

(I personaly am not so interested in backtracking regexes; and they are
not directly applicable to compilers. You'd never want to design a
lexical analyzer that requires "lookaround" or whatnot in order to
recognize tokens. It's nice they are generating Arxiv papers for
someone; no genuine intellectual inquiry is invalid; good luck! It's
not applicable to the "Programming Languages" category it has been filed
under. I will skim through it, though.)

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
[You're right, it's polynomial, but that's bad enough.

Ambiguous is if there are multiple subpatterns that can match the
same string, e.g. (foo|[a-z]+) matching "foo".  Often your code cares
which one it matches even though that's arguably a bug.  -John]