Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Re: Interesting paper on regex NFA matching
Date: Thu, 01 Feb 2024 04:09:48 +0200
Organization: Compilers Central
Sender: johnl%iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <24-01-010@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="25978"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 31 Jan 2024 22:48:43 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3552

Kaz Kylheku wrote:

> But some backtracking is required for recognizing the longest matching
> prefixes of the input, even if we use nothing but basic regex features!

The key word in your sentence is "prefixes".  The recognition problem
is not about prefixes, but about complete strings.  This is why my answer
mentioned that the distinction between lexing (and parsing) and recognition
is not covered in any automata books I have read.

The definition of regular expressions is that when the finite automata
halts (at the end of the string or when there are no transitions out
of the state that it is in) it is either in an accepting state or not.  That's
the language recognition problem.  There is no mention of "backing
up to the last state that accepted."  And for language recognition
that is important, otherwise, any random trailing garbage after a
legal string would still be considered a match because a prefix
matched.

Thus, given your regular expression (a*b)+.,

b, ab, aab, aaab, ... , bb, abb, aabb, ..., bab, baab, ....,  aaaabaaab
are all in the language, but aaabaaabaaac does not match that
regular expression.  A prefix of it does, but not the entire string.
And, thus the string aaabaaabaaac does not satisfy the language
recognition problem.

Now, when you ask does a string which is a prefix of aaabaaabaaac
match the regular expression, you get two strings which qualify aaab
and aaabaab.But that's not the *language recognition* problem.
That's a different problem, one relevant to lexing.

However, either string satisfies the question of is this string in the language.
Thus, both strings aaab and aaabaaab satisfy the language recognition
problem.  Only one satisfies your definition of the "correct" lexing problem
because you are fixated on a specific extension of how regular expressions
need to work.  It is not a bad extension, but it is an extension of the
meaning of regular expressions in the lexing context which is not the
context of language recognition which is where regular expressions
were defined.

Notably, if one focuses on the language recognition problem, the one
where prefixes don't count, only whole strings do, then whether the |
operator is commutative or not, does not matter.  It only matters when
one can terminate (and accept) without considering the entire string.

Still, there are cases where one might prefer to get the shortest answer,
rather than the longest answer.  Or cases, where one might prefer a
shorter answer when two possible answers match.  Those situations
require a more complex language.  A language which regular expressions
are only a subset of.  This class of languages is what the PCRE (and PEG)
extensions allow one to access.  It is possible to express every regular
expression in those languages, trivially so with PCRE and with only a
small amount of care with PEGs (whose / operator is non-commutative
and chosen that way not to be confused with the | operator).   And,
moreover, you can get the same results for prefix matching if that is
what you desire.

However, you can also get results where other matches are the ones
that are "preferred" and returned.  The user simply has more choices.
Now, if one hasn't chosen wisely and one has carelessly written an
"extended" regular expression which returns a different preferred choice
than the one which one expected, that is the fault of the writer not the
formalism.  People are just as capable of writing buggy regular expressions
as they are of writing buggy programs, and that's true whether using
the extensions or not.  Maybe the user really wanted "(a+b)+" or the
more complex "(a*b)+a*" or some other variant.

And while there are buggy implementations of PCRE regular expressions,
that doesn't mean the formalism itself is bad.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------