Re: Interesting paper on regex NFA matching

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups	comp.compilers
Subject	Re: Interesting paper on regex NFA matching
Date	2024-01-27 12:47 +0200
Organization	Compilers Central
Message-ID	<24-01-006@comp.compilers> (permalink)

Show all headers | View raw

Kaz Kylheku <433-929-6894@kylheku.com> wrote:

> (I personaly am not so interested in backtracking regexes; and they are
> not directly applicable to compilers. You'd never want to design a
> lexical analyzer that requires "lookaround" or whatnot in order to
> recognize tokens. It's nice they are generating Arxiv papers for
> someone; no genuine intellectual inquiry is invalid; good luck! It's
> not applicable to the "Programming Languages" category it has been filed
> under. I will skim through it, though.)

You are mistaken that they are not applicable to programming languages.

Even Pascal [originally] had cases where lookahead and backtracking
are required to properly tokenize the input.

The well-known case is "1." if followed by another "." it is two tokens,
and if followed by a digit it is a floating point number.
I don't recall what it should be if followed by any other character.

In any case, parser generators like ANTLR use backtracking to great
effect (and yes, we "borrowed" that idea in Yacc++) and so did the
entire PEG (parsing expression grammar) community, using memoizartion
to give linear time performance in many cases.

Thus, to me, it is no surprise to see this reintegrated back into the
PCRE world.

Moreover, even if it is of little interest [mistakenly in my opinion] to the
"programming language" community, it is of definite interest in the
"real world:"  PCRE regular expressions are both used (and abused)
in many applications, including SNORT where linear performance and
preventing DDOS attacks would be of prime importance.  (While at Intel,
I had the honor of working with the Hyperscan (Sensory Networks) team
and improving PCRE performance for SNORT was a major "selling point"

And "greedy" (longest match), "non-greedy" (shortest match) and other
variations like "lexically first in the grammar" matching all have their
applications.  Saying that only one of them is correct is ignoring the
needs of actual users.

And, yes, we had discussions about "automata theory correct" interpretations
of what certain PCRE expressions meant and how even "canonical"
implementations had various "bugs" that made their meaning ambiguous.

By the way, to give credit where due, we "borrowed" the idea of
"tunnel automatons" from Josef Grosch to implement controlled
backtracking in Yacc++. It lets one do subset construction (i.e.
the LR(0) algorithm) for items, breaking out for "syntactic predicates"
exactly as needed.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

Back to comp.compilers | Previous | Next — Next in thread | Find similar

Thread

Re: Interesting paper on regex NFA matching Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2024-01-27 12:47 +0200
  Re: Interesting paper on regex NFA matching Kaz Kylheku <433-929-6894@kylheku.com> - 2024-01-27 20:20 +0000

csiph-web