Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.compilers > #3548 > unrolled thread

Re: Interesting paper on regex NFA matching

Started byChristopher F Clark <christopher.f.clark@compiler-resources.com>
First post2024-01-27 12:47 +0200
Last post2024-01-27 20:20 +0000
Articles 2 — 2 participants

Back to article view | Back to comp.compilers


Contents

  Re: Interesting paper on regex NFA matching Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2024-01-27 12:47 +0200
    Re: Interesting paper on regex NFA matching Kaz Kylheku <433-929-6894@kylheku.com> - 2024-01-27 20:20 +0000

#3548 — Re: Interesting paper on regex NFA matching

FromChristopher F Clark <christopher.f.clark@compiler-resources.com>
Date2024-01-27 12:47 +0200
SubjectRe: Interesting paper on regex NFA matching
Message-ID<24-01-006@comp.compilers>
Kaz Kylheku <433-929-6894@kylheku.com> wrote:

> (I personaly am not so interested in backtracking regexes; and they are
> not directly applicable to compilers. You'd never want to design a
> lexical analyzer that requires "lookaround" or whatnot in order to
> recognize tokens. It's nice they are generating Arxiv papers for
> someone; no genuine intellectual inquiry is invalid; good luck! It's
> not applicable to the "Programming Languages" category it has been filed
> under. I will skim through it, though.)

You are mistaken that they are not applicable to programming languages.

Even Pascal [originally] had cases where lookahead and backtracking
are required to properly tokenize the input.

The well-known case is "1." if followed by another "." it is two tokens,
and if followed by a digit it is a floating point number.
I don't recall what it should be if followed by any other character.

In any case, parser generators like ANTLR use backtracking to great
effect (and yes, we "borrowed" that idea in Yacc++) and so did the
entire PEG (parsing expression grammar) community, using memoizartion
to give linear time performance in many cases.

Thus, to me, it is no surprise to see this reintegrated back into the
PCRE world.

Moreover, even if it is of little interest [mistakenly in my opinion] to the
"programming language" community, it is of definite interest in the
"real world:"  PCRE regular expressions are both used (and abused)
in many applications, including SNORT where linear performance and
preventing DDOS attacks would be of prime importance.  (While at Intel,
I had the honor of working with the Hyperscan (Sensory Networks) team
and improving PCRE performance for SNORT was a major "selling point"

And "greedy" (longest match), "non-greedy" (shortest match) and other
variations like "lexically first in the grammar" matching all have their
applications.  Saying that only one of them is correct is ignoring the
needs of actual users.

And, yes, we had discussions about "automata theory correct" interpretations
of what certain PCRE expressions meant and how even "canonical"
implementations had various "bugs" that made their meaning ambiguous.

By the way, to give credit where due, we "borrowed" the idea of
"tunnel automatons" from Josef Grosch to implement controlled
backtracking in Yacc++. It lets one do subset construction (i.e.
the LR(0) algorithm) for items, breaking out for "syntactic predicates"
exactly as needed.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [next] | [standalone]


#3549

FromKaz Kylheku <433-929-6894@kylheku.com>
Date2024-01-27 20:20 +0000
Message-ID<24-01-007@comp.compilers>
In reply to#3548
On 2024-01-27, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
> Kaz Kylheku <433-929-6894@kylheku.com> wrote:
>
>> (I personaly am not so interested in backtracking regexes; and they are
>> not directly applicable to compilers. You'd never want to design a
>> lexical analyzer that requires "lookaround" or whatnot in order to
>> recognize tokens. It's nice they are generating Arxiv papers for
>> someone; no genuine intellectual inquiry is invalid; good luck! It's
>> not applicable to the "Programming Languages" category it has been filed
>> under. I will skim through it, though.)
>
> You are mistaken that they are not applicable to programming languages.
>
> Even Pascal [originally] had cases where lookahead and backtracking
> are required to properly tokenize the input.
>
> The well-known case is "1." if followed by another "." it is two tokens,
> and if followed by a digit it is a floating point number.
> I don't recall what it should be if followed by any other character.

I think that case can be handled by technology like Lex trailing
contexts, which doesn't require backtracking.

  {digits}/. {
    // ... set up semantic value ..
    return INTEGER;
  }

  {digits}.{digits} {
    // ... set up semantic value ..
    return FLOAT;
  }

Another possibility is to recongize {digits}. as one lexeme and call
unput('.'). That's a minor form of backtracking, that doesn't require us
to do anything funny to the regex implementation.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[Patterns with alternatives and trailing contexts do have to back up
and the flex manual warns it's quite expensive. See the
PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]

[toc] | [prev] | [standalone]


Back to top | Article view | comp.compilers


csiph-web