Path: csiph.com!xmission!news.snarked.org!news.linkpendium.com!news.linkpendium.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Clint O <clint.olsen@gmail.com>
Newsgroups: comp.compilers
Subject: Regular expression string searching & matching
Date: Sun, 4 Mar 2018 01:37:02 -0800 (PST)
Organization: Compilers Central
Lines: 78
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <18-03-016@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="67578"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, DFA, question
Posted-Date: 04 Mar 2018 09:41:14 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:1973

Hi:

I'm working on an implementation of Brzozowski derivatives[1] to translate
regular expressions into a DFAs for potential use as a RE debugger and
programming tool. I've been using the paper "Regular-expression derivatives
reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation.

One of the interesting things about using derivatives is that it allows for
some elegant RE set notation. So, rather than just the union -

r: a | b  # a OR b

we also have set intersection and complement as well.

I've got things working pretty well, and I'm trying to test things out.
Unfortunately, I haven't seen much discussion on how string search works,
except for what little I know about the behavior of flex and grep.

Match is pretty straightforward: You begin in the starting state q0 and read
the input and follow the state transitions. If you get to the end of the
string without hitting the error state and you are in an accepting state
(Brzozowski refers to states as "nullable"), then this string is accepted by
this RE.

Search is a different beast: Here we have a string of input and we are
interested in knowing if any substring in the input matches our RE (like
grep).

The naive approach I have taken has been to start the DFA at every possible
position in the string, and start simulating the DFA. If I hit any accepting
states, I record the start, stop offsets. If I hit the error state or end of
string, I move to the next starting position. I then merge all of the
intervals at the end to cover the cases where I hit an accepting state
multiple times for a given start point. Note: I assume this is the faithful
way to do it because it's my understanding that we should aim for the longest
possible match. If I stop at the first accepting state, I will miss
potentially longer matches.

I hit an interesting case when trying out finding C-comments. Owens mentions
that with complement you can find non-nested C-comments with:

/\* ~(.* \*/ .*) \*/

and this works for me. However, I wasn't able to find an RE _without_ using
complement that exhibits identical behavior because of the inherent greedy
nature of REs that make it scan through successive non-overlapping comments in
the text until the final closing '*/'.

There was an interesting stackoverflow discussion[3] on this with links to
explanatory regex interpreters, and when I tried:

/\*[^*]*\*+(?:[^/*][^*]*\*+)*/

for me it continues through the input until it finds the _last_ '*/'.

I also tried constructing my own versions and came up with similar results.

Questions:

1) Am I applying the RE/DFA search algorithm correctly?

2) Is it possible to implement the non-greedy versions of '*', '+', '?' using
an RE->DFA implementation? A quick search made it sound like only backtracking
implementations can implement this behavior?

Thanks,

-Clint


[1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal
of the ACM, 11(4), 481-494.

[2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives
re-examined. Journal of Functional Programming 19 (2), 173-190

[3]
https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment