Path: csiph.com!3.us.feeder.erje.net!feeder.erje.net!news.linkpendium.com!news.linkpendium.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Ben Hanson <jamin.hanson@googlemail.com>
Newsgroups: comp.compilers
Subject: Re: Regular expression string searching & matching
Date: Wed, 7 Mar 2018 11:53:15 -0800 (PST)
Organization: Compilers Central
Lines: 97
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <18-03-032@comp.compilers>
References: <18-03-016@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="18151"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 07 Mar 2018 14:59:08 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:1982

[/][*]([^*]|[*]+[^*/])*[*]+[/]

is what you are looking for. I ran into this when developing my lexer
generator library lexertl in C++. Having a debug::dump() function
really helped me grok what was going on.

The trick of course is realising that you have to exclude the
characters that follow (i.e. the [^*/] part). That is the bit that
clobbers the greedy behaviour. I've had to remind myself of that on
more than one occasion recently!

Hope that helps.

Regards,

Ben

On Sunday, 4 March 2018 14:41:17 UTC, Clint O  wrote:
> Hi:
>
> I'm working on an implementation of Brzozowski derivatives[1] to translate
> regular expressions into a DFAs for potential use as a RE debugger and
> programming tool. I've been using the paper "Regular-expression derivatives
> reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation.
>
> One of the interesting things about using derivatives is that it allows for
> some elegant RE set notation. So, rather than just the union -
>
> r: a | b  # a OR b
>
> we also have set intersection and complement as well.
>
> I've got things working pretty well, and I'm trying to test things out.
> Unfortunately, I haven't seen much discussion on how string search works,
> except for what little I know about the behavior of flex and grep.
>
> Match is pretty straightforward: You begin in the starting state q0 and read
> the input and follow the state transitions. If you get to the end of the
> string without hitting the error state and you are in an accepting state
> (Brzozowski refers to states as "nullable"), then this string is accepted by
> this RE.
>
> Search is a different beast: Here we have a string of input and we are
> interested in knowing if any substring in the input matches our RE (like
> grep).
>
> The naive approach I have taken has been to start the DFA at every possible
> position in the string, and start simulating the DFA. If I hit any accepting
> states, I record the start, stop offsets. If I hit the error state or end of
> string, I move to the next starting position. I then merge all of the
> intervals at the end to cover the cases where I hit an accepting state
> multiple times for a given start point. Note: I assume this is the faithful
> way to do it because it's my understanding that we should aim for the longest
> possible match. If I stop at the first accepting state, I will miss
> potentially longer matches.
>
> I hit an interesting case when trying out finding C-comments. Owens mentions
> that with complement you can find non-nested C-comments with:
>
> /\* ~(.* \*/ .*) \*/
>
> and this works for me. However, I wasn't able to find an RE _without_ using
> complement that exhibits identical behavior because of the inherent greedy
> nature of REs that make it scan through successive non-overlapping comments in
> the text until the final closing '*/'.
>
> There was an interesting stackoverflow discussion[3] on this with links to
> explanatory regex interpreters, and when I tried:
>
> /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
>
> for me it continues through the input until it finds the _last_ '*/'.
>
> I also tried constructing my own versions and came up with similar results.
>
> Questions:
>
> 1) Am I applying the RE/DFA search algorithm correctly?
>
> 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using
> an RE->DFA implementation? A quick search made it sound like only backtracking
> implementations can implement this behavior?
>
> Thanks,
>
> -Clint
>
>
> [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal
> of the ACM, 11(4), 481-494.
>
> [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives
> re-examined. Journal of Functional Programming 19 (2), 173-190
>
> [3]
> https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment