Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.compilers > #1982

Re: Regular expression string searching & matching

From Ben Hanson <jamin.hanson@googlemail.com>
Newsgroups comp.compilers
Subject Re: Regular expression string searching & matching
Date 2018-03-07 11:53 -0800
Organization Compilers Central
Message-ID <18-03-032@comp.compilers> (permalink)
References <18-03-016@comp.compilers>

Show all headers | View raw


[/][*]([^*]|[*]+[^*/])*[*]+[/]

is what you are looking for. I ran into this when developing my lexer
generator library lexertl in C++. Having a debug::dump() function
really helped me grok what was going on.

The trick of course is realising that you have to exclude the
characters that follow (i.e. the [^*/] part). That is the bit that
clobbers the greedy behaviour. I've had to remind myself of that on
more than one occasion recently!

Hope that helps.

Regards,

Ben

On Sunday, 4 March 2018 14:41:17 UTC, Clint O  wrote:
> Hi:
>
> I'm working on an implementation of Brzozowski derivatives[1] to translate
> regular expressions into a DFAs for potential use as a RE debugger and
> programming tool. I've been using the paper "Regular-expression derivatives
> reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation.
>
> One of the interesting things about using derivatives is that it allows for
> some elegant RE set notation. So, rather than just the union -
>
> r: a | b  # a OR b
>
> we also have set intersection and complement as well.
>
> I've got things working pretty well, and I'm trying to test things out.
> Unfortunately, I haven't seen much discussion on how string search works,
> except for what little I know about the behavior of flex and grep.
>
> Match is pretty straightforward: You begin in the starting state q0 and read
> the input and follow the state transitions. If you get to the end of the
> string without hitting the error state and you are in an accepting state
> (Brzozowski refers to states as "nullable"), then this string is accepted by
> this RE.
>
> Search is a different beast: Here we have a string of input and we are
> interested in knowing if any substring in the input matches our RE (like
> grep).
>
> The naive approach I have taken has been to start the DFA at every possible
> position in the string, and start simulating the DFA. If I hit any accepting
> states, I record the start, stop offsets. If I hit the error state or end of
> string, I move to the next starting position. I then merge all of the
> intervals at the end to cover the cases where I hit an accepting state
> multiple times for a given start point. Note: I assume this is the faithful
> way to do it because it's my understanding that we should aim for the longest
> possible match. If I stop at the first accepting state, I will miss
> potentially longer matches.
>
> I hit an interesting case when trying out finding C-comments. Owens mentions
> that with complement you can find non-nested C-comments with:
>
> /\* ~(.* \*/ .*) \*/
>
> and this works for me. However, I wasn't able to find an RE _without_ using
> complement that exhibits identical behavior because of the inherent greedy
> nature of REs that make it scan through successive non-overlapping comments in
> the text until the final closing '*/'.
>
> There was an interesting stackoverflow discussion[3] on this with links to
> explanatory regex interpreters, and when I tried:
>
> /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
>
> for me it continues through the input until it finds the _last_ '*/'.
>
> I also tried constructing my own versions and came up with similar results.
>
> Questions:
>
> 1) Am I applying the RE/DFA search algorithm correctly?
>
> 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using
> an RE->DFA implementation? A quick search made it sound like only backtracking
> implementations can implement this behavior?
>
> Thanks,
>
> -Clint
>
>
> [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal
> of the ACM, 11(4), 481-494.
>
> [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives
> re-examined. Journal of Functional Programming 19 (2), 173-190
>
> [3]
> https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment

Back to comp.compilers | Previous | NextPrevious in thread | Next in thread | Find similar


Thread

Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-04 01:37 -0800
  Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-07 11:53 -0800
    Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-07 12:18 -0800
    Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-08 22:53 -0800
      Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-10 00:57 -0800
        Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-11 13:52 -0700
          Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-12 14:00 -0700
            Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-13 11:30 -0700
              Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-17 16:52 -0700
              Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-18 19:23 +0000
              Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-20 17:23 +0000
                Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-22 17:46 +0000
          Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-12 15:46 -0700
          Re: Regular expression string searching & matching Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2018-03-13 10:53 +0100
            Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-13 14:23 -0700
      Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-10 03:17 -0800

csiph-web