Path: csiph.com!3.us.feeder.erje.net!feeder.erje.net!news.linkpendium.com!news.linkpendium.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Ben Hanson Newsgroups: comp.compilers Subject: Re: Regular expression string searching & matching Date: Wed, 7 Mar 2018 11:53:15 -0800 (PST) Organization: Compilers Central Lines: 97 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <18-03-032@comp.compilers> References: <18-03-016@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="18151"; mail-complaints-to="abuse@iecc.com" Keywords: lex Posted-Date: 07 Mar 2018 14:59:08 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:1982 [/][*]([^*]|[*]+[^*/])*[*]+[/] is what you are looking for. I ran into this when developing my lexer generator library lexertl in C++. Having a debug::dump() function really helped me grok what was going on. The trick of course is realising that you have to exclude the characters that follow (i.e. the [^*/] part). That is the bit that clobbers the greedy behaviour. I've had to remind myself of that on more than one occasion recently! Hope that helps. Regards, Ben On Sunday, 4 March 2018 14:41:17 UTC, Clint O wrote: > Hi: > > I'm working on an implementation of Brzozowski derivatives[1] to translate > regular expressions into a DFAs for potential use as a RE debugger and > programming tool. I've been using the paper "Regular-expression derivatives > reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation. > > One of the interesting things about using derivatives is that it allows for > some elegant RE set notation. So, rather than just the union - > > r: a | b # a OR b > > we also have set intersection and complement as well. > > I've got things working pretty well, and I'm trying to test things out. > Unfortunately, I haven't seen much discussion on how string search works, > except for what little I know about the behavior of flex and grep. > > Match is pretty straightforward: You begin in the starting state q0 and read > the input and follow the state transitions. If you get to the end of the > string without hitting the error state and you are in an accepting state > (Brzozowski refers to states as "nullable"), then this string is accepted by > this RE. > > Search is a different beast: Here we have a string of input and we are > interested in knowing if any substring in the input matches our RE (like > grep). > > The naive approach I have taken has been to start the DFA at every possible > position in the string, and start simulating the DFA. If I hit any accepting > states, I record the start, stop offsets. If I hit the error state or end of > string, I move to the next starting position. I then merge all of the > intervals at the end to cover the cases where I hit an accepting state > multiple times for a given start point. Note: I assume this is the faithful > way to do it because it's my understanding that we should aim for the longest > possible match. If I stop at the first accepting state, I will miss > potentially longer matches. > > I hit an interesting case when trying out finding C-comments. Owens mentions > that with complement you can find non-nested C-comments with: > > /\* ~(.* \*/ .*) \*/ > > and this works for me. However, I wasn't able to find an RE _without_ using > complement that exhibits identical behavior because of the inherent greedy > nature of REs that make it scan through successive non-overlapping comments in > the text until the final closing '*/'. > > There was an interesting stackoverflow discussion[3] on this with links to > explanatory regex interpreters, and when I tried: > > /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ > > for me it continues through the input until it finds the _last_ '*/'. > > I also tried constructing my own versions and came up with similar results. > > Questions: > > 1) Am I applying the RE/DFA search algorithm correctly? > > 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using > an RE->DFA implementation? A quick search made it sound like only backtracking > implementations can implement this behavior? > > Thanks, > > -Clint > > > [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal > of the ACM, 11(4), 481-494. > > [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives > re-examined. Journal of Functional Programming 19 (2), 173-190 > > [3] > https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment