Path: csiph.com!xmission!news.snarked.org!news.linkpendium.com!news.linkpendium.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Clint O Newsgroups: comp.compilers Subject: Regular expression string searching & matching Date: Sun, 4 Mar 2018 01:37:02 -0800 (PST) Organization: Compilers Central Lines: 78 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <18-03-016@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="67578"; mail-complaints-to="abuse@iecc.com" Keywords: lex, DFA, question Posted-Date: 04 Mar 2018 09:41:14 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:1973 Hi: I'm working on an implementation of Brzozowski derivatives[1] to translate regular expressions into a DFAs for potential use as a RE debugger and programming tool. I've been using the paper "Regular-expression derivatives reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation. One of the interesting things about using derivatives is that it allows for some elegant RE set notation. So, rather than just the union - r: a | b # a OR b we also have set intersection and complement as well. I've got things working pretty well, and I'm trying to test things out. Unfortunately, I haven't seen much discussion on how string search works, except for what little I know about the behavior of flex and grep. Match is pretty straightforward: You begin in the starting state q0 and read the input and follow the state transitions. If you get to the end of the string without hitting the error state and you are in an accepting state (Brzozowski refers to states as "nullable"), then this string is accepted by this RE. Search is a different beast: Here we have a string of input and we are interested in knowing if any substring in the input matches our RE (like grep). The naive approach I have taken has been to start the DFA at every possible position in the string, and start simulating the DFA. If I hit any accepting states, I record the start, stop offsets. If I hit the error state or end of string, I move to the next starting position. I then merge all of the intervals at the end to cover the cases where I hit an accepting state multiple times for a given start point. Note: I assume this is the faithful way to do it because it's my understanding that we should aim for the longest possible match. If I stop at the first accepting state, I will miss potentially longer matches. I hit an interesting case when trying out finding C-comments. Owens mentions that with complement you can find non-nested C-comments with: /\* ~(.* \*/ .*) \*/ and this works for me. However, I wasn't able to find an RE _without_ using complement that exhibits identical behavior because of the inherent greedy nature of REs that make it scan through successive non-overlapping comments in the text until the final closing '*/'. There was an interesting stackoverflow discussion[3] on this with links to explanatory regex interpreters, and when I tried: /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ for me it continues through the input until it finds the _last_ '*/'. I also tried constructing my own versions and came up with similar results. Questions: 1) Am I applying the RE/DFA search algorithm correctly? 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using an RE->DFA implementation? A quick search made it sound like only backtracking implementations can implement this behavior? Thanks, -Clint [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal of the ACM, 11(4), 481-494. [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives re-examined. Journal of Functional Programming 19 (2), 173-190 [3] https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment