Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #1982
| Path | csiph.com!3.us.feeder.erje.net!feeder.erje.net!news.linkpendium.com!news.linkpendium.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end |
|---|---|
| From | Ben Hanson <jamin.hanson@googlemail.com> |
| Newsgroups | comp.compilers |
| Subject | Re: Regular expression string searching & matching |
| Date | Wed, 7 Mar 2018 11:53:15 -0800 (PST) |
| Organization | Compilers Central |
| Lines | 97 |
| Sender | news@iecc.com |
| Approved | comp.compilers@iecc.com |
| Message-ID | <18-03-032@comp.compilers> (permalink) |
| References | <18-03-016@comp.compilers> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset="UTF-8" |
| Injection-Info | gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="18151"; mail-complaints-to="abuse@iecc.com" |
| Keywords | lex |
| Posted-Date | 07 Mar 2018 14:59:08 EST |
| X-submission-address | compilers@iecc.com |
| X-moderator-address | compilers-request@iecc.com |
| X-FAQ-and-archives | http://compilers.iecc.com |
| Xref | csiph.com comp.compilers:1982 |
Show key headers only | View raw
[/][*]([^*]|[*]+[^*/])*[*]+[/] is what you are looking for. I ran into this when developing my lexer generator library lexertl in C++. Having a debug::dump() function really helped me grok what was going on. The trick of course is realising that you have to exclude the characters that follow (i.e. the [^*/] part). That is the bit that clobbers the greedy behaviour. I've had to remind myself of that on more than one occasion recently! Hope that helps. Regards, Ben On Sunday, 4 March 2018 14:41:17 UTC, Clint O wrote: > Hi: > > I'm working on an implementation of Brzozowski derivatives[1] to translate > regular expressions into a DFAs for potential use as a RE debugger and > programming tool. I've been using the paper "Regular-expression derivatives > reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation. > > One of the interesting things about using derivatives is that it allows for > some elegant RE set notation. So, rather than just the union - > > r: a | b # a OR b > > we also have set intersection and complement as well. > > I've got things working pretty well, and I'm trying to test things out. > Unfortunately, I haven't seen much discussion on how string search works, > except for what little I know about the behavior of flex and grep. > > Match is pretty straightforward: You begin in the starting state q0 and read > the input and follow the state transitions. If you get to the end of the > string without hitting the error state and you are in an accepting state > (Brzozowski refers to states as "nullable"), then this string is accepted by > this RE. > > Search is a different beast: Here we have a string of input and we are > interested in knowing if any substring in the input matches our RE (like > grep). > > The naive approach I have taken has been to start the DFA at every possible > position in the string, and start simulating the DFA. If I hit any accepting > states, I record the start, stop offsets. If I hit the error state or end of > string, I move to the next starting position. I then merge all of the > intervals at the end to cover the cases where I hit an accepting state > multiple times for a given start point. Note: I assume this is the faithful > way to do it because it's my understanding that we should aim for the longest > possible match. If I stop at the first accepting state, I will miss > potentially longer matches. > > I hit an interesting case when trying out finding C-comments. Owens mentions > that with complement you can find non-nested C-comments with: > > /\* ~(.* \*/ .*) \*/ > > and this works for me. However, I wasn't able to find an RE _without_ using > complement that exhibits identical behavior because of the inherent greedy > nature of REs that make it scan through successive non-overlapping comments in > the text until the final closing '*/'. > > There was an interesting stackoverflow discussion[3] on this with links to > explanatory regex interpreters, and when I tried: > > /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ > > for me it continues through the input until it finds the _last_ '*/'. > > I also tried constructing my own versions and came up with similar results. > > Questions: > > 1) Am I applying the RE/DFA search algorithm correctly? > > 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using > an RE->DFA implementation? A quick search made it sound like only backtracking > implementations can implement this behavior? > > Thanks, > > -Clint > > > [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal > of the ACM, 11(4), 481-494. > > [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives > re-examined. Journal of Functional Programming 19 (2), 173-190 > > [3] > https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment
Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar
Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-04 01:37 -0800
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-07 11:53 -0800
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-07 12:18 -0800
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-08 22:53 -0800
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-10 00:57 -0800
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-11 13:52 -0700
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-12 14:00 -0700
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-13 11:30 -0700
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-17 16:52 -0700
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-18 19:23 +0000
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-20 17:23 +0000
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-22 17:46 +0000
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-12 15:46 -0700
Re: Regular expression string searching & matching Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2018-03-13 10:53 +0100
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-13 14:23 -0700
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-10 03:17 -0800
csiph-web