Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #1973 > unrolled thread
| Started by | Clint O <clint.olsen@gmail.com> |
|---|---|
| First post | 2018-03-04 01:37 -0800 |
| Last post | 2018-03-10 03:17 -0800 |
| Articles | 16 — 3 participants |
Back to article view | Back to comp.compilers
Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-04 01:37 -0800
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-07 11:53 -0800
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-07 12:18 -0800
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-08 22:53 -0800
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-10 00:57 -0800
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-11 13:52 -0700
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-12 14:00 -0700
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-13 11:30 -0700
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-17 16:52 -0700
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-18 19:23 +0000
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-20 17:23 +0000
Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-22 17:46 +0000
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-12 15:46 -0700
Re: Regular expression string searching & matching Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2018-03-13 10:53 +0100
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-13 14:23 -0700
Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-10 03:17 -0800
| From | Clint O <clint.olsen@gmail.com> |
|---|---|
| Date | 2018-03-04 01:37 -0800 |
| Subject | Regular expression string searching & matching |
| Message-ID | <18-03-016@comp.compilers> |
Hi: I'm working on an implementation of Brzozowski derivatives[1] to translate regular expressions into a DFAs for potential use as a RE debugger and programming tool. I've been using the paper "Regular-expression derivatives reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation. One of the interesting things about using derivatives is that it allows for some elegant RE set notation. So, rather than just the union - r: a | b # a OR b we also have set intersection and complement as well. I've got things working pretty well, and I'm trying to test things out. Unfortunately, I haven't seen much discussion on how string search works, except for what little I know about the behavior of flex and grep. Match is pretty straightforward: You begin in the starting state q0 and read the input and follow the state transitions. If you get to the end of the string without hitting the error state and you are in an accepting state (Brzozowski refers to states as "nullable"), then this string is accepted by this RE. Search is a different beast: Here we have a string of input and we are interested in knowing if any substring in the input matches our RE (like grep). The naive approach I have taken has been to start the DFA at every possible position in the string, and start simulating the DFA. If I hit any accepting states, I record the start, stop offsets. If I hit the error state or end of string, I move to the next starting position. I then merge all of the intervals at the end to cover the cases where I hit an accepting state multiple times for a given start point. Note: I assume this is the faithful way to do it because it's my understanding that we should aim for the longest possible match. If I stop at the first accepting state, I will miss potentially longer matches. I hit an interesting case when trying out finding C-comments. Owens mentions that with complement you can find non-nested C-comments with: /\* ~(.* \*/ .*) \*/ and this works for me. However, I wasn't able to find an RE _without_ using complement that exhibits identical behavior because of the inherent greedy nature of REs that make it scan through successive non-overlapping comments in the text until the final closing '*/'. There was an interesting stackoverflow discussion[3] on this with links to explanatory regex interpreters, and when I tried: /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ for me it continues through the input until it finds the _last_ '*/'. I also tried constructing my own versions and came up with similar results. Questions: 1) Am I applying the RE/DFA search algorithm correctly? 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using an RE->DFA implementation? A quick search made it sound like only backtracking implementations can implement this behavior? Thanks, -Clint [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal of the ACM, 11(4), 481-494. [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives re-examined. Journal of Functional Programming 19 (2), 173-190 [3] https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment
[toc] | [next] | [standalone]
| From | Ben Hanson <jamin.hanson@googlemail.com> |
|---|---|
| Date | 2018-03-07 11:53 -0800 |
| Message-ID | <18-03-032@comp.compilers> |
| In reply to | #1973 |
[/][*]([^*]|[*]+[^*/])*[*]+[/] is what you are looking for. I ran into this when developing my lexer generator library lexertl in C++. Having a debug::dump() function really helped me grok what was going on. The trick of course is realising that you have to exclude the characters that follow (i.e. the [^*/] part). That is the bit that clobbers the greedy behaviour. I've had to remind myself of that on more than one occasion recently! Hope that helps. Regards, Ben On Sunday, 4 March 2018 14:41:17 UTC, Clint O wrote: > Hi: > > I'm working on an implementation of Brzozowski derivatives[1] to translate > regular expressions into a DFAs for potential use as a RE debugger and > programming tool. I've been using the paper "Regular-expression derivatives > reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation. > > One of the interesting things about using derivatives is that it allows for > some elegant RE set notation. So, rather than just the union - > > r: a | b # a OR b > > we also have set intersection and complement as well. > > I've got things working pretty well, and I'm trying to test things out. > Unfortunately, I haven't seen much discussion on how string search works, > except for what little I know about the behavior of flex and grep. > > Match is pretty straightforward: You begin in the starting state q0 and read > the input and follow the state transitions. If you get to the end of the > string without hitting the error state and you are in an accepting state > (Brzozowski refers to states as "nullable"), then this string is accepted by > this RE. > > Search is a different beast: Here we have a string of input and we are > interested in knowing if any substring in the input matches our RE (like > grep). > > The naive approach I have taken has been to start the DFA at every possible > position in the string, and start simulating the DFA. If I hit any accepting > states, I record the start, stop offsets. If I hit the error state or end of > string, I move to the next starting position. I then merge all of the > intervals at the end to cover the cases where I hit an accepting state > multiple times for a given start point. Note: I assume this is the faithful > way to do it because it's my understanding that we should aim for the longest > possible match. If I stop at the first accepting state, I will miss > potentially longer matches. > > I hit an interesting case when trying out finding C-comments. Owens mentions > that with complement you can find non-nested C-comments with: > > /\* ~(.* \*/ .*) \*/ > > and this works for me. However, I wasn't able to find an RE _without_ using > complement that exhibits identical behavior because of the inherent greedy > nature of REs that make it scan through successive non-overlapping comments in > the text until the final closing '*/'. > > There was an interesting stackoverflow discussion[3] on this with links to > explanatory regex interpreters, and when I tried: > > /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ > > for me it continues through the input until it finds the _last_ '*/'. > > I also tried constructing my own versions and came up with similar results. > > Questions: > > 1) Am I applying the RE/DFA search algorithm correctly? > > 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using > an RE->DFA implementation? A quick search made it sound like only backtracking > implementations can implement this behavior? > > Thanks, > > -Clint > > > [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal > of the ACM, 11(4), 481-494. > > [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives > re-examined. Journal of Functional Programming 19 (2), 173-190 > > [3] > https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment
[toc] | [prev] | [next] | [standalone]
| From | Ben Hanson <jamin.hanson@googlemail.com> |
|---|---|
| Date | 2018-03-07 12:18 -0800 |
| Message-ID | <18-03-033@comp.compilers> |
| In reply to | #1982 |
I missed your question about non-greedy repeats. Yes, it is possible. See build_dfa() in generator.hpp from lexertl. Basically non-greedy transitions are snipped when building the dfa. I build a regex syntax tree as suggested in the Dragon Book and I keep track of greedy flags in the tree and that is passed down to partition/equivset.hpp and from there to the generator. The thing you have to careful about is respecting that the left side takes priority (i.e. the regex or sub-regex that came first). Regards, Ben
[toc] | [prev] | [next] | [standalone]
| From | Clint O <clint.olsen@gmail.com> |
|---|---|
| Date | 2018-03-08 22:53 -0800 |
| Message-ID | <18-03-034@comp.compilers> |
| In reply to | #1982 |
Hi Ben: Thanks for your post. I did try your regular expression (and a few small variations on it), but it exhibits the same behavior as the others I have tried. The difference with the complement version is that the accepting state I end up with has all transitions to the error state (which guarantees termination after match) where as these seem to still accept characters even after matching the closing '*/'. It's possible I have a bug in my implementation, so I'm still looking at it. Thanks, -Clint On Wednesday, March 7, 2018 at 11:59:10 AM UTC-8, Ben Hanson wrote: > [/][*]([^*]|[*]+[^*/])*[*]+[/] > > is what you are looking for. I ran into this when developing my lexer > generator library lexertl in C++. Having a debug::dump() function > really helped me grok what was going on. > > The trick of course is realising that you have to exclude the > characters that follow (i.e. the [^*/] part). That is the bit that > clobbers the greedy behaviour. I've had to remind myself of that on > more than one occasion recently! [This should work, it's a standard example in compiler texts. -John]
[toc] | [prev] | [next] | [standalone]
| From | Clint O <clint.olsen@gmail.com> |
|---|---|
| Date | 2018-03-10 00:57 -0800 |
| Message-ID | <18-03-035@comp.compilers> |
| In reply to | #1984 |
On Friday, March 9, 2018 at 6:47:09 AM UTC-8, Clint O wrote:
> On Wednesday, March 7, 2018 at 11:59:10 AM UTC-8, Ben Hanson wrote:
> > [/][*]([^*]|[*]+[^*/])*[*]+[/]
>
> [This should work, it's a standard example in compiler texts. -John]
/This/ actually worked for me (one character change):
[/][*]([^*]|[*]+[^/])*[*]+[/]
^
Thanks for looking at this!
-Clint
[toc] | [prev] | [next] | [standalone]
| From | Ben Hanson <jamin.hanson@googlemail.com> |
|---|---|
| Date | 2018-03-11 13:52 -0700 |
| Message-ID | <18-03-041@comp.compilers> |
| In reply to | #1985 |
> /This/ actually worked for me (one character change): > > [/][*]([^*]|[*]+[^/])*[*]+[/] Your modified regex produces the following state machine: State: 0 [/] -> 1 State: 1 [*] -> 2 State: 2 [^*] -> 2 [*] -> 3 State: 3 [^*/] -> 2 [*] -> 4 [/] -> 5 State: 4 [^*/] -> 2 [*] -> 4 [/] -> 6 State: 5 END STATE State: 6 END STATE [^*] -> 2 [*] -> 3 Which will match /***/a*/ in its entirety, when if should only match /***/ Regards, Ben [Doesn't that depend on whether you interpret the END STATE in state 6 to stop even if there's more input? -John]
[toc] | [prev] | [next] | [standalone]
| From | Clint O <clint.olsen@gmail.com> |
|---|---|
| Date | 2018-03-12 14:00 -0700 |
| Message-ID | <18-03-045@comp.compilers> |
| In reply to | #1990 |
On Monday, March 12, 2018 at 1:19:29 PM UTC-7, Ben Hanson wrote: > > /This/ actually worked for me (one character change): > > > > [/][*]([^*]|[*]+[^/])*[*]+[/] > > Your modified regex produces the following state machine: > [snip] > > Which will match > > /***/a*/ > > in its entirety, when if should only match > > /***/ > > Regards, > > Ben > [Doesn't that depend on whether you interpret the END STATE in state 6 to stop even > if there's more input? -John] Interesting. I'm not seeing this behavior with the sample input you've provided. Again, I'm willing to concede that I have a bug :) What I'm doing is simulating the DFA until I get to an error state or I hit EOF. So, this guarantees I'll record the longest match I've found. I could post the states that I come up with, but my state dumper also prints out the RE it's currently processing (the actual expression). The successive computation of derivatives can sometimes produce some rather abhorrent output, and it's not always obvious (to me) what's going on. I'll work on a cleaner presentation and try to post this. It also looks like you are running a DFA minimizer (like Hopcroft) on your result since I am not producing a minimal DFA. That also may help me figure out if I'm producing the right automaton because they'd match... Thanks, -Clint
[toc] | [prev] | [next] | [standalone]
| From | Ben Hanson <jamin.hanson@googlemail.com> |
|---|---|
| Date | 2018-03-13 11:30 -0700 |
| Message-ID | <18-03-054@comp.compilers> |
| In reply to | #1993 |
> It also looks like you are running a DFA minimizer (like Hopcroft) on your > result since I am not producing a minimal DFA. That also may help me figure > out if I'm producing the right automaton because they'd match... Actually, although I have a minimising routine I'm not applying it here. I intersect every character class in the regex to produce non-overlapping sets and then these are grouped in equivalence sets for each state in the DFA. This is the part that I didn't find in the Dragon Book (or any other compiler book I looked in) and it was a message on this group from Vern Paxon that tipped me off: https://compilers.iecc.com/comparch/article/94-10-091 (I only paid attention to the equivalence classes bit by the way!) I intersect the equivalence sets as I build the DFA and I have found with this technique you kind of have to go out of your way to get a non minimal DFA by default. I've imagined an even more rigorous approach where character classes are intersected as they are added to the syntax tree up front, but I've not got around to trying to implement it. I'd be interested to see your DFA output once you've cleaned it up a bit. Using derivatives is interesting too - I've seen it discussed but never tried to get into that. Regards, Ben
[toc] | [prev] | [next] | [standalone]
| From | Clint O <clint.olsen@gmail.com> |
|---|---|
| Date | 2018-03-17 16:52 -0700 |
| Message-ID | <18-03-073@comp.compilers> |
| In reply to | #2000 |
On Tuesday, March 13, 2018 at 12:36:32 PM UTC-7, Ben Hanson wrote:
> I'd be interested to see your DFA output once you've cleaned it up a bit.
> Using derivatives is interesting too - I've seen it discussed but never
tried
> to get into that.
OK, I'm not sure how readable this will be, but here goes:
<RegexConcat [13]=0x1052672b0 0x10525acf8 0x10525aa58 'B7'>
+---<RegexConcat [11]=0x10525acf8 0x105267198 0x10525ab00 'B7'>
| +---<RegexConcat [10]=0x105267198 0x10525a978 0x10525ad68 'B7'>
| | +---<RegexConcat [2]=0x10525a978 0x10525aa58 0x10525ab70 'B7'>
| | | +---<RegexSym [0]=0x10525aa58 '/'>
| | | `---<RegexSym [1]=0x10525ab70 '*'>
| | `---<RegexStar [9]=0x10525ad68 0x10525af98 '*'>
| | `---<RegexOr [8]=0x10525af98 0x10525add8 0x10525ae80 '|'>
| | +---<RegexSym [4]=0x10525add8 '^*'>
| | `---<RegexConcat [7]=0x10525ae80 0x10525ab00 0x105267048
'B7'>
| | +---<RegexPlus [5]=0x10525ab00 0x10525ab70 '+'>
| | | `---<RegexSym [1]=0x10525ab70 '*'>
| | `---<RegexSym [6]=0x105267048 '^/'>
| `---<RegexPlus [5]=0x10525ab00 0x10525ab70 '+'>
| `---<RegexSym [1]=0x10525ab70 '*'>
`---<RegexSym [0]=0x10525aa58 '/'>
q0: /B7[*]B7([^*] | [*]+B7[^/])*B7[*]+B7/
[/] q2
['\x00'-.0-C?] q1
q1: b
['\x00'-C?] q1
q2: [*]B7([^*] | [*]+B7[^/])*B7[*]+B7/
[*] q3
['\x00'-)+-C?] q1
q3: ([^*] | [*]+B7[^/])*B7[*]+B7/
[*] q4
['\x00'-)+-C?] q3
q4: ([*]*B7[^/]B7([^*] | [*]+B7[^/])*B7[*]+ | [*]*)B7/
[*] q6
['\x00'-)+-.0-C?] q3
[/] q5
q5: N5
['\x00'-C?] q1
q6: (([*]*B7[^/] | N5)B7([^*] | [*]+B7[^/])*B7[*]+ | [*]*)B7/
[*] q8
['\x00'-)+-.0-C?] q3
[/] q7
q7: ([^*] | [*]+B7[^/])*B7[*]+B7/ | N5
[*] q4
['\x00'-)+-C?] q3
q8: ((([*]*B7[^/] | N5)B7([^*] | [*]+B7[^/])* | [*]*B7[^/]B7([^*] |
[*]+B7[^/])*)B7[*]+ | [*]*)B7/
[*] q8
['\x00'-)+-.0-C?] q3
[/] q7
Total DFA states: 9
Total RE instances: 35
I consolidated character classes, but I didn't do any nice reduction such as
pruning transitions to the error state.
Thanks,
-Clint
[toc] | [prev] | [next] | [standalone]
| From | Clint O <clint.olsen@gmail.com> |
|---|---|
| Date | 2018-03-18 19:23 +0000 |
| Message-ID | <18-03-074@comp.compilers> |
| In reply to | #2000 |
On 2018-03-13, Ben Hanson <jamin.hanson@googlemail.com> wrote:
> I'd be interested to see your DFA output once you've cleaned it up a bit.
> Using derivatives is interesting too - I've seen it discussed but never tried
> to get into that.
Let's try this again:
<RegexConcat [13]=0x10b32f2b0 0x10b322cf8 0x10b322a58 'B7'>
+---<RegexConcat [11]=0x10b322cf8 0x10b32f198 0x10b322b00 'B7'>
| +---<RegexConcat [10]=0x10b32f198 0x10b322978 0x10b322d68 'B7'>
| | +---<RegexConcat [2]=0x10b322978 0x10b322a58 0x10b322b70 'B7'>
| | | +---<RegexSym [0]=0x10b322a58 '/'>
| | | `---<RegexSym [1]=0x10b322b70 '*'>
| | `---<RegexStar [9]=0x10b322d68 0x10b322f98 '*'>
| | `---<RegexOr [8]=0x10b322f98 0x10b322dd8 0x10b322e80 '|'>
| | +---<RegexSym [4]=0x10b322dd8 '^*'>
| | `---<RegexConcat [7]=0x10b322e80 0x10b322b00 0x10b32f048 'B7'>
| | +---<RegexPlus [5]=0x10b322b00 0x10b322b70 '+'>
| | | `---<RegexSym [1]=0x10b322b70 '*'>
| | `---<RegexSym [6]=0x10b32f048 '^/'>
| `---<RegexPlus [5]=0x10b322b00 0x10b322b70 '+'>
| `---<RegexSym [1]=0x10b322b70 '*'>
`---<RegexSym [0]=0x10b322a58 '/'>
q0: /B7[*]B7([^*] | [*]+B7[^/])*B7[*]+B7/
[/] q2
['\x00'-.0-C?] q1
q1: b
['\x00'-C?] q1
q2: [*]B7([^*] | [*]+B7[^/])*B7[*]+B7/
[*] q3
['\x00'-)+-C?] q1
q3: ([^*] | [*]+B7[^/])*B7[*]+B7/
[*] q4
['\x00'-)+-C?] q3
q4: ([*]*B7[^/]B7([^*] | [*]+B7[^/])*B7[*]+ | [*]*)B7/
[*] q6
['\x00'-)+-.0-C?] q3
[/] q5
q5: N5
['\x00'-C?] q1
q6: (([*]*B7[^/] | N5)B7([^*] | [*]+B7[^/])*B7[*]+ | [*]*)B7/
[*] q8
['\x00'-)+-.0-C?] q3
[/] q7
q7: ([^*] | [*]+B7[^/])*B7[*]+B7/ | N5
[*] q4
['\x00'-)+-C?] q3
q8: ((([*]*B7[^/] | N5)B7([^*] | [*]+B7[^/])* | [*]*B7[^/]B7([^*] | [*]+B7[^/])*)B7[*]+ | [*]*)B7/
[*] q8
['\x00'-)+-.0-C?] q3
[/] q7
Total DFA states: 9
Total RE instances: 35
So, q5, q7 are accepting states because they contain epsilon (empty string).
Thanks,
-Clint
[toc] | [prev] | [next] | [standalone]
| From | Clint O <clint.olsen@gmail.com> |
|---|---|
| Date | 2018-03-20 17:23 +0000 |
| Message-ID | <18-03-087@comp.compilers> |
| In reply to | #2000 |
[ reposted to try to make the special characters look right ]
/·[*]·([^*] | [*]+·[^/])*·[*]+·/
<RegexConcat [13]=0x10ccc6f60 0x10ccc6f98 0x10ccc67f0 '·'>
+---<RegexConcat [11]=0x10ccc6f98 0x10ccc6ef0 0x10ccc6898 '·'>
| +---<RegexConcat [10]=0x10ccc6ef0 0x10ccc6710 0x10ccc6b00 '·'>
| | +---<RegexConcat [2]=0x10ccc6710 0x10ccc67f0 0x10ccc6908 '·'>
| | | +---<RegexSym [0]=0x10ccc67f0 '/'>
| | | `---<RegexSym [1]=0x10ccc6908 '*'>
| | `---<RegexStar [9]=0x10ccc6b00 0x10ccc6c88 '*'>
| | `---<RegexOr [8]=0x10ccc6c88 0x10ccc6b70 0x10ccc6c18 '|'>
| | +---<RegexSym [4]=0x10ccc6b70 '^*'>
| | `---<RegexConcat [7]=0x10ccc6c18 0x10ccc6898 0x10ccc6da0 '·'>
| | +---<RegexPlus [5]=0x10ccc6898 0x10ccc6908 '+'>
| | | `---<RegexSym [1]=0x10ccc6908 '*'>
| | `---<RegexSym [6]=0x10ccc6da0 '^/'>
| `---<RegexPlus [5]=0x10ccc6898 0x10ccc6908 '+'>
| `---<RegexSym [1]=0x10ccc6908 '*'>
`---<RegexSym [0]=0x10ccc67f0 '/'>
q0: /·[*]·([^*] | [*]+·[^/])*·[*]+·/
[/] q2
['\x00'-.0-ÿ] q1
q1: ∅
['\x00'-ÿ] q1
q2: [*]·([^*] | [*]+·[^/])*·[*]+·/
[*] q3
['\x00'-)+-ÿ] q1
q3: ([^*] | [*]+·[^/])*·[*]+·/
[*] q4
['\x00'-)+-ÿ] q3
q4: ([*]*·[^/]·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/
[*] q6
['\x00'-)+-.0-ÿ] q3
[/] q5
q5: ε
['\x00'-ÿ] q1
q6: (([*]*·[^/] | ε)·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/
[*] q8
['\x00'-)+-.0-ÿ] q3
[/] q7
q7: ([^*] | [*]+·[^/])*·[*]+·/ | ε
[*] q4
['\x00'-)+-ÿ] q3
q8: ((([*]*·[^/] | ε)·([^*] | [*]+·[^/])* | [*]*·[^/]·([^*] | [*]+·[^/])*)·[*]+ | [*]*)·/
[*] q8
['\x00'-)+-.0-ÿ] q3
[/] q7
Total DFA states: 9
Total RE instances: 35
[toc] | [prev] | [next] | [standalone]
| From | Clint O <clint.olsen@gmail.com> |
|---|---|
| Date | 2018-03-22 17:46 +0000 |
| Message-ID | <18-03-089@comp.compilers> |
| In reply to | #2019 |
On 2018-03-20, Clint O <clint.olsen@gmail.com> wrote: > [ reposted to try to make the special characters look right ] > > q0: /·[*]·([^*] | [*]+·[^/])*·[*]+·/ > [/] q2 > ['\x00'-.0-ÿ] q1 > q1: ∅ > ['\x00'-ÿ] q1 > q2: [*]·([^*] | [*]+·[^/])*·[*]+·/ > [*] q3 > ['\x00'-)+-ÿ] q1 > q3: ([^*] | [*]+·[^/])*·[*]+·/ > [*] q4 > ['\x00'-)+-ÿ] q3 > q4: ([*]*·[^/]·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/ > [*] q6 > ['\x00'-)+-.0-ÿ] q3 > [/] q5 > q5: ε > ['\x00'-ÿ] q1 > q6: (([*]*·[^/] | ε)·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/ > [*] q8 > ['\x00'-)+-.0-ÿ] q3 > [/] q7 > q7: ([^*] | [*]+·[^/])*·[*]+·/ | ε > [*] q4 > ['\x00'-)+-ÿ] q3 > q8: ((([*]*·[^/] | ε)·([^*] | [*]+·[^/])* | [*]*·[^/]·([^*] | [*]+·[^/])*)·[*]+ | [*]*)·/ > [*] q8 > ['\x00'-)+-.0-ÿ] q3 > [/] q7 Thanks John for reposting this. It looks much better now. In summary: q5,q7 are accepting states since they contain epsilon. q2 represents the error state. The key to success with this algorithm is recognizing previously calculated derivatives/expressions. When you no longer calculate unique derivatives, the DFA construction terminates. As you can see the expressions can get hairy pretty quickly. I don't know if you can glean much from the successive expressions generated. It's akin to the method of walking the parse tree but it bypasses the NFA construction entirely. Thanks, -Clint
[toc] | [prev] | [next] | [standalone]
| From | Ben Hanson <jamin.hanson@googlemail.com> |
|---|---|
| Date | 2018-03-12 15:46 -0700 |
| Message-ID | <18-03-046@comp.compilers> |
| In reply to | #1990 |
> [Doesn't that depend on whether you interpret the END STATE in state 6 to stop even > if there's more input? -John] That is why we have the "leftmost longest" rule. Try it out with any regex engine (including flex). Regards, Ben
[toc] | [prev] | [next] | [standalone]
| From | Hans-Peter Diettrich <DrDiettrich1@netscape.net> |
|---|---|
| Date | 2018-03-13 10:53 +0100 |
| Message-ID | <18-03-050@comp.compilers> |
| In reply to | #1990 |
Am 11.03.2018 um 21:52 schrieb Ben Hanson: >> /This/ actually worked for me (one character change): >> >> [/][*]([^*]|[*]+[^/])*[*]+[/] > > Your modified regex produces the following state machine: > > State: 0 > [/] -> 1 > > State: 1 > [*] -> 2 > > State: 2 > [^*] -> 2 > [*] -> 3 I'm just curious about the notation. What happens if neither pattern matches? What's the purpose of [^*] in state 2, as opposed to state 1? DoDi
[toc] | [prev] | [next] | [standalone]
| From | Ben Hanson <jamin.hanson@googlemail.com> |
|---|---|
| Date | 2018-03-13 14:23 -0700 |
| Message-ID | <18-03-059@comp.compilers> |
| In reply to | #1997 |
On Tuesday, 13 March 2018 19:33:53 UTC, Hans-Peter Diettrich wrote: > > State: 0 > > [/] -> 1 > > > > State: 1 > > [*] -> 2 > > > > State: 2 > > [^*] -> 2 > > [*] -> 3 > > I'm just curious about the notation. > What happens if neither pattern matches? > What's the purpose of [^*] in state 2, as opposed to state 1? In state 2 you are either looping on [^*] (anything other than '*') or changing state to state 3 on '*'. I'm not sure I understand your question. Regards, Ben
[toc] | [prev] | [next] | [standalone]
| From | Ben Hanson <jamin.hanson@googlemail.com> |
|---|---|
| Date | 2018-03-10 03:17 -0800 |
| Message-ID | <18-03-036@comp.compilers> |
| In reply to | #1984 |
Here's the dump of the state machine for that regex using lexertl: State: 0 [/] -> 1 State: 1 [*] -> 2 State: 2 [^*] -> 2 [*] -> 3 State: 3 [^*/] -> 2 [*] -> 3 [/] -> 4 State: 4 END STATE Regards, Ben
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web