Groups > comp.compilers > #1973 > unrolled thread

Regular expression string searching & matching

Started by	Clint O <clint.olsen@gmail.com>
First post	2018-03-04 01:37 -0800
Last post	2018-03-10 03:17 -0800
Articles	16 — 3 participants

Back to article view | Back to comp.compilers

  Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-04 01:37 -0800
    Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-07 11:53 -0800
      Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-07 12:18 -0800
      Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-08 22:53 -0800
        Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-10 00:57 -0800
          Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-11 13:52 -0700
            Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-12 14:00 -0700
              Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-13 11:30 -0700
                Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-17 16:52 -0700
                Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-18 19:23 +0000
                Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-20 17:23 +0000
                  Re: Regular expression string searching & matching Clint O <clint.olsen@gmail.com> - 2018-03-22 17:46 +0000
            Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-12 15:46 -0700
            Re: Regular expression string searching & matching Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2018-03-13 10:53 +0100
              Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-13 14:23 -0700
        Re: Regular expression string searching & matching Ben Hanson <jamin.hanson@googlemail.com> - 2018-03-10 03:17 -0800

#1973 — Regular expression string searching & matching

From	Clint O <clint.olsen@gmail.com>
Date	2018-03-04 01:37 -0800
Subject	Regular expression string searching & matching
Message-ID	<18-03-016@comp.compilers>

Hi:

I'm working on an implementation of Brzozowski derivatives[1] to translate
regular expressions into a DFAs for potential use as a RE debugger and
programming tool. I've been using the paper "Regular-expression derivatives
reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation.

One of the interesting things about using derivatives is that it allows for
some elegant RE set notation. So, rather than just the union -

r: a | b  # a OR b

we also have set intersection and complement as well.

I've got things working pretty well, and I'm trying to test things out.
Unfortunately, I haven't seen much discussion on how string search works,
except for what little I know about the behavior of flex and grep.

Match is pretty straightforward: You begin in the starting state q0 and read
the input and follow the state transitions. If you get to the end of the
string without hitting the error state and you are in an accepting state
(Brzozowski refers to states as "nullable"), then this string is accepted by
this RE.

Search is a different beast: Here we have a string of input and we are
interested in knowing if any substring in the input matches our RE (like
grep).

The naive approach I have taken has been to start the DFA at every possible
position in the string, and start simulating the DFA. If I hit any accepting
states, I record the start, stop offsets. If I hit the error state or end of
string, I move to the next starting position. I then merge all of the
intervals at the end to cover the cases where I hit an accepting state
multiple times for a given start point. Note: I assume this is the faithful
way to do it because it's my understanding that we should aim for the longest
possible match. If I stop at the first accepting state, I will miss
potentially longer matches.

I hit an interesting case when trying out finding C-comments. Owens mentions
that with complement you can find non-nested C-comments with:

/\* ~(.* \*/ .*) \*/

and this works for me. However, I wasn't able to find an RE _without_ using
complement that exhibits identical behavior because of the inherent greedy
nature of REs that make it scan through successive non-overlapping comments in
the text until the final closing '*/'.

There was an interesting stackoverflow discussion[3] on this with links to
explanatory regex interpreters, and when I tried:

/\*[^*]*\*+(?:[^/*][^*]*\*+)*/

for me it continues through the input until it finds the _last_ '*/'.

I also tried constructing my own versions and came up with similar results.

Questions:

1) Am I applying the RE/DFA search algorithm correctly?

2) Is it possible to implement the non-greedy versions of '*', '+', '?' using
an RE->DFA implementation? A quick search made it sound like only backtracking
implementations can implement this behavior?

Thanks,

-Clint


[1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal
of the ACM, 11(4), 481-494.

[2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives
re-examined. Journal of Functional Programming 19 (2), 173-190

[3]
https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment

[toc] | [next] | [standalone]

#1982

From	Ben Hanson <jamin.hanson@googlemail.com>
Date	2018-03-07 11:53 -0800
Message-ID	<18-03-032@comp.compilers>
In reply to	#1973

[/][*]([^*]|[*]+[^*/])*[*]+[/]

is what you are looking for. I ran into this when developing my lexer
generator library lexertl in C++. Having a debug::dump() function
really helped me grok what was going on.

The trick of course is realising that you have to exclude the
characters that follow (i.e. the [^*/] part). That is the bit that
clobbers the greedy behaviour. I've had to remind myself of that on
more than one occasion recently!

Hope that helps.

Regards,

Ben

On Sunday, 4 March 2018 14:41:17 UTC, Clint O  wrote:
> Hi:
>
> I'm working on an implementation of Brzozowski derivatives[1] to translate
> regular expressions into a DFAs for potential use as a RE debugger and
> programming tool. I've been using the paper "Regular-expression derivatives
> reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation.
>
> One of the interesting things about using derivatives is that it allows for
> some elegant RE set notation. So, rather than just the union -
>
> r: a | b  # a OR b
>
> we also have set intersection and complement as well.
>
> I've got things working pretty well, and I'm trying to test things out.
> Unfortunately, I haven't seen much discussion on how string search works,
> except for what little I know about the behavior of flex and grep.
>
> Match is pretty straightforward: You begin in the starting state q0 and read
> the input and follow the state transitions. If you get to the end of the
> string without hitting the error state and you are in an accepting state
> (Brzozowski refers to states as "nullable"), then this string is accepted by
> this RE.
>
> Search is a different beast: Here we have a string of input and we are
> interested in knowing if any substring in the input matches our RE (like
> grep).
>
> The naive approach I have taken has been to start the DFA at every possible
> position in the string, and start simulating the DFA. If I hit any accepting
> states, I record the start, stop offsets. If I hit the error state or end of
> string, I move to the next starting position. I then merge all of the
> intervals at the end to cover the cases where I hit an accepting state
> multiple times for a given start point. Note: I assume this is the faithful
> way to do it because it's my understanding that we should aim for the longest
> possible match. If I stop at the first accepting state, I will miss
> potentially longer matches.
>
> I hit an interesting case when trying out finding C-comments. Owens mentions
> that with complement you can find non-nested C-comments with:
>
> /\* ~(.* \*/ .*) \*/
>
> and this works for me. However, I wasn't able to find an RE _without_ using
> complement that exhibits identical behavior because of the inherent greedy
> nature of REs that make it scan through successive non-overlapping comments in
> the text until the final closing '*/'.
>
> There was an interesting stackoverflow discussion[3] on this with links to
> explanatory regex interpreters, and when I tried:
>
> /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
>
> for me it continues through the input until it finds the _last_ '*/'.
>
> I also tried constructing my own versions and came up with similar results.
>
> Questions:
>
> 1) Am I applying the RE/DFA search algorithm correctly?
>
> 2) Is it possible to implement the non-greedy versions of '*', '+', '?' using
> an RE->DFA implementation? A quick search made it sound like only backtracking
> implementations can implement this behavior?
>
> Thanks,
>
> -Clint
>
>
> [1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal
> of the ACM, 11(4), 481-494.
>
> [2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives
> re-examined. Journal of Functional Programming 19 (2), 173-190
>
> [3]
> https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment

[toc] | [prev] | [next] | [standalone]

#1983

From	Ben Hanson <jamin.hanson@googlemail.com>
Date	2018-03-07 12:18 -0800
Message-ID	<18-03-033@comp.compilers>
In reply to	#1982

I missed your question about non-greedy repeats.

Yes, it is possible. See build_dfa() in generator.hpp from lexertl.

Basically non-greedy transitions are snipped when building the dfa. I build a
regex syntax tree as suggested in the Dragon Book and I keep track of greedy
flags in the tree and that is passed down to partition/equivset.hpp and from
there to the generator. The thing you have to careful about is respecting that
the left side takes priority (i.e. the regex or sub-regex that came first).

Regards,

Ben

[toc] | [prev] | [next] | [standalone]

#1984

From	Clint O <clint.olsen@gmail.com>
Date	2018-03-08 22:53 -0800
Message-ID	<18-03-034@comp.compilers>
In reply to	#1982

Hi Ben:

Thanks for your post. I did try your regular expression (and a few small
variations on it), but it exhibits the same behavior as the others I have
tried.

The difference with the complement version is that the accepting state I end
up with has all transitions to the error state (which guarantees termination
after match) where as these seem to still accept characters even after
matching the closing '*/'. It's possible I have a bug in my implementation, so
I'm still looking at it.

Thanks,

-Clint

On Wednesday, March 7, 2018 at 11:59:10 AM UTC-8, Ben Hanson wrote:
> [/][*]([^*]|[*]+[^*/])*[*]+[/]
>
> is what you are looking for. I ran into this when developing my lexer
> generator library lexertl in C++. Having a debug::dump() function
> really helped me grok what was going on.
>
> The trick of course is realising that you have to exclude the
> characters that follow (i.e. the [^*/] part). That is the bit that
> clobbers the greedy behaviour. I've had to remind myself of that on
> more than one occasion recently!

[This should work, it's a standard example in compiler texts. -John]

[toc] | [prev] | [next] | [standalone]

#1985

From	Clint O <clint.olsen@gmail.com>
Date	2018-03-10 00:57 -0800
Message-ID	<18-03-035@comp.compilers>
In reply to	#1984

On Friday, March 9, 2018 at 6:47:09 AM UTC-8, Clint O wrote:
> On Wednesday, March 7, 2018 at 11:59:10 AM UTC-8, Ben Hanson wrote:
> > [/][*]([^*]|[*]+[^*/])*[*]+[/]
>
> [This should work, it's a standard example in compiler texts. -John]

/This/ actually worked for me (one character change):

[/][*]([^*]|[*]+[^/])*[*]+[/]
                          ^

Thanks for looking at this!

-Clint

[toc] | [prev] | [next] | [standalone]

#1990

From	Ben Hanson <jamin.hanson@googlemail.com>
Date	2018-03-11 13:52 -0700
Message-ID	<18-03-041@comp.compilers>
In reply to	#1985

> /This/ actually worked for me (one character change):
>
> [/][*]([^*]|[*]+[^/])*[*]+[/]

Your modified regex produces the following state machine:

State: 0
  [/] -> 1

State: 1
  [*] -> 2

State: 2
  [^*] -> 2
  [*] -> 3

State: 3
  [^*/] -> 2
  [*] -> 4
  [/] -> 5

State: 4
  [^*/] -> 2
  [*] -> 4
  [/] -> 6

State: 5
  END STATE

State: 6
  END STATE
  [^*] -> 2
  [*] -> 3

Which will match

/***/a*/

in its entirety, when if should only match

/***/

Regards,

Ben
[Doesn't that depend on whether you interpret the END STATE in state 6 to stop even
if there's more input? -John]

[toc] | [prev] | [next] | [standalone]

#1993

From	Clint O <clint.olsen@gmail.com>
Date	2018-03-12 14:00 -0700
Message-ID	<18-03-045@comp.compilers>
In reply to	#1990

On Monday, March 12, 2018 at 1:19:29 PM UTC-7, Ben Hanson wrote:
> > /This/ actually worked for me (one character change):
> >
> > [/][*]([^*]|[*]+[^/])*[*]+[/]
>
> Your modified regex produces the following state machine:
>
[snip]
>
> Which will match
>
> /***/a*/
>
> in its entirety, when if should only match
>
> /***/
>
> Regards,
>
> Ben
> [Doesn't that depend on whether you interpret the END STATE in state 6 to
stop even
> if there's more input? -John]

Interesting. I'm not seeing this behavior with the sample input you've
provided. Again, I'm willing to concede that I have a bug :) What I'm doing is
simulating the DFA until I get to an error state or I hit EOF. So, this
guarantees I'll record the longest match I've found.

I could post the states that I come up with, but my state dumper also prints
out the RE it's currently processing (the actual expression). The successive
computation of derivatives can sometimes produce some rather abhorrent output,
and it's not always obvious (to me) what's going on. I'll work on a cleaner
presentation and try to post this.

It also looks like you are running a DFA minimizer (like Hopcroft) on your
result since I am not producing a minimal DFA. That also may help me figure
out if I'm producing the right automaton because they'd match...

Thanks,

-Clint

[toc] | [prev] | [next] | [standalone]

#2000

From	Ben Hanson <jamin.hanson@googlemail.com>
Date	2018-03-13 11:30 -0700
Message-ID	<18-03-054@comp.compilers>
In reply to	#1993

> It also looks like you are running a DFA minimizer (like Hopcroft) on your
> result since I am not producing a minimal DFA. That also may help me figure
> out if I'm producing the right automaton because they'd match...

Actually, although I have a minimising routine I'm not applying it here.

I intersect every character class in the regex to produce non-overlapping sets
and then these are grouped in equivalence sets for each state in the DFA. This
is the part that I didn't find in the Dragon Book (or any other compiler book
I looked in) and it was a message on this group from Vern Paxon that tipped me
off: https://compilers.iecc.com/comparch/article/94-10-091

(I only paid attention to the equivalence classes bit by the way!)

I intersect the equivalence sets as I build the DFA and I have found with this
technique you kind of have to go out of your way to get a non minimal DFA by
default. I've imagined an even more rigorous approach where character classes
are intersected as they are added to the syntax tree up front, but I've not
got around to trying to implement it.

I'd be interested to see your DFA output once you've cleaned it up a bit.
Using derivatives is interesting too - I've seen it discussed but never tried
to get into that.

Regards,

Ben

[toc] | [prev] | [next] | [standalone]

#2011

From	Clint O <clint.olsen@gmail.com>
Date	2018-03-17 16:52 -0700
Message-ID	<18-03-073@comp.compilers>
In reply to	#2000

On Tuesday, March 13, 2018 at 12:36:32 PM UTC-7, Ben Hanson wrote:
> I'd be interested to see your DFA output once you've cleaned it up a bit.
> Using derivatives is interesting too - I've seen it discussed but never
tried
> to get into that.

OK, I'm not sure how readable this will be, but here goes:

<RegexConcat [13]=0x1052672b0 0x10525acf8 0x10525aa58 'B7'>
+---<RegexConcat [11]=0x10525acf8 0x105267198 0x10525ab00 'B7'>
|   +---<RegexConcat [10]=0x105267198 0x10525a978 0x10525ad68 'B7'>
|   |   +---<RegexConcat [2]=0x10525a978 0x10525aa58 0x10525ab70 'B7'>
|   |   |   +---<RegexSym [0]=0x10525aa58 '/'>
|   |   |   `---<RegexSym [1]=0x10525ab70 '*'>
|   |   `---<RegexStar [9]=0x10525ad68 0x10525af98 '*'>
|   |       `---<RegexOr [8]=0x10525af98 0x10525add8 0x10525ae80 '|'>
|   |           +---<RegexSym [4]=0x10525add8 '^*'>
|   |           `---<RegexConcat [7]=0x10525ae80 0x10525ab00 0x105267048
'B7'>
|   |               +---<RegexPlus [5]=0x10525ab00 0x10525ab70 '+'>
|   |               |   `---<RegexSym [1]=0x10525ab70 '*'>
|   |               `---<RegexSym [6]=0x105267048 '^/'>
|   `---<RegexPlus [5]=0x10525ab00 0x10525ab70 '+'>
|       `---<RegexSym [1]=0x10525ab70 '*'>
`---<RegexSym [0]=0x10525aa58 '/'>

q0: /B7[*]B7([^*] | [*]+B7[^/])*B7[*]+B7/
    [/] q2
    ['\x00'-.0-C?] q1
q1: b
    ['\x00'-C?] q1
q2: [*]B7([^*] | [*]+B7[^/])*B7[*]+B7/
    [*] q3
    ['\x00'-)+-C?] q1
q3: ([^*] | [*]+B7[^/])*B7[*]+B7/
    [*] q4
    ['\x00'-)+-C?] q3
q4: ([*]*B7[^/]B7([^*] | [*]+B7[^/])*B7[*]+ | [*]*)B7/
    [*] q6
    ['\x00'-)+-.0-C?] q3
    [/] q5
q5: N5
    ['\x00'-C?] q1
q6: (([*]*B7[^/] | N5)B7([^*] | [*]+B7[^/])*B7[*]+ | [*]*)B7/
    [*] q8
    ['\x00'-)+-.0-C?] q3
    [/] q7
q7: ([^*] | [*]+B7[^/])*B7[*]+B7/ | N5
    [*] q4
    ['\x00'-)+-C?] q3
q8: ((([*]*B7[^/] | N5)B7([^*] | [*]+B7[^/])* | [*]*B7[^/]B7([^*] |
[*]+B7[^/])*)B7[*]+ | [*]*)B7/
    [*] q8
    ['\x00'-)+-.0-C?] q3
    [/] q7
Total DFA states: 9
Total RE instances: 35

I consolidated character classes, but I didn't do any nice reduction such as
pruning transitions to the error state.

Thanks,

-Clint

[toc] | [prev] | [next] | [standalone]

#2012

From	Clint O <clint.olsen@gmail.com>
Date	2018-03-18 19:23 +0000
Message-ID	<18-03-074@comp.compilers>
In reply to	#2000

On 2018-03-13, Ben Hanson <jamin.hanson@googlemail.com> wrote:
> I'd be interested to see your DFA output once you've cleaned it up a bit.
> Using derivatives is interesting too - I've seen it discussed but never tried
> to get into that.

Let's try this again:

<RegexConcat [13]=0x10b32f2b0 0x10b322cf8 0x10b322a58 'B7'>
+---<RegexConcat [11]=0x10b322cf8 0x10b32f198 0x10b322b00 'B7'>
|   +---<RegexConcat [10]=0x10b32f198 0x10b322978 0x10b322d68 'B7'>
|   |   +---<RegexConcat [2]=0x10b322978 0x10b322a58 0x10b322b70 'B7'>
|   |   |   +---<RegexSym [0]=0x10b322a58 '/'>
|   |   |   `---<RegexSym [1]=0x10b322b70 '*'>
|   |   `---<RegexStar [9]=0x10b322d68 0x10b322f98 '*'>
|   |       `---<RegexOr [8]=0x10b322f98 0x10b322dd8 0x10b322e80 '|'>
|   |           +---<RegexSym [4]=0x10b322dd8 '^*'>
|   |           `---<RegexConcat [7]=0x10b322e80 0x10b322b00 0x10b32f048 'B7'>
|   |               +---<RegexPlus [5]=0x10b322b00 0x10b322b70 '+'>
|   |               |   `---<RegexSym [1]=0x10b322b70 '*'>
|   |               `---<RegexSym [6]=0x10b32f048 '^/'>
|   `---<RegexPlus [5]=0x10b322b00 0x10b322b70 '+'>
|       `---<RegexSym [1]=0x10b322b70 '*'>
`---<RegexSym [0]=0x10b322a58 '/'>

q0: /B7[*]B7([^*] | [*]+B7[^/])*B7[*]+B7/
    [/] q2
    ['\x00'-.0-C?] q1
q1: b
    ['\x00'-C?] q1
q2: [*]B7([^*] | [*]+B7[^/])*B7[*]+B7/
    [*] q3
    ['\x00'-)+-C?] q1
q3: ([^*] | [*]+B7[^/])*B7[*]+B7/
    [*] q4
    ['\x00'-)+-C?] q3
q4: ([*]*B7[^/]B7([^*] | [*]+B7[^/])*B7[*]+ | [*]*)B7/
    [*] q6
    ['\x00'-)+-.0-C?] q3
    [/] q5
q5: N5
    ['\x00'-C?] q1
q6: (([*]*B7[^/] | N5)B7([^*] | [*]+B7[^/])*B7[*]+ | [*]*)B7/
    [*] q8
    ['\x00'-)+-.0-C?] q3
    [/] q7
q7: ([^*] | [*]+B7[^/])*B7[*]+B7/ | N5
    [*] q4
    ['\x00'-)+-C?] q3
q8: ((([*]*B7[^/] | N5)B7([^*] | [*]+B7[^/])* | [*]*B7[^/]B7([^*] | [*]+B7[^/])*)B7[*]+ | [*]*)B7/
    [*] q8
    ['\x00'-)+-.0-C?] q3
    [/] q7
Total DFA states: 9
Total RE instances: 35

So, q5, q7 are accepting states because they contain epsilon (empty string).

Thanks,

-Clint

[toc] | [prev] | [next] | [standalone]

#2019

From	Clint O <clint.olsen@gmail.com>
Date	2018-03-20 17:23 +0000
Message-ID	<18-03-087@comp.compilers>
In reply to	#2000

[ reposted to try to make the special characters look right ]


/·[*]·([^*] | [*]+·[^/])*·[*]+·/
<RegexConcat [13]=0x10ccc6f60 0x10ccc6f98 0x10ccc67f0 '·'>
+---<RegexConcat [11]=0x10ccc6f98 0x10ccc6ef0 0x10ccc6898 '·'>
|   +---<RegexConcat [10]=0x10ccc6ef0 0x10ccc6710 0x10ccc6b00 '·'>
|   |   +---<RegexConcat [2]=0x10ccc6710 0x10ccc67f0 0x10ccc6908 '·'>
|   |   |   +---<RegexSym [0]=0x10ccc67f0 '/'>
|   |   |   `---<RegexSym [1]=0x10ccc6908 '*'>
|   |   `---<RegexStar [9]=0x10ccc6b00 0x10ccc6c88 '*'>
|   |       `---<RegexOr [8]=0x10ccc6c88 0x10ccc6b70 0x10ccc6c18 '|'>
|   |           +---<RegexSym [4]=0x10ccc6b70 '^*'>
|   |           `---<RegexConcat [7]=0x10ccc6c18 0x10ccc6898 0x10ccc6da0 '·'>
|   |               +---<RegexPlus [5]=0x10ccc6898 0x10ccc6908 '+'>
|   |               |   `---<RegexSym [1]=0x10ccc6908 '*'>
|   |               `---<RegexSym [6]=0x10ccc6da0 '^/'>
|   `---<RegexPlus [5]=0x10ccc6898 0x10ccc6908 '+'>
|       `---<RegexSym [1]=0x10ccc6908 '*'>
`---<RegexSym [0]=0x10ccc67f0 '/'>

q0: /·[*]·([^*] | [*]+·[^/])*·[*]+·/
    [/] q2
    ['\x00'-.0-ÿ] q1
q1: ∅
    ['\x00'-ÿ] q1
q2: [*]·([^*] | [*]+·[^/])*·[*]+·/
    [*] q3
    ['\x00'-)+-ÿ] q1
q3: ([^*] | [*]+·[^/])*·[*]+·/
    [*] q4
    ['\x00'-)+-ÿ] q3
q4: ([*]*·[^/]·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/
    [*] q6
    ['\x00'-)+-.0-ÿ] q3
    [/] q5
q5: ε
    ['\x00'-ÿ] q1
q6: (([*]*·[^/] | ε)·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/
    [*] q8
    ['\x00'-)+-.0-ÿ] q3
    [/] q7
q7: ([^*] | [*]+·[^/])*·[*]+·/ | ε
    [*] q4
    ['\x00'-)+-ÿ] q3
q8: ((([*]*·[^/] | ε)·([^*] | [*]+·[^/])* | [*]*·[^/]·([^*] | [*]+·[^/])*)·[*]+ | [*]*)·/
    [*] q8
    ['\x00'-)+-.0-ÿ] q3
    [/] q7
Total DFA states: 9
Total RE instances: 35

[toc] | [prev] | [next] | [standalone]

#2021

From	Clint O <clint.olsen@gmail.com>
Date	2018-03-22 17:46 +0000
Message-ID	<18-03-089@comp.compilers>
In reply to	#2019

On 2018-03-20, Clint O <clint.olsen@gmail.com> wrote:
> [ reposted to try to make the special characters look right ]
>
> q0: /·[*]·([^*] | [*]+·[^/])*·[*]+·/
>     [/] q2
>     ['\x00'-.0-ÿ] q1
> q1: ∅
>     ['\x00'-ÿ] q1
> q2: [*]·([^*] | [*]+·[^/])*·[*]+·/
>     [*] q3
>     ['\x00'-)+-ÿ] q1
> q3: ([^*] | [*]+·[^/])*·[*]+·/
>     [*] q4
>     ['\x00'-)+-ÿ] q3
> q4: ([*]*·[^/]·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/
>     [*] q6
>     ['\x00'-)+-.0-ÿ] q3
>     [/] q5
> q5: ε
>     ['\x00'-ÿ] q1
> q6: (([*]*·[^/] | ε)·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/
>     [*] q8
>     ['\x00'-)+-.0-ÿ] q3
>     [/] q7
> q7: ([^*] | [*]+·[^/])*·[*]+·/ | ε
>     [*] q4
>     ['\x00'-)+-ÿ] q3
> q8: ((([*]*·[^/] | ε)·([^*] | [*]+·[^/])* | [*]*·[^/]·([^*] | [*]+·[^/])*)·[*]+ | [*]*)·/
>     [*] q8
>     ['\x00'-)+-.0-ÿ] q3
>     [/] q7

Thanks John for reposting this. It looks much better now.

In summary:

q5,q7 are accepting states since they contain epsilon. q2 represents the
error state.

The key to success with this algorithm is recognizing previously calculated
derivatives/expressions. When you no longer calculate unique derivatives,
the DFA construction terminates. As you can see the expressions can get
hairy pretty quickly. I don't know if you can glean much from the
successive expressions generated. It's akin to the method of walking the
parse tree but it bypasses the NFA construction entirely.

Thanks,

-Clint

[toc] | [prev] | [next] | [standalone]

#1994

From	Ben Hanson <jamin.hanson@googlemail.com>
Date	2018-03-12 15:46 -0700
Message-ID	<18-03-046@comp.compilers>
In reply to	#1990

> [Doesn't that depend on whether you interpret the END STATE in state 6 to stop even
> if there's more input? -John]

That is why we have the "leftmost longest" rule. Try it out with any regex engine (including flex).

Regards,

Ben

[toc] | [prev] | [next] | [standalone]

#1997

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2018-03-13 10:53 +0100
Message-ID	<18-03-050@comp.compilers>
In reply to	#1990

Am 11.03.2018 um 21:52 schrieb Ben Hanson:
>> /This/ actually worked for me (one character change):
>>
>> [/][*]([^*]|[*]+[^/])*[*]+[/]
>
> Your modified regex produces the following state machine:
>
> State: 0
>    [/] -> 1
>
> State: 1
>    [*] -> 2
>
> State: 2
>    [^*] -> 2
>    [*] -> 3

I'm just curious about the notation.
What happens if neither pattern matches?
What's the purpose of [^*] in state 2, as opposed to state 1?

DoDi

[toc] | [prev] | [next] | [standalone]

#2003

From	Ben Hanson <jamin.hanson@googlemail.com>
Date	2018-03-13 14:23 -0700
Message-ID	<18-03-059@comp.compilers>
In reply to	#1997

On Tuesday, 13 March 2018 19:33:53 UTC, Hans-Peter Diettrich  wrote:
> > State: 0
> >    [/] -> 1
> >
> > State: 1
> >    [*] -> 2
> >
> > State: 2
> >    [^*] -> 2
> >    [*] -> 3
>
> I'm just curious about the notation.
> What happens if neither pattern matches?
> What's the purpose of [^*] in state 2, as opposed to state 1?

In state 2 you are either looping on [^*] (anything other than '*') or changing state to state 3 on '*'.

I'm not sure I understand your question.

Regards,

Ben

[toc] | [prev] | [next] | [standalone]

#1986

From	Ben Hanson <jamin.hanson@googlemail.com>
Date	2018-03-10 03:17 -0800
Message-ID	<18-03-036@comp.compilers>
In reply to	#1984

Here's the dump of the state machine for that regex using lexertl:

State: 0
  [/] -> 1

State: 1
  [*] -> 2

State: 2
  [^*] -> 2
  [*] -> 3

State: 3
  [^*/] -> 2
  [*] -> 3
  [/] -> 4

State: 4
  END STATE

Regards,

Ben

[toc] | [prev] | [standalone]

csiph-web

Regular expression string searching & matching

Contents

#1973 — Regular expression string searching & matching

#1982

#1983

#1984

#1985

#1990

#1993

#2000

#2011

#2012

#2019

#2021

#1994

#1997

#2003

#1986