Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.programming > #899

Re: Deterministic Finite State Machine for regexs.

From Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Newsgroups comp.programming
Subject Re: Deterministic Finite State Machine for regexs.
References <9I4hq.4$yO1.1@newsfe21.iad> <ytWdnZP-mL_e1BXTnZ2dnUVZ7vSdnZ2d@bt.com>
Message-ID <fe6iq.938$6S1.306@newsfe16.iad> (permalink)
Date 2011-10-02 16:13 -0700

Show all headers | View raw


On 10/2/11 4:13 AM, Chris Uppal wrote:
> Daniel Pitts wrote:
>
> [on the subject of irregular extensions to regular language parsing]
>> I was wondering if anyone has any thoughts on how to implement that kind
>> of thing? I'd really like to the internal representation of a word
>> boundary to be equivalent to (last != \w&&  current == \w || last == \w
>> &&  current != \w)
>>
>> On an related note, I'd like to know if anyone has thoughts about
>> supporting back-references and/or group-capturing with such a system.
>> Its not as important for my current use-cases, but if I can build such a
>> generic tool, it will almost certainly be useful in the future.
>
> If you haven't already found Ross Cox's pages on regular expressions, then you
> may find it helpful.  A fair bit of it is going over the NFA/DFA stuff which
> you already know, but there are historical notes, and thoughts on
> implementations of non-regular features, plus descriptions of how his own
> library (RE2) implements such things.  Start with:
>
>      http://swtch.com/~rsc/regexp/regexp3.html
>
> BTW, if have my own little library too, and have faced (without actually
> bothering to solve -- yet) similar problems.  The approach I want to take, but
> haven't yet planned in enough detail to see if it'll even work let alone be
> acceptable fast, is to allow nodes in the NFA to be annotated with more-or-less
> arbitrary code (to be executed by the automaton) and which would (in some
> magical way) be preserved during translation to the DFA form (presumably it
> would normally result in larger DFA, since reachable state-sets couldn't be
> joined unless the code segments were equivalent along all paths leading to
> those states).

Thanks for the link, it was interesting to read.
In particular, the Caveats which point out the same limitations my 
implementation has:

http://swtch.com/~rsc/regexp/regexp3.html
> Caveats
>
> RE2 explicitly does not attempt to handle every extension that Perl has introduced. The Perl extensions it supports are: non-greedy repetition; character classes like \d; and empty assertions like \A, \b, \B, and \z. RE2 does not support arbitrary lookahead or lookbehind assertions, nor does it support backreferences. It supports counted repetition, but it is implemented by actual repetition (\d{3} becomes \d\d\d), so large repetition counts are unwise. RE2 supports Python-style named captures (?P<name>expr), but not the alternate syntaxes (?<name>expr) and (?'name'expr) used by .NET and Perl.

It is interesting to RE2 has so many "special cases" for different kinds 
of searching/matching, but for a general purpose library that makes sense.

In my case, the real use-case is very specific (finding the set of a 
bunch of special keywords in a text, to replace with specific 
substrings).  Where the keywords to search change infrequently but the 
text changes very frequently.

I've got enough implemented that I probably don't need to bother 
creating a full-fledged RE library such as RE2, but I'm also attracted 
to academic subjects and have been interested in this problem for a while.

Interestingly enough, my solution for "\b" (word boundary) looks like it 
would be feasible to implement in RE2, with some pre-processing of the 
regex.  \b* or \b+ for instance would break my code, but doesn't make 
much sense from a user POV anyway, and could be replaced with 
non-repeating forms easily without changing the semantics.

Back to comp.programming | Previous | NextPrevious in thread | Next in thread | Find similar


Thread

Deterministic Finite State Machine for regexs. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-09-29 13:39 -0700
  Re: Deterministic Finite State Machine for regexs. "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2011-10-02 12:13 +0100
    Re: Deterministic Finite State Machine for regexs. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-10-02 16:13 -0700
  Re: Deterministic Finite State Machine for regexs. Rui Maciel <rui.maciel@gmail.com> - 2011-10-03 11:55 +0100

csiph-web