Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.programming > #899
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Newsgroups | comp.programming |
| Subject | Re: Deterministic Finite State Machine for regexs. |
| References | <9I4hq.4$yO1.1@newsfe21.iad> <ytWdnZP-mL_e1BXTnZ2dnUVZ7vSdnZ2d@bt.com> |
| Message-ID | <fe6iq.938$6S1.306@newsfe16.iad> (permalink) |
| Date | 2011-10-02 16:13 -0700 |
On 10/2/11 4:13 AM, Chris Uppal wrote:
> Daniel Pitts wrote:
>
> [on the subject of irregular extensions to regular language parsing]
>> I was wondering if anyone has any thoughts on how to implement that kind
>> of thing? I'd really like to the internal representation of a word
>> boundary to be equivalent to (last != \w&& current == \w || last == \w
>> && current != \w)
>>
>> On an related note, I'd like to know if anyone has thoughts about
>> supporting back-references and/or group-capturing with such a system.
>> Its not as important for my current use-cases, but if I can build such a
>> generic tool, it will almost certainly be useful in the future.
>
> If you haven't already found Ross Cox's pages on regular expressions, then you
> may find it helpful. A fair bit of it is going over the NFA/DFA stuff which
> you already know, but there are historical notes, and thoughts on
> implementations of non-regular features, plus descriptions of how his own
> library (RE2) implements such things. Start with:
>
> http://swtch.com/~rsc/regexp/regexp3.html
>
> BTW, if have my own little library too, and have faced (without actually
> bothering to solve -- yet) similar problems. The approach I want to take, but
> haven't yet planned in enough detail to see if it'll even work let alone be
> acceptable fast, is to allow nodes in the NFA to be annotated with more-or-less
> arbitrary code (to be executed by the automaton) and which would (in some
> magical way) be preserved during translation to the DFA form (presumably it
> would normally result in larger DFA, since reachable state-sets couldn't be
> joined unless the code segments were equivalent along all paths leading to
> those states).
Thanks for the link, it was interesting to read.
In particular, the Caveats which point out the same limitations my
implementation has:
http://swtch.com/~rsc/regexp/regexp3.html
> Caveats
>
> RE2 explicitly does not attempt to handle every extension that Perl has introduced. The Perl extensions it supports are: non-greedy repetition; character classes like \d; and empty assertions like \A, \b, \B, and \z. RE2 does not support arbitrary lookahead or lookbehind assertions, nor does it support backreferences. It supports counted repetition, but it is implemented by actual repetition (\d{3} becomes \d\d\d), so large repetition counts are unwise. RE2 supports Python-style named captures (?P<name>expr), but not the alternate syntaxes (?<name>expr) and (?'name'expr) used by .NET and Perl.
It is interesting to RE2 has so many "special cases" for different kinds
of searching/matching, but for a general purpose library that makes sense.
In my case, the real use-case is very specific (finding the set of a
bunch of special keywords in a text, to replace with specific
substrings). Where the keywords to search change infrequently but the
text changes very frequently.
I've got enough implemented that I probably don't need to bother
creating a full-fledged RE library such as RE2, but I'm also attracted
to academic subjects and have been interested in this problem for a while.
Interestingly enough, my solution for "\b" (word boundary) looks like it
would be feasible to implement in RE2, with some pre-processing of the
regex. \b* or \b+ for instance would break my code, but doesn't make
much sense from a user POV anyway, and could be replaced with
non-repeating forms easily without changing the semantics.
Back to comp.programming | Previous | Next — Previous in thread | Next in thread | Find similar
Deterministic Finite State Machine for regexs. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-09-29 13:39 -0700
Re: Deterministic Finite State Machine for regexs. "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2011-10-02 12:13 +0100
Re: Deterministic Finite State Machine for regexs. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-10-02 16:13 -0700
Re: Deterministic Finite State Machine for regexs. Rui Maciel <rui.maciel@gmail.com> - 2011-10-03 11:55 +0100
csiph-web