Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Re: Interesting paper on regex NFA matching Date: Thu, 01 Feb 2024 04:09:48 +0200 Organization: Compilers Central Sender: johnl%iecc.com Approved: comp.compilers@iecc.com Message-ID: <24-01-010@comp.compilers> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="25978"; mail-complaints-to="abuse@iecc.com" Keywords: lex Posted-Date: 31 Jan 2024 22:48:43 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3552 Kaz Kylheku wrote: > But some backtracking is required for recognizing the longest matching > prefixes of the input, even if we use nothing but basic regex features! The key word in your sentence is "prefixes". The recognition problem is not about prefixes, but about complete strings. This is why my answer mentioned that the distinction between lexing (and parsing) and recognition is not covered in any automata books I have read. The definition of regular expressions is that when the finite automata halts (at the end of the string or when there are no transitions out of the state that it is in) it is either in an accepting state or not. That's the language recognition problem. There is no mention of "backing up to the last state that accepted." And for language recognition that is important, otherwise, any random trailing garbage after a legal string would still be considered a match because a prefix matched. Thus, given your regular expression (a*b)+., b, ab, aab, aaab, ... , bb, abb, aabb, ..., bab, baab, ...., aaaabaaab are all in the language, but aaabaaabaaac does not match that regular expression. A prefix of it does, but not the entire string. And, thus the string aaabaaabaaac does not satisfy the language recognition problem. Now, when you ask does a string which is a prefix of aaabaaabaaac match the regular expression, you get two strings which qualify aaab and aaabaab.But that's not the *language recognition* problem. That's a different problem, one relevant to lexing. However, either string satisfies the question of is this string in the language. Thus, both strings aaab and aaabaaab satisfy the language recognition problem. Only one satisfies your definition of the "correct" lexing problem because you are fixated on a specific extension of how regular expressions need to work. It is not a bad extension, but it is an extension of the meaning of regular expressions in the lexing context which is not the context of language recognition which is where regular expressions were defined. Notably, if one focuses on the language recognition problem, the one where prefixes don't count, only whole strings do, then whether the | operator is commutative or not, does not matter. It only matters when one can terminate (and accept) without considering the entire string. Still, there are cases where one might prefer to get the shortest answer, rather than the longest answer. Or cases, where one might prefer a shorter answer when two possible answers match. Those situations require a more complex language. A language which regular expressions are only a subset of. This class of languages is what the PCRE (and PEG) extensions allow one to access. It is possible to express every regular expression in those languages, trivially so with PCRE and with only a small amount of care with PEGs (whose / operator is non-commutative and chosen that way not to be confused with the | operator). And, moreover, you can get the same results for prefix matching if that is what you desire. However, you can also get results where other matches are the ones that are "preferred" and returned. The user simply has more choices. Now, if one hasn't chosen wisely and one has carelessly written an "extended" regular expression which returns a different preferred choice than the one which one expected, that is the fault of the writer not the formalism. People are just as capable of writing buggy regular expressions as they are of writing buggy programs, and that's true whether using the extensions or not. Maybe the user really wanted "(a+b)+" or the more complex "(a*b)+a*" or some other variant. And while there are buggy implementations of PCRE regular expressions, that doesn't mean the formalism itself is bad. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------