Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Kaz Kylheku <433-929-6894@kylheku.com> Newsgroups: comp.compilers Subject: Re: Interesting paper on regex NFA matching Date: Sat, 27 Jan 2024 20:20:29 -0000 Organization: Compilers Central Sender: johnl%iecc.com Approved: comp.compilers@iecc.com Message-ID: <24-01-007@comp.compilers> References: <24-01-006@comp.compilers> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="81627"; mail-complaints-to="abuse@iecc.com" Keywords: lex, comment Posted-Date: 27 Jan 2024 22:19:04 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3549 On 2024-01-27, Christopher F Clark wrote: > Kaz Kylheku <433-929-6894@kylheku.com> wrote: > >> (I personaly am not so interested in backtracking regexes; and they are >> not directly applicable to compilers. You'd never want to design a >> lexical analyzer that requires "lookaround" or whatnot in order to >> recognize tokens. It's nice they are generating Arxiv papers for >> someone; no genuine intellectual inquiry is invalid; good luck! It's >> not applicable to the "Programming Languages" category it has been filed >> under. I will skim through it, though.) > > You are mistaken that they are not applicable to programming languages. > > Even Pascal [originally] had cases where lookahead and backtracking > are required to properly tokenize the input. > > The well-known case is "1." if followed by another "." it is two tokens, > and if followed by a digit it is a floating point number. > I don't recall what it should be if followed by any other character. I think that case can be handled by technology like Lex trailing contexts, which doesn't require backtracking. {digits}/. { // ... set up semantic value .. return INTEGER; } {digits}.{digits} { // ... set up semantic value .. return FLOAT; } Another possibility is to recongize {digits}. as one lexeme and call unput('.'). That's a minor form of backtracking, that doesn't require us to do anything funny to the regex implementation. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal Mastodon: @Kazinator@mstdn.ca [Patterns with alternatives and trailing contexts do have to back up and the flex manual warns it's quite expensive. See the PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]