From: Hans-Peter Diettrich
Newsgroups: comp.compilers
Subject: Re: How make multifinished DFA for merged regexps?
Date: Tue, 24 Dec 2019 02:15:40 +0100
Organization: Compilers Central
Message-ID: <19-12-026@comp.compilers>
References: <19-12-005@comp.compilers> <19-12-010@comp.compilers>
Keywords: lex

On 21.12.2019 at 01:29, Andy wrote:

> Greedy algorithms match longest regexp. For example operators "+" and "++",
> int numbers "123" and float numbers "123.456e3".
> On '.' will finish state of number, but we will inside automata for float
> number. But can be errors: after '.' will 'a'. We must backtrack to last
> finished state?

Why should "123." not form a valid float number? In fact, it's the C way
to force what would otherwise be an int into a float.

If your lexer requires backtracking, e.g. because it is LR(n), then
this is the only solution. Unlike parsers, which may work based on
shift/reduce actions, a scanner should be kept simpler.

> I want avoid backtracking. Maybe after backtracking we must
> read chars from auxiliary token buffer instead of stream up to previous
> position? But this complicated parsing.

Parsers require a lookahead of at least one token.
So scanners should implement a lookahead of at least one character, and
possibly more, depending on the complexity or weirdness of the language
definition.

DoDi
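
To illustrate the "back to the last finished state" idea from the quoted question, here is a minimal sketch in Python. It is not taken from any particular lexer generator. The state names, the transition table, and the helper names (char_class, next_token, tokenize) are all assumptions made for this example. The table treats "123." as a valid float, as in C, so the rollback only happens in cases such as "123e+x", where the scanner has read "e+" without reaching another accepting state.

```python
# Hypothetical DFA for "+", "++", int and float literals (illustration only).
def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch in "0123456789":
        return "digit"
    if ch in "eE":
        return "e"
    if ch in "+-.":
        return ch
    return "other"

TRANSITIONS = {
    ("start", "digit"): "int",
    ("start", "+"): "plus",
    ("plus", "+"): "plusplus",
    ("int", "digit"): "int",
    ("int", "."): "frac",          # "123." is already a float, as in C
    ("int", "e"): "exp_mark",
    ("frac", "digit"): "frac",
    ("frac", "e"): "exp_mark",
    ("exp_mark", "+"): "exp_sign",
    ("exp_mark", "-"): "exp_sign",
    ("exp_mark", "digit"): "exp",
    ("exp_sign", "digit"): "exp",
    ("exp", "digit"): "exp",
}

# Accepting ("finished") states and the token kind each one yields.
ACCEPTING = {
    "int": "INT",
    "frac": "FLOAT",
    "exp": "FLOAT",
    "plus": "PLUS",
    "plusplus": "INCR",
}

def next_token(text, pos):
    """Return (kind, lexeme, new_pos) for the longest match starting at pos."""
    state = "start"
    last_accept = None             # (kind, end) of the last accepting state seen
    i = pos
    while i < len(text):
        state = TRANSITIONS.get((state, char_class(text[i])))
        if state is None:          # no transition: the automaton is stuck
            break
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    if last_accept is None:
        raise ValueError(f"no token at position {pos}")
    kind, end = last_accept        # roll back to the last accepting position
    return kind, text[pos:end], end

def tokenize(text):
    """Split text into (kind, lexeme) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        kind, lexeme, pos = next_token(text, pos)
        tokens.append((kind, lexeme))
    return tokens
```

With this table, next_token("123e+x", 0) returns the INT "123" and position 3, so the two characters read past the last accepting state are simply scanned again on the next call. Because the input here is an in-memory string, the "auxiliary buffer" is just the index; with a real stream, the characters read since the last accepting state would have to be kept in a small pushback buffer.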