How make multifinished DFA for merged regexps?

Path	csiph.com!3.us.feeder.erje.net!feeder.erje.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups	comp.compilers
Subject	How make multifinished DFA for merged regexps?
Date	Mon, 23 Dec 2019 05:40:43 -0500
Organization	Compilers Central
Lines	59
Sender	news@iecc.com
Approved	comp.compilers@iecc.com
Message-ID	<19-12-021@comp.compilers> (permalink)
Mime-Version	1.0
Content-Type	text/plain; charset="UTF-8"
Content-Transfer-Encoding	8bit
Injection-Info	gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="26633"; mail-complaints-to="abuse@iecc.com"
Keywords	DFA, parse
Posted-Date	23 Dec 2019 18:29:13 EST
X-submission-address	compilers@iecc.com
X-moderator-address	compilers-request@iecc.com
X-FAQ-and-archives	http://compilers.iecc.com
Xref	csiph.com comp.compilers:2402

Show key headers only | View raw

Andy borucki.andrzej@gmail.com wrote:

> Greedy algorithms match longest regexp. For example operators "+" and "++",
> int numbers "123" and float numbers "123.456e3".
> On '.' will finish state of number, but we will inside automata for float
> number. But can be errors: after '.' will 'a'. We must backtrack to last
> finished state? I want avoid backtracking. Maybe after backtracking we must
> read chars from auxiliary token buffer instead of stream up to previous
> position? But this complicated parsing.

1) Backtracking is one solution.

Using a character buffer (what you called a token buffer) rather than
reading the raw input can be used with or without backtracking.  Using
a character buffer also tends to be efficient as you only call the I/O
routines infrequently in that case.  Most lexers (such as flex) have
used character buffers to speed up their lexing significantly.  The
only problem is if you mix lexing with reading from the input stream
directly, but that is a bad idea in itself.

2) Yet, another solution is to find the cases where errors are caused
and use them to make rules that return multiple tokens.   E.g. for the
case “123.a”  make a rule for “123.” That returns two tokens “123” as
an int and “.” as a dot (presuming that is a token in its own right).
This works well as long as there aren’t infinite strings of characters
that toggle back and forth as to what is legal and you can make a
finite set of rules that disambiguate what you are lexing.

This is not an easy problem to solve. A fictitious language with
tokens “a”, “ab”, “b”, and “ba” with longest match rule would find the
string “aba” ambiguous as to whether you meant “ab” “a” or “a” “ba” as
your two tokens and it is possible only the parser will know which you
wanted, or even worse it can be ambiguous even at the parsing level.
Generally, people give up on solving the problem.

Moreover, in practice, because such language fragments are
problematic.  They apply the left-most longest rule and declare things
like “123.a” to be an error.  It is actually a pretty rare problem,
because usually the token sequence for “123” “.” and “a” as 3 tokens
isn’t legal to the parser.

3) Finally, you can define your tokens to be only unique substrings
and put them together in the parser.  E.g. “123” is always an “int”
token and “.” Is always a “dot” token and there is no “float” token,
just a float non-terminal that is “int dot int etc.”.  Now, that is
also not a panacea.  Especially if you have your lexer doing things
like throwing away whitespace and comments.  On the other hand, I have
seen languages that allowed whitespace (and even occasionally
comments) inside of tokens.  That has more to do with your language
specification/design.

--
*****************************************************************************
*
Chris Clark                  email:
christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris

Back to comp.compilers | Previous | Next | Find similar

Thread

How make multifinished DFA for merged regexps? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2019-12-23 05:40 -0500

csiph-web