Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: RE: An ingenious scanning problem!
Date: Wed, 1 Jun 2022 13:34:22 +0300
Organization: Compilers Central
Lines: 91
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-001@comp.compilers>
References: <22-05-053@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="17443"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 01 Jun 2022 13:15:31 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3039

I will skip most of Roger's posting prior to the question:

> Question: After the action for TEST1\n is executed, which newline rule is
> executed?
>
> The answer depends on the meaning of yyless(). Here are two possible meanings
> for yyless():
>
> (a) The scanner is backing up in the input stream.
> (b) The scanner is pushing new characters onto the input stream.

I find the Flex choice of "b" to be semantically incorrect and a bug.
because you cannot get functionality "a" if you do that.  However, I
suspect that the bug is relatively intractable.  It may be quite hard
to back up the input stream and recreate the proper lexer state,
especially since I suspect that newlines are not part of the lexer
state but an additional "widget" on the side.

Note, that an alternate solution is to have what PCRE calls lookahead
zero-width assertions.  I think the original LEX had them for this
case.

That is, you don't use yyless for this case, but instead you write a rule like:

TEST1 / \n

where the "/" is an operator in PCRE it would look something like this

TEST1 (?=\n)

In either case, the \n is "peeked" at but not consumed as part of the
token and not otherwise processed at all.  So, you don't have to
"backup" the stream as you never moved it forward over the \n.  It
stays in the lookahead part of the input stream.

The other alternative is to not use yyless and simply deal with the \n
as part of the actions for the TEST1\n rule, making the rule do the
actions of both TEST1 and \n.  Now in this case where it is just one
character and that character is also a token, that probably works ok.
However, if \n# (for example) is a special token, you have gotten
yourself into a mess as you probably need to make TEST1\n# a rule
also, and you can see where this is headed.

But this all goes to my point about not making the lexer complicated.
It is probably a flaw in the language design if you need lookahead
past the end of the token to determine what to tokenize it as.  And,
note several designs by famous people in the compiler field have made
that mistake.  Two examples come immediately to mind.

Yacc itself needs identifier: as a special token to deal with the fact
that semicolons are optional at the end of a rule.  It makes the
language LR(2) otherwise. (In our (Compiler Resources) version of
Yacc++, we disallowed rules that don't end with semicolons.)  Note, I
suspect most versions don't allow comments in between the identifier
and the colon (they probably don't even allow whitespace).

In Pascal 1..2 need lookahead to recognize that it is an int a ..
operator and another int rather than 1.2 a floating point ("real")
number.  Better would have been to require spaces (as in 1 .. 2) in
that case.

Note, I even regret the way our version of Yacc++ handles code blocks
in braces.  It is the same kind of quagmire.  I cannot easily use
braces in other places because of it.  I am looking at having to
develop an even more complicated scheme to make that work, making the
corner case even more convoluted.

But when I rail against hand-written recursive descent parsers and not
using tools and living within the limits of what can express simply,
this is the kind of errors I am referring to.  Without the guardrails
of a tool telling you that your language design has gone off into the
reeds, you will make those kind of errors and your language will be
harder to evolve and have corner cases that you don't even know that
are there.  Enough preaching.  I will get off my soapbox.

Kind regards,
Chris

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------
[Lex and flex both have trailing context. The manual says that if it
can tell at compile time that the text matched by trailing context is
fixed length, it's cheap, otherwise it's very expensive. It also has
REJECT in an action which tells it to pretend that the pattern didn't
match and try something else. That's also very expsnsive, and
potentially very confusing. -John]