From: Christopher F Clark
Newsgroups: comp.compilers
Subject: RE: An ingenious scanning problem!
Date: Wed, 1 Jun 2022 13:34:22 +0300
Organization: Compilers Central
Message-ID: <22-06-001@comp.compilers>
References: <22-05-053@comp.compilers>
Keywords: lex, comment

I will skip most of Roger's posting prior to the question:

> Question: After the action for TEST1\n is executed, which newline rule is
> executed?
>
> The answer depends on the meaning of yyless(). Here are two possible meanings
> for yyless():
>
> (a) The scanner is backing up in the input stream.
> (b) The scanner is pushing new characters onto the input stream.

I find the Flex choice of "b" to be semantically incorrect and a bug,
because you cannot get functionality "a" if you do that. However, I
suspect that the bug is relatively intractable. It may be quite hard
to back up the input stream and recreate the proper lexer state,
especially since I suspect that newlines are not part of the lexer
state but an additional "widget" on the side.

Note that an alternate solution is to have what PCRE calls lookahead
zero-width assertions. I think the original LEX had them for this case.
That is, you don't use yyless for this case, but instead you write a
rule like:

    TEST1 / \n

where the "/" is the trailing context operator. In PCRE it would look
something like this:

    TEST1 (?=\n)

In either case, the \n is "peeked" at but not consumed as part of the
token and not otherwise processed at all. So, you don't have to "back
up" the stream, as you never moved it forward over the \n. It stays in
the lookahead part of the input stream.

The other alternative is to not use yyless and simply deal with the \n
as part of the actions for the TEST1\n rule, making the rule do the
actions of both TEST1 and \n. Now in this case, where it is just one
character and that character is also a token, that probably works ok.
However, if \n# (for example) is a special token, you have gotten
yourself into a mess, as you probably need to make TEST1\n# a rule
also, and you can see where this is headed.

But this all goes to my point about not making the lexer complicated.
It is probably a flaw in the language design if you need lookahead
past the end of the token to determine what to tokenize it as. And
note that several designs by famous people in the compiler field have
made that mistake. Two examples come immediately to mind.

Yacc itself needs identifier: as a special token to deal with the fact
that semicolons are optional at the end of a rule. It makes the
language LR(2) otherwise. (In our (Compiler Resources) version of
Yacc++, we disallowed rules that don't end with semicolons.) Note, I
suspect most versions don't allow comments in between the identifier
and the colon (they probably don't even allow whitespace).

In Pascal, 1..2 needs lookahead to recognize that it is an int, a ..
operator, and another int rather than 1.2, a floating point ("real")
number. Better would have been to require spaces (as in 1 .. 2) in
that case.

Note, I even regret the way our version of Yacc++ handles code blocks
in braces. It is the same kind of quagmire. I cannot easily use braces
in other places because of it.
I am looking at having to develop an even more complicated scheme to
make that work, making the corner case even more convoluted.

But when I rail against hand-written recursive descent parsers and not
using tools and living within the limits of what you can express
simply, these are the kinds of errors I am referring to. Without the
guardrails of a tool telling you that your language design has gone
off into the reeds, you will make those kinds of errors, and your
language will be harder to evolve and have corner cases that you don't
even know are there.

Enough preaching. I will get off my soapbox.

Kind regards,
Chris

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------
[Lex and flex both have trailing context. The manual says that if it can
tell at compile time that the text matched by trailing context is fixed
length, it's cheap, otherwise it's very expensive. It also has REJECT in
an action which tells it to pretend that the pattern didn't match and try
something else. That's also very expensive, and potentially very
confusing. -John]
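
The lookahead idea discussed in this post (trailing context in lex, (?=...) in PCRE) can be tried outside a scanner generator. What follows is only a minimal Python sketch, not a model of what flex does internally. The rule table RULES and the function tokenize are hypothetical names made up for illustration. Rules are tried in order and the first match wins, which differs from lex's longest-match rule. The point it tries to show is that the newline after TEST1 is peeked at but never consumed, so the ordinary newline rule still sees it and nothing has to be pushed back.

```python
import re

# Hypothetical rule table, loosely modeled on the TEST1 example in the post.
# TEST1 uses a zero-width lookahead: the newline must follow, but it is not
# part of the match, so the scan position stops just before it.
RULES = [
    ("TEST1", re.compile(r"TEST1(?=\n)")),
    ("NEWLINE", re.compile(r"\n")),
    ("WORD", re.compile(r"[A-Za-z0-9]+")),
]


def tokenize(text):
    """Return (kind, lexeme) pairs; rules are tried in order, first match wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()  # the lookahead text is not included in m.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return tokens


if __name__ == "__main__":
    for token in tokenize("TEST1\nTEST1x\n"):
        print(token)
```

In this sketch the first TEST1 is reported as a TEST1 token and the newline after it as a separate NEWLINE token, while TEST1x falls through to the WORD rule because the lookahead fails there.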