From: Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: Why does the lexer convert text integer lexemes to binary integers? I thought that lexers should be simple?
Date: Fri, 15 Jul 2022 14:41:33 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 107
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-023@comp.compilers>
References: <22-07-011@comp.compilers>
Keywords: lex, design
Posted-Date: 15 Jul 2022 12:28:25 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3122

On 2022-07-14, Roger L Costello <costello@mitre.org> wrote:
> Hi Folks,
>
> A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
> expressions. When the lexer encounters an integer lexeme, it casts the lexeme
> to a binary integer and returns the value to the parser. The lexer contains a
> rule that looks something like this:
>
> {INTEGER} 	{ yylval.intval = atoi(yytext); return NUMBER; }
>
> But, but, but, ...
>
> Countless times on this list I have been told: Keep the lexer simple!
>
> By converting the lexeme to an integer, the lexer has assumed that the parser
> needs/wants a binary integer, not a text number.

The mere fact that you have something called "yylval" with a
"yylval.intval" member means you're already encoding a tight
integration with a parser, which is where these things come from.

A program generated by lex alone has no "yylval" global variable; that
comes from the yacc side.

> How does the lexer know what
> the parser needs/wants? That seems like knowledge the lexer shouldn't have if
> the lexer is to be simple.

That's like asking, how does the ethernet driver in Linux know that
the protocol stack wants a "struct sk_buff *"?

That seems like knowledge the driver shouldn't have if it is to be
simple, and usable in any operating system.

Sometimes you just make integration decisions: you make the pieces agree
on some convenient data structure for data interchange and move on,
tossing aside concerns like whether a given piece can be reused in any
conceivable software system with the greatest possible ease.

> Further, even if one parser needs/wants a binary
> integer value, that parser might be swapped out at a later date and replaced
> with a different parser that wants the text number.

Chances are that will never happen; write for today.

A lexer which preserves the original lexemes could be useful for
software such as code reformatters.

If you strongly suspect that you're working on some language that will
eventually have tooling that includes a code reformatter, then maybe
plan for that in the scanning/parsing.

(Right? I mean you clearly don't want a source-to-source beautifier to
be rewriting 0xFF as 255, except as a carefully controlled option.)

A lexer could prepare a token structure which has all the information:
both the converted semantic item and the original lexeme.
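
A minimal sketch of such a token structure in C, purely for
illustration. The names "struct token", "make_integer_token" and
"free_token" are hypothetical, not part of lex or yacc; the text
argument stands in for what a lex rule would see in yytext. Using
strtol with base 0 is an assumption so that a spelling like 0xFF
converts to 255 while the original text is kept alongside it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token kinds; a real scanner would have many more. */
enum token_kind { TOK_INTEGER };

/* Hypothetical token record carrying both views of the same input. */
struct token {
    enum token_kind kind;
    long value;    /* converted semantic value */
    char *lexeme;  /* heap copy of the original spelling */
};

/* Build an INTEGER token from the matched text.
   Returns 0 on success, -1 if the lexeme copy cannot be allocated. */
int make_integer_token(struct token *tok, const char *text)
{
    size_t len;

    assert(tok != NULL && text != NULL);

    len = strlen(text);
    tok->lexeme = malloc(len + 1);
    if (tok->lexeme == NULL)
        return -1;
    memcpy(tok->lexeme, text, len + 1);

    tok->kind = TOK_INTEGER;
    /* Base 0: strtol accepts decimal, octal (leading 0) and hex (0x). */
    tok->value = strtol(text, NULL, 0);
    return 0;
}

/* Release the lexeme copy owned by the token. */
void free_token(struct token *tok)
{
    free(tok->lexeme);
    tok->lexeme = NULL;
}
```

With this, a calculator can read tok.value while a reformatter prints
tok.lexeme unchanged. The cost is one allocation per token, which is
the same dynamic allocation burden that passing textual lexemes
around brings.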

> It seems to me that the lexer should return to the parser the text number and
> it is the responsibility of the parser to convert the value to an integer data
> type if it desires.

But how does the parser know what the parser's client wants? What if
the parser's client wants a syntax tree with the textual lexemes, and
not objects like binary integers?

The same argument can be repeated here that the parser should just
produce a parse tree with literal tokens, punting the conversion problem
higher up.

> What do you think?

How about this: a given token like INTEGER is processed in one place in
the lexer, right? One lexing rule recognizes the integer and can convert
the textual integer to a semantic, computational integer.

Whereas in the parser's grammar, an INTEGER symbol could appear in
multiple places.

Would you rather add code to multiple places to convert a textual
integer to a value, or do it in one place, and just have the value
readily available in those dozens of places?

That's the main reason we convert semantic values at the lexical
level in the lex/yacc stack: so that if we have a rule like

   whatever : foo INTEGER ':' INTEGER

we can just use $2 and $4 to refer to integer values. Compare having
to write:

   {
      int x = get_integer($2);
      int y = get_integer($4);

      $$ = function_of(x, y);

      free_lexeme($2); /* textual lexemes need dynamic alloc! */
      free_lexeme($4);
   }

versus:

  { $$ = function_of($2, $4); }
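
The same idea can be sketched in plain C without the generators. This
is a hand-written stand-in, not what lex or yacc emit: "next_token",
"union semval" and "parse_pair" are hypothetical names, the rule is
simplified to INTEGER ':' INTEGER, and addition is only a placeholder
for function_of. The point is that the strtol call appears once, in
the scanner, and the "action" just uses the values.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token codes; other characters are returned as themselves. */
enum { TOK_EOF = 0, TOK_INTEGER = 256 };

/* Hypothetical semantic value, playing the role of yylval's type. */
union semval { long intval; };

/* Stand-in for yylex(): scans one token starting at *cursor. */
int next_token(const char **cursor, union semval *val)
{
    const char *p;

    assert(cursor != NULL && *cursor != NULL && val != NULL);

    p = *cursor;
    while (*p == ' ' || *p == '\t')
        p++;

    if (*p == '\0') {
        *cursor = p;
        return TOK_EOF;
    }
    if (isdigit((unsigned char)*p)) {
        char *end;
        val->intval = strtol(p, &end, 10);  /* the one conversion site */
        *cursor = end;
        return TOK_INTEGER;
    }
    *cursor = p + 1;
    return (unsigned char)*p;
}

/* Stand-in for a rule INTEGER ':' INTEGER and its action.
   Returns 0 and stores the result on success, -1 on a syntax error. */
int parse_pair(const char *input, long *result)
{
    union semval lhs, rhs, unused;
    const char *cur = input;

    if (next_token(&cur, &lhs) != TOK_INTEGER) return -1;
    if (next_token(&cur, &unused) != ':') return -1;
    if (next_token(&cur, &rhs) != TOK_INTEGER) return -1;
    if (next_token(&cur, &unused) != TOK_EOF) return -1;

    *result = lhs.intval + rhs.intval;  /* placeholder for function_of */
    return 0;
}
```

Any further rule that mentions INTEGER would reuse next_token as is;
none of them needs its own conversion or cleanup code.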

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal