Re: Why does the lexer convert text integer lexemes to binary integers? I thought that lexers should be simple?

From	George Neuner <gneuner2@comcast.net>
Newsgroups	comp.compilers
Subject	Re: Why does the lexer convert text integer lexemes to binary integers? I thought that lexers should be simple?
Date	2022-07-14 16:38 -0400
Organization	A noiseless patient Spider
Message-ID	<22-07-015@comp.compilers> (permalink)
References	<22-07-011@comp.compilers>

Show all headers | View raw

On Thu, 14 Jul 2022 10:25:24 +0000, Roger L Costello
<costello@mitre.org> wrote:

>Hi Folks,
>
>A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
>expressions. When the lexer encounters an integer lexeme, it casts the lexeme
>to a binary integer and returns the value to the parser. The lexer contains a
>rule that looks something like this:
>
>{INTEGER} 	{ yylval.intval = atoi(yytext); return NUMBER; }
>
>But, but, but, ...
>
>Countless times on this list I have been told: Keep the lexer simple!
>
>By converting the lexeme to an integer, the lexer has assumed that the parser
>needs/wants a binary integer, not a text number. How does the lexer know what
>the parser needs/wants?

Presumably because the same developer that wrote the parser also wrote
the rules for the lexer.

In many (actually most) cases, the binary representation of an integer
can be stored in less space than the text representation.  When lex
and yacc were created, memory was small (and extremely expensive) and
storing things /small/ was very important.

This is also why compilers "intern" symbols and most strings so as to
store only one copy of any given text regardless of how many times it
is seen.

>That seems like knowledge the lexer shouldn't have if
>the lexer is to be simple. Further, even if one parser needs/wants a binary
>integer value, that parser might be swapped out at a later date and replaced
>with a different parser that wants the text number.

That's a valid point, but swapping "components" in that way is rarely,
if ever, done.  IME developers treat the lexer and parser combination
as a single input subsystem.

>It seems to me that the lexer should return to the parser the text number and
>it is the responsibility of the parser to convert the value to an integer data
>type if it desires.
>
>What do you think?

The general problem is performance. The place to decide what class a
token falls into is where you contact the text.

In principle the lexer could be as simple as strtok() - just
separating the tokens, returning a pointer directly to its input
buffer, and punting all further processing to the parser.

But ...

The lexer has already scanned the text just to locate the end of it.
If it then simply gives the text to the parser, the parser must scan
the text AGAIN to verify that it represents a number (or whatever the
parser happens to be looking for at that point).

Then the text gets scanned YET AGAIN to convert it to a number.

That is 3 scans of the text if the parser does the work versus 2 scans
if the lexer does the work.

Now you may think numeric constants are short and occur rarely enough
that this shouldn't matter ... and you'd be correct.

But ...

Remember that other things have to be recognized and classified as
well: in particular language reserved words and symbols, and all kinds
of user chosen names.  So, a (possibly case sensitive) dictionary of
already seen unique text must be maintained.

Searching and maintaining the dictionary will involve yet more passes
over the text: for indexing/hashing, for comparison, and for copying
if the text is new and needs to be preserved.
[These passes are unavoidable and must be done regardless of who does
the work ... but they need to be mentioned nonetheless.]

Moving (at least gross) input classification into the lexer avoids at
least one pass over the entirety of the input.  When considered
against the length of the average input, that can add up to a LOT of
avoided work.

>/Roger
>[I think the lexer should provide the tokens that the parser needs. If
>integers are always handled as numbers, convert them, if not, don't.
>If the parser does one and later changes to do the other, you can
>change the lexer, too. -John]
+1

George

Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

csiph-web