From: Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: Why does the lexer convert text integer lexemes to binary integers? I thought that lexers should be simple?
Date: Fri, 15 Jul 2022 14:41:33 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <22-07-023@comp.compilers>
References: <22-07-011@comp.compilers>
Keywords: lex, design

On 2022-07-14, Roger L Costello wrote:
> Hi Folks,
>
> A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
> expressions. When the lexer encounters an integer lexeme, it casts the lexeme
> to a binary integer and returns the value to the parser. The lexer contains a
> rule that looks something like this:
>
> {INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }
>
> But, but, but, ...
>
> Countless times on this list I have been told: Keep the lexer simple!
>
> By converting the lexeme to an integer, the lexer has assumed that the parser
> needs/wants a binary integer, not a text number.

By the mere fact that you have something called "yylval" with a
"yylval.intval" member, you're already encoding a tight integration with
a parser, which is where these things come from. There is no "yylval"
global variable in a lex-generated program.

> How does the lexer know what
> the parser needs/wants? That seems like knowledge the lexer shouldn't have if
> the lexer is to be simple.

That's like asking, how does the ethernet driver in Linux know that the
protocol stack wants a "struct sk_buff *"? That seems like knowledge the
driver shouldn't have if it is to be simple, and usable in any operating
system.

Sometimes you just make integration decisions: you make the pieces agree
on some convenient data structure for data interchange and move on,
tossing aside concerns like whether a given piece can be reused in any
conceivable software system with the greatest possible ease.

> Further, even if one parser needs/wants a binary
> integer value, that parser might be swapped out at a later date and replaced
> with a different parser that wants the text number.

Chances are that will never happen; write for today.

A lexer which preserves the original lexemes could be useful for
software such as code reformatters. If you strongly suspect that you're
working on some language that will eventually have tooling that includes
a code reformatter, then maybe plan for that in the scanning/parsing.
(Right? I mean you clearly don't want a source-to-source beautifier to
be rewriting 0xFF as 255, except as a carefully controlled option.)

A lexer could prepare a token structure which has all the information:
a converted semantic item, and the lexeme.

> It seems to me that the lexer should return to the parser the text number and
> it is the responsibility of the parser to convert the value to an integer data
> type if it desires.

But how does the parser know what the parser's client wants?

What if the parser's client wants a syntax tree with the textual
lexemes, and not objects like binary integers?

The same argument can be repeated here: the parser should just produce
a parse tree with literal tokens, punting the conversion problem higher
up.

> What do you think?

How about this: a given token like INTEGER is processed in one place in
the lexer, right? One lexing rule recognizes the integer and can convert
the textual integer to a semantic, computational integer.

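To make the "one place" concrete, here is a rough sketch in C of a helper that the single integer rule could call. It does the conversion once and also keeps a copy of the lexeme, along the lines of the token structure mentioned earlier. Everything named here (struct token, TOK_INTEGER, make_integer_token, free_token) is made up for illustration and is not part of lex or yacc; strtol with base 0 is just one possible conversion policy, and it reads a leading 0 as octal.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token kind; a yacc-generated header would normally
   supply the real token numbers. */
enum token_kind { TOK_INTEGER = 258 };

/* Hypothetical token record: the converted value plus the original text. */
struct token {
    enum token_kind kind;
    long value;     /* semantic value, converted once in the lexer */
    char *lexeme;   /* original spelling, e.g. "0xFF", for reformatters */
    int overflow;   /* nonzero if the literal did not fit in a long */
};

/* Meant to be called from the one lexer rule that matches an integer.
   Returns 0 on success, -1 if the lexeme copy could not be allocated. */
int make_integer_token(struct token *tok, const char *text)
{
    size_t len;

    assert(tok != NULL && text != NULL);

    /* Keep the spelling: the scanner's buffer is reused for the next token. */
    len = strlen(text);
    tok->lexeme = malloc(len + 1);
    if (tok->lexeme == NULL)
        return -1;
    memcpy(tok->lexeme, text, len + 1);

    /* Base 0 accepts decimal, 0x-prefixed hex, and 0-prefixed octal. */
    errno = 0;
    tok->kind = TOK_INTEGER;
    tok->value = strtol(text, NULL, 0);
    tok->overflow = (errno == ERANGE);
    return 0;
}

/* Release the lexeme copy; the token can then be reused. */
void free_token(struct token *tok)
{
    free(tok->lexeme);
    tok->lexeme = NULL;
}
```

In a lex rule this might be invoked as make_integer_token(&yylval.tok, yytext), assuming a "tok" member of that type has been declared in the parser's %union; that member name is likewise an assumption.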
Whereas in the parser's grammar, an INTEGER symbol could appear in
multiple places. Would you rather add code to multiple places to convert
a textual integer to a value, or do it in one place, and just have the
value readily available in those dozens of places?

That's the main driver of why we convert semantic values at the lexical
level in the lex/yacc stack: so that if we have a rule like

  whatever : foo INTEGER ':' INTEGER

we can just use $2 and $4 to refer to integer values, and not have to
do:

  {
    int x = get_integer($2);
    int y = get_integer($4);
    $$ = function_of(x, y);
    free_lexeme($2); /* textual lexemes need dynamic alloc! */
    free_lexeme($4);
  }

versus:

  { $$ = function_of($2, $4); }

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal