From: Maury Markowitz
Newsgroups: comp.compilers
Subject: Languages with optional spaces
Date: Wed, 19 Feb 2020 07:35:59 -0800 (PST)
Organization: Compilers Central
Message-ID: <20-02-015@comp.compilers>
Keywords: lex, question, comment

I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk of 101 BASIC Games (DEF FN is the last feature to add).

Then I got to Super Star Trek. To save memory, SST removes most spaces, so lines look like this:

    100FORI=1TO10

Here are my current patterns that match bits of this line:

    FOR { return FOR; }
    [:,;()\^=+\-*/\<\>] { return yytext[0]; }
    [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? { yylval.d = atof(yytext); return NUMBER; }
    "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? { yylval.s = g_string_new(yytext); return IDENTIFIER; }

These correctly pick out some parts, numbers and = for instance, so it sees:

    100 FORI = 1 TO 10

The problem is that FORI part. Some BASICs allow variable names with more than two characters, so in theory, FORI could be a variable. These BASICs outlaw that in their parsers; any string that starts with a keyword is cut off at the end of the keyword, so this would always parse as FOR followed by I. In lex, FORI is longer than FOR, so the longest-match rule returns a variable token called FORI. Is there a way to represent this in lex?

Over on Stack Overflow the only suggestion seemed to be to use trailing context on the keywords, but that appears to require modifying every one of the simple keyword patterns with some extra (and ugly) syntax. Likewise, one might modify the variable name pattern, but I'm not sure how one says "everything that doesn't start with one of these other 110 patterns".

Is there a canonical cure for this sort of problem that isn't worse than the disease?

[Having written Fortran parsers, not that I've ever found. I did a prepass over each statement to figure out whether it was an assignment or something else, then the lexing was straightforward if not pretty. -John]
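
One possible way to get the "a keyword at the start of a name always wins" behaviour described above, without touching every keyword pattern, is to leave the patterns alone and check the matched text inside the identifier action. The C helper below is only a sketch under assumptions: the keyword table is a made-up sample rather than any real BASIC's list, the name keyword_prefix_len is hypothetical, and matching is case-sensitive because the FOR pattern in the post is.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sample table; a real interpreter would list every keyword. */
static const char *const keywords[] = {
    "FOR", "TO", "NEXT", "PRINT", "GOTO", "IF", "THEN"
};

/*
 * Return the length of the longest keyword that `text` starts with,
 * or 0 if it does not start with any keyword in the table.
 */
size_t keyword_prefix_len(const char *text)
{
    size_t best = 0;
    size_t count = sizeof keywords / sizeof keywords[0];

    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(keywords[i]);
        /* strncmp stops at the NUL in `text`, so short inputs are safe. */
        if (n > best && strncmp(text, keywords[i], n) == 0)
            best = n;
    }
    return best;
}
```

In a flex scanner, the identifier action could call keyword_prefix_len(yytext) and, when the result n is nonzero, call yyless(n) so that everything after the keyword is pushed back and rescanned, then return the token for that keyword (the lookup from keyword text to token code is left out here and would be the interpreter's own). With this scheme FORI is scanned as FOR followed by I, and a name such as TOTAL would be split into TO and TAL, which is the restriction the post attributes to these BASICs. It does not handle keywords that appear in the middle of a name.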