From: Kaz Kylheku <493-878-3164@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: Languages with optional spaces
Date: Wed, 26 Feb 2020 08:06:04 +0000 (UTC)
Organization: Aioe.org NNTP Server
Message-ID: <20-02-021@comp.compilers>
References: <20-02-015@comp.compilers>
Keywords: lex, Basic, history

On 2020-02-19, Maury Markowitz wrote:
> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
> of 101 BASIC Games (DEF FN is the last feature to add).
>
> Then I got to Super Star Trek. To save memory, SST removes most spaces, so
> lines look like this:
>
> 100FORI=1TO10
>
> Here's my current patterns that match bits of this line:
>
> FOR { return FOR; }
>
> [:,;()\^=+\-*/\<\>] { return yytext[0]; }
>
> [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
>   yylval.d = atof(yytext);
>   return NUMBER;
> }
>
> "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
>   yylval.s = g_string_new(yytext);
>   return IDENTIFIER;
> }
>
> These correctly pick out some parts, numbers and = for instance, so it sees:
>
> 100 FORI = 1 TO 10
>
> The problem is that FORI part. Some BASICs allow variable names with more than
> two characters, so in theory, FORI could be a variable. These BASICs outlaw
> that in their parsers; any string that starts with a keyword exits then, so
> this would always parse as FOR.
> In lex, FORI is longer than FOR, so it returns
> a variable token called FORI.
>
> Is there a way to represent this in lex? Over on Stack Overflow the only
> suggestion seemed to be to use trailing syntax on the keywords, but that
> appears to require modifying every one of simple patterns for keywords with
> some extra (and ugly) syntax. Likewise, one might modify the variable name
> pattern, but I'm not sure how one says "everything that doesn't start with one
> of these other 110 patterns".

Two ideas:

1. Just forget recognizing variable names in the lexer. Instead, recognize
   only the constituent letters of a variable name in the lexer. Then in
   the parser, have a grammar production which assembles those letters
   into a variable.

     variable : VARCHAR
              | variable VARCHAR
              ;

2. Use regex patterns in the lexer to recognize just the keywords, as
   above. Then, recognition of variable names is handled by matching just
   one letter A-Z, whose lex action performs ad-hoc lexical analysis using
   C logic. At that point you know that you do not have a keyword, because
   no keyword rule matched (a keyword match is longer than one letter, so
   it would have won). You can read further characters using input()
   (named yyinput() when the scanner is compiled as C++), push back the
   one that does not belong using unput(), and accumulate a variable name.

A variant of technique (2) is used for scanning C comments, as an
alternative to an ugly regular expression:

  "/*"  {
          int c;

          /* input() is the name in a C scanner; C++ scanners use yyinput().
             Depending on the lex/flex version, end of input shows up
             as 0 or as EOF, so test for both. */
          while ((c = input()) != 0 && c != EOF) {
            if (c == '\n') {
              /* increment line number or something */
            } else if (c == '*') {
              if ((c = input()) == '/')
                break;     /* end of the comment */
              else
                unput(c);  /* could be another '*' or a newline; rescan it */
            }
          }
        }

The above is an adaptation of something from an old Flex manual.
IIRC the Dragon Book has a similar example of ad-hoc logic in a lex
rule for handling C comments.

You can see that it's a similar idea. We use a regex to partially match
the comment, just the /* opening. Then we take over from there.

I have a hunch this would work for fetching variables like FORI, when
there is no match on a keyword like FOR.
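To make idea (2) a bit more concrete, here is a minimal sketch of the same logic in plain C, outside of lex, so it can be compiled and poked at on its own. It is not a flex rule and not taken from any real BASIC: the names tok, keywords, keyword_at and next_token are made up for illustration, the keyword table is a tiny sample, and numbers are integers only. It also assumes that a variable name ends wherever a keyword begins, which is how I understand these BASICs to behave.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum tok { T_EOF, T_KEYWORD, T_NUMBER, T_VARIABLE, T_CHAR };

/* Tiny illustrative keyword table; a real BASIC has many more entries. */
static const char *const keywords[] = { "FOR", "TO", "NEXT", "PRINT" };

/* Return the length of the keyword starting at s, or 0 if none does. */
static size_t keyword_at(const char *s)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t n = strlen(keywords[i]);
        if (strncmp(s, keywords[i], n) == 0)
            return n;
    }
    return 0;
}

/* Scan one token at *pp, copy its text into buf (truncating if buf is
   too small), advance *pp past the token, and return its kind. */
static enum tok next_token(const char **pp, char *buf, size_t size)
{
    const char *p = *pp;
    size_t n = 0, copy;
    enum tok kind;

    assert(size >= 2);
    while (*p == ' ')
        p++;

    if (*p == '\0') {
        kind = T_EOF;
    } else if ((n = keyword_at(p)) > 0) {
        kind = T_KEYWORD;              /* keywords are always tried first */
    } else if (isdigit((unsigned char)*p)) {
        kind = T_NUMBER;               /* integers only in this sketch */
        while (isdigit((unsigned char)p[n]))
            n++;
    } else if (isalpha((unsigned char)*p)) {
        /* No keyword starts here, so this letter begins a variable.
           Accumulate by hand; stop at a non-alphanumeric character or
           where a keyword begins (an assumption of this sketch). */
        kind = T_VARIABLE;
        n = 1;
        while (isalnum((unsigned char)p[n]) && keyword_at(p + n) == 0)
            n++;
    } else {
        kind = T_CHAR;                 /* operators and punctuation */
        n = 1;
    }

    copy = n < size - 1 ? n : size - 1;
    memcpy(buf, p, copy);
    buf[copy] = '\0';
    *pp = p + n;
    return kind;
}
```

Calling next_token repeatedly on "100FORI=1TO10" yields 100, FOR, I, =, 1, TO, 10. In a lex scanner the keyword_at test is what the keyword rules already do for you; only the hand-written loop in the variable branch would live in the action of the one-letter rule.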
-- 
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1