Path: csiph.com!xmission!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Kaz Kylheku <493-878-3164@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: Languages with optional spaces
Date: Wed, 26 Feb 2020 08:06:04 +0000 (UTC)
Organization: Aioe.org NNTP Server
Lines: 96
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <20-02-021@comp.compilers>
References: <20-02-015@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="4321"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, Basic, history
Posted-Date: 27 Feb 2020 17:33:44 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2459

On 2020-02-19, Maury Markowitz <maury.markowitz@gmail.com> wrote:
> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
> of 101 BASIC Games (DEF FN is the last feature to add).
>
> Then I got to Super Star Trek. To save memory, SST removes most spaces, so
> lines look like this:
>
> 100FORI=1TO10
>
> Here's my current patterns that match bits of this line:
>
> FOR         { return FOR; }
>
> [:,;()\^=+\-*/\<\>]     { return yytext[0]; }
>
> [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
>               yylval.d = atof(yytext);
>               return NUMBER;
>             }
>
> "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
>               yylval.s = g_string_new(yytext);
>               return IDENTIFIER;
>             }
>
> These correctly pick out some parts, numbers and = for instance, so it sees:
>
> 100 FORI = 1 TO 10
>
> The problem is that FORI part. Some BASICs allow variable names with more than
> two characters, so in theory, FORI could be a variable. These BASICs outlaw
> that in their parsers; any string that starts with a keyword exits then, so
> this would always parse as FOR. In lex, FORI is longer than FOR, so it returns
> a variable token called FORI.
>
> Is there a way to represent this in lex? Over on Stack Overflow the only
> suggestion seemed to be to use trailing syntax on the keywords, but that
> appears to require modifying every one of simple patterns for keywords with
> some extra (and ugly) syntax. Likewise, one might modify the variable name
> pattern, but I'm not sure how one says "everything that doesn't start with one
> of these other 110 patterns".

Two ideas:

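1. Just forget recognizing variable names in the lexer. Instead,
recognize only the constituent letters of a variable name in the lexer,
one token per letter. Because the keyword patterns are longer than a
one-letter pattern, FOR wins at the start of FORI and the I is left over.
Then in the parser, have a grammar production which converts
the letters of a variable into a variable.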

   variable : VARCHAR
            | variable VARCHAR
            ;

2. Use regex patterns in the lexer to recognize just the keywords,
as above.  Then, recognition of variable names is handled by
matching just one letter A-Z, whose lex action performs ad-hoc
lexical analysis using C logic. At that point you know that you do not
have a keyword, because no keyword rule matched. You can read further
characters using the scanner's input() function (yyinput() in flex C++
scanners) and accumulate a variable name. Don't read from yyin directly;
that would bypass the scanner's buffer.
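
To try idea 1 without lex or yacc, here is a small plain-C stand-in. It is
only a sketch under assumptions of mine: the four-entry keyword table, the
token spellings like KW(...) and VAR(...), and the names keyword_at and
tokenize are all made up for illustration, keywords are upper case only, and
variable names are letters only. At each position a keyword beats a single
letter, the way lex's longest-match rule would pick FOR over a one-letter
VARCHAR pattern, and the inner loop plays the role of the left-recursive
"variable" production.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical keyword table; a real BASIC has far more entries. */
static const char *const keywords[] = { "FOR", "TO", "NEXT", "STEP" };

/* Length of the longest keyword starting at s, or 0 if none starts there. */
static size_t keyword_at(const char *s)
{
    size_t best = 0;
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t n = strlen(keywords[i]);
        if (n > best && strncmp(s, keywords[i], n) == 0)
            best = n;
    }
    return best;
}

/* Write the tokens of src into out, separated by single spaces. */
static void tokenize(const char *src, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return;
    out[0] = '\0';

    while (*src != '\0') {
        char tok[64];
        size_t n;
        int w;

        if (isspace((unsigned char)*src)) {
            src++;
            continue;
        }
        if (isdigit((unsigned char)*src)) {
            n = strspn(src, "0123456789");
            snprintf(tok, sizeof tok, "NUM(%.*s)", (int)n, src);
        } else if ((n = keyword_at(src)) > 0) {
            /* A keyword is longer than one letter, so it wins. */
            snprintf(tok, sizeof tok, "KW(%.*s)", (int)n, src);
        } else if (isalpha((unsigned char)*src)) {
            /* variable : VARCHAR | variable VARCHAR
               Each further letter is a VARCHAR unless a keyword starts there. */
            n = 1;
            while (isalpha((unsigned char)src[n]) && keyword_at(src + n) == 0)
                n++;
            snprintf(tok, sizeof tok, "VAR(%.*s)", (int)n, src);
        } else {
            n = 1;
            snprintf(tok, sizeof tok, "'%c'", *src);
        }

        w = snprintf(out + used, outsize - used, "%s%s", used ? " " : "", tok);
        if (w < 0 || (size_t)w >= outsize - used)
            return;             /* out of room; output is truncated */
        used += (size_t)w;
        src += n;
    }
}
```

Calling tokenize on 100FORI=1TO10 with a 256-byte buffer should give a NUM,
then KW(FOR), then VAR(I), and so on. Note the side effect: a name can never
contain a keyword, so with this table a variable called STORE would break up
at TO. As far as I know that is how those old BASICs behaved anyway.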

A variant of technique (2) is used for scanning C comments,
as an alternative to an ugly regular expression:

  "/*"  {
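          int c;

          /* input() is the C scanner's name for this function; flex C++
             scanners call it yyinput(). End of input shows up as 0 or
             EOF depending on the lex/flex version, so test for both. */
          while ((c = input()) != 0 && c != EOF)
          {
            if (c == '\n')
            {
              /* increment line number or something */
            }
            else if (c == '*')
            {
              /* "*" followed by "/" ends the comment; otherwise push the
                 character back so that a run like "**/" still terminates. */
              if ((c = input()) == '/')
                break;
              else
                unput(c);
            }
          }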
        }

The above is an adaptation of something from an old Flex manual.
IIRC the Dragon Book has a similar example of ad-hoc logic
in a lex rule for handling C comments.

You can see that it's a similar idea. We use a regex to partially match
the comment, just the /* opening. Then we take over from there.

I have a hunch this would work for fetching variables like FORI, when
there is no match on a keyword like FOR.
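
Here is roughly what I have in mind for the letter rule's action, written
as plain C so it can be tried outside a scanner. Everything named here is
hypothetical: in_char and un_char are string-backed stand-ins for the
scanner's input() and unput(), and the keyword table is a four-entry toy.
The sketch also goes one step past the FORI case: if a keyword turns up
inside the characters being accumulated (the TO in XTOY), it pushes the
keyword back and ends the name there. That suffix check is naive; it
assumes no keyword ends with another keyword, which GOTO and TO would
violate in a full table.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* String-backed stand-ins for input() and unput(); names are made up. */
static const char *stream = "";
static char pushed[16];          /* must hold at least the longest keyword */
static int npushed = 0;

static int in_char(void)
{
    if (npushed > 0)
        return (unsigned char)pushed[--npushed];
    return *stream != '\0' ? (unsigned char)*stream++ : 0;
}

static void un_char(int c)
{
    if (npushed < (int)sizeof pushed)
        pushed[npushed++] = (char)c;
}

/* Hypothetical keyword table. */
static const char *const keywords[] = { "FOR", "TO", "NEXT", "STEP" };

/* If buf[0..len) ends in a keyword that does not start at index 0,
   return that keyword's length; otherwise return 0. */
static size_t keyword_suffix(const char *buf, size_t len)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t k = strlen(keywords[i]);
        if (k < len && memcmp(buf + len - k, keywords[i], k) == 0)
            return k;
    }
    return 0;
}

/* Called after the one-letter rule matched `first`. Accumulates the rest
   of the variable name into buf (size must be at least 2). */
static void scan_variable(int first, char *buf, size_t size)
{
    size_t len = 0;
    int c;

    buf[len++] = (char)first;
    while (len + 1 < size && (c = in_char()) != 0) {
        if (!isalnum(c)) {
            un_char(c);          /* not part of the name; give it back */
            break;
        }
        buf[len++] = (char)c;

        size_t k = keyword_suffix(buf, len);
        if (k > 0) {
            /* Push the keyword back, last character first, so that it is
               read again in the right order by the keyword rules. */
            while (k-- > 0)
                un_char(buf[--len]);
            break;
        }
    }
    buf[len] = '\0';
}
```

With stream pointing at "=1TO10" and first set to 'I', the name comes out
as I and the = is still there for the next rule. With stream at "TOY" and
first set to 'X', the name is X and TO is back in the input. In a real
scanner the action would then copy the name into yylval and return
IDENTIFIER.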

--
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List:  http://www.kylheku.com/diy
ADA MP-1 Mailing List:   http://www.kylheku.com/mp1