Path: csiph.com!xmission!news.snarked.org!news.linkpendium.com!news.linkpendium.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Jerry <awanderin@gmail.com>
Newsgroups: comp.compilers
Subject: Re: Languages with optional spaces
Date: Thu, 20 Feb 2020 23:38:51 -0700
Organization: A noiseless patient Spider
Lines: 394
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <20-02-016@comp.compilers>
References: <20-02-015@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="41015"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 21 Feb 2020 11:33:07 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2454

Maury Markowitz <maury.markowitz@gmail.com> writes:

> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
> of 101 BASIC Games (DEF FN is the last feature to add).
>
> Then I got to Super Star Trek. To save memory, SST removes most spaces, so
> lines look like this:
>
> 100FORI=1TO10
>
> Here's my current patterns that match bits of this line:
>
> FOR         { return FOR; }
>
> [:,;()\^=+\-*/\<\>]     { return yytext[0]; }
>
> [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
>               yylval.d = atof(yytext);
>               return NUMBER;
>             }
>
> "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
>               yylval.s = g_string_new(yytext);
>               return IDENTIFIER;
>             }
>
> These correctly pick out some parts, numbers and = for instance, so it sees:
>
> 100 FORI = 1 TO 10
>
> The problem is that FORI part. Some BASICs allow variable names with more than
> two characters, so in theory, FORI could be a variable. These BASICs outlaw
> that in their parsers; any string that starts with a keyword exits then, so
> this would always parse as FOR. In lex, FORI is longer than FOR, so it returns
> a variable token called FORI.
>
> Is there a way to represent this in lex? Over on Stack Overflow the only
> suggestion seemed to be to use trailing syntax on the keywords, but that
> appears to require modifying every one of simple patterns for keywords with
> some extra (and ugly) syntax. Likewise, one might modify the variable name
> pattern, but I'm not sure how one says "everything that doesn't start with one
> of these other 110 patterns".
>
> Is there a canonical cure for this sort of problem that isn't worse than the
> disease?
> [Having written Fortran parsers, not that I've ever found.  I did a prepass
> over each statement to figure out whether it was an assignment or something
> else, then the lexing was straightforward if not pretty. -John]

I have an unfinished compiler for Applesoft BASIC (the Apple II BASIC
written by Microsoft) that I wrote in Python using PLY (a Yacc-like
tool).  Since, as you discovered, the BASICs of the 1970s and 1980s
often didn't use spaces to separate tokens, I tossed the PLY lexer
module out the window and wrote the lexer all myself to interface to the
parser generator.  I don't have the code up on any website so I'll paste
it here and if it provides you with any insight, then great.

Essentially, what it does is at every point in the tokenizer, it tries
to find a token.  This means removing spaces and searching the token
table.  For certain tokens ("ATN", "AT", "TO"), spaces _are_
significant, and the original parser (in 6502 code) had special case
code to handle them.  My parser mimics what the original did as well.
What my lexer does that the original did not do is convert numbers and
strings into lexemes.  The original interpreter rescanned these at
run-time everytime it saw them.  I wanted to hand the parser entire
numbers and strings.

Here is the code:

-----

#!/usr/bin/env python3
"""Applesoft compiler lexical analyzer."""

import re
import sys


def escape_chars(text):
    """Escape regular-expression special characters."""
    for char in '$^+(*=':
        text = text.replace(char, '\\' + char)
    return text


regular_keywords = [
    'END', 'FOR', 'NEXT', 'DATA', 'INPUT', 'DEL', 'DIM', 'READ', 'GR',
    'TEXT', 'PR#', 'IN#', 'CALL', 'PLOT', 'HLIN', 'VLIN', 'HGR2',
    'HGR', 'HCOLOR=', 'HPLOT', 'DRAW', 'XDRAW', 'HTAB', 'HOME', 'ROT=',
    'SCALE=', 'SHLOAD', 'TRACE', 'NOTRACE', 'NORMAL', 'INVERSE',
    'FLASH', 'COLOR=', 'POP', 'VTAB', 'HIMEM:', 'LOMEM:', 'ONERR',
    'RESUME', 'RECALL', 'STORE', 'SPEED=', 'LET', 'GOTO', 'RUN', 'IF',
    'RESTORE', '&', 'GOSUB', 'RETURN', 'REM', 'STOP', 'ON', 'WAIT',
    'LOAD', 'SAVE', 'DEF', 'POKE', 'PRINT', 'CONT', 'LIST', 'CLEAR',
    'GET', 'NEW', 'TAB(', 'TO', 'FN', 'SPC(', 'THEN', 'NOT', 'STEP',
    '+', '-', '*', '/', '^', 'AND', 'OR', '>', '=', '<', 'SGN', 'INT',
    'ABS', 'USR', 'FRE', 'SCRN(', 'PDL', 'POS', 'SQR', 'RND', 'LOG',
    'EXP', 'COS', 'SIN', 'TAN', 'PEEK', 'LEN', 'STR$', 'VAL',
    'ASC', 'CHR$', 'LEFT$', 'RIGHT$', 'MID$']

# AT keyword is handled specially
#  "ATN" ==> <ATN> but "AT N" ==> <AT> "N"
#  "ATO" ==> "A" <TO> but "AT O" ==> <AT> "O"

special_keywords_re = re.compile(
    r'(A *T(?![NO]))|(A *TN)|(A(?= *T *O))', re.IGNORECASE)

keywords_re = re.compile('({})'.format(
    '|'.join([' *'.join(escape_chars(char) for char in kw)
              for kw in regular_keywords])), re.IGNORECASE)

integer_re = re.compile(r'[0-9][0-9 ]*')
float_re = re.compile(r'[\d ]*\.[\d ]*(E *[-+]?[\d ]*)?', re.IGNORECASE)

data_re = re.compile(r'[^":\n]*("[^\n]*")?[^":\n]*(?=[:\n])')
remark_re = re.compile(r'[^\n]*')
variable_re = re.compile(r'([A-Za-z] *[A-Za-z0-9 ]*[\$%]?)')
string_re = re.compile(r'"([^"]*)" *')
space_re = re.compile(r' +')

keyword_map = {
    '&': 'AMPER',
    '*': 'Times',
    '+': 'Plus',
    '-': 'Minus',
    '/': 'Divide',
    '<': 'Less',
    '=': 'Equal',
    '>': 'Greater',
    'CHR$': 'CHR',
    'COLOR=': 'COLOR',
    'HCOLOR=': 'HCOLOR',
    'HIMEM:': 'HIMEM',
    'IN#': 'IN',
    'LEFT$': 'LEFT',
    'LOMEM:': 'LOMEM',
    'MID$': 'MID',
    'PR#': 'PR',
    'RIGHT$': 'RIGHT',
    'ROT=': 'ROT',
    'SCALE=': 'SCALE',
    'SCRN(': 'SCRN',
    'SPC(': 'SPC',
    'SPEED=': 'SPEED',
    'STR$': 'STR',
    'TAB(': 'TAB',
    '^': 'Power'}

tokens = ('END', 'FOR', 'NEXT', 'DATA', 'INPUT', 'DEL', 'DIM', 'READ', 'GR',
          'TEXT', 'PR', 'IN', 'CALL', 'PLOT', 'HLIN', 'VLIN', 'HGR2', 'HGR',
          'HCOLOR', 'HPLOT', 'DRAW', 'XDRAW', 'HTAB', 'HOME', 'ROT', 'SCALE',
          'SHLOAD', 'TRACE', 'NOTRACE', 'NORMAL', 'INVERSE', 'FLASH', 'COLOR',
          'POP', 'VTAB', 'HIMEM', 'LOMEM', 'ONERR', 'RESUME', 'RECALL',
          'STORE', 'SPEED', 'LET', 'GOTO', 'RUN', 'IF', 'RESTORE', 'AMPER',
          'GOSUB', 'RETURN', 'REM', 'STOP', 'ON', 'WAIT', 'LOAD', 'SAVE',
          'DEF', 'POKE', 'PRINT', 'CONT', 'LIST', 'CLEAR', 'GET', 'NEW',
          'TAB', 'TO', 'FN', 'SPC', 'THEN', 'AT', 'NOT', 'STEP', 'AND', 'OR',
          'SGN', 'INT', 'ABS', 'USR', 'FRE', 'SCRN', 'PDL', 'POS', 'SQR',
          'RND', 'LOG', 'EXP', 'COS', 'SIN', 'TAN', 'ATN', 'PEEK', 'LEN',
          'STR', 'VAL', 'ASC', 'CHR', 'LEFT', 'RIGHT', 'MID',
          'Greater', 'Equal', 'Less', 'Plus', 'Minus', 'Times', 'Divide',
          'Power', 'LParen', 'RParen', 'String', 'Newline', 'Variable',
          'Colon', 'Comma', 'Semi', 'Float', 'Integer',)

literals_re = re.compile(r'([:;,\(\)\n])')
literals_map = {
    ':': 'Colon',
    ';': 'Semi',
    ',': 'Comma',
    '(': 'LParen',
    ')': 'RParen',
    '\n': 'Newline'}


class Token:
    """PLY Token: expected lexer return value for parser."""

    def __init__(self, token_type=None, value=None):
        self.type = token_type
        self.value = value
        self.lineno = None
        self.lexpos = None

    def __repr__(self):
        if self.value is not None:
            return '%s[%s]' % (self.type, repr(self.value))
        else:
            return self.type

    def __eq__(self, other):
        return self.type == other.type and self.value == other.value


def filter_text(text):
    """Upper-case and filter out spaces from text."""
    return text.replace(' ', '').upper()


class Lexer:
    """
    Lexer emulates quirky Applesoft tokenizer as closely as possible.

    Thus, it diverges from what a normal lexer would do.  In
    particular, spaces are completely insignificant, except within
    strings, except if it finds "AT" followed by a space and then an
    "N" appears.  That is parsed as "AT N", not "ATN".  Fun? ;)
    """

    def __init__(self, debug=False):
        self.s = ''         # input source string
        self.ix = 0
        self.DEBUG = debug

    def input(self, input_str):
        """Feed the lexical analyzer."""
        self.s = input_str
        self.ix = 0

    def debug(self, token):
        """Generate debug output."""
        if self.DEBUG:
            print(token)

    def find_keyword(self, offset=0):
        """
        Find a keyword.

        Handle the AT keyword in the same way as Applesoft:
         - "ATN" ==> ATN
         - "AT N" ==> AT N
         - "ATO" ==> A TO    This special case returns Variable:A.
         - "AT O" ==> AT O

        Return Token and input-size consumed on success.
        Return None, 0 if no token found.
        """
        t = Token()
        m_keyword = keywords_re.match(self.s[self.ix + offset:])

        if m_keyword:
            t.type = filter_text(m_keyword.group(0))
            length = m_keyword.end()

            if t.type in keyword_map:
                t.value = t.type = keyword_map[t.type]
            elif t.type == 'DATA':
                m_data = data_re.match(self.s[self.ix + offset + length:])
                assert m_data, 'DATA statement is munged'
                t.value = m_data.group(0)
                length += m_data.end()
            elif t.type == 'REM':
                m_remark = remark_re.match(self.s[self.ix + offset + length:])
                assert m_remark, 'REM statement is munged'
                t.value = m_remark.group(0)
                length += m_remark.end()
            return t, length

        m_special_keyword = special_keywords_re.match(
            self.s[self.ix + offset:])
        if m_special_keyword:
            if m_special_keyword.group(1):  # AT
                t.type = 'AT'
            elif m_special_keyword.group(2):  # ATN
                t.type = 'ATN'
            elif m_special_keyword.group(3):  # ATO ==> A (TO is parsed later)
                t.type = 'Variable'
                t.value = 'A'
            else:
                assert False, 'broken special_keyword: {}'.format(
                    m_special_keyword.groups())
            length = m_special_keyword.end()
            return t, length

        return None, 0

    def handle_literal(self, match):
        """Handle literal token."""
        t = Token(literals_map[match.group(1)])
        self.ix += match.end()
        self.debug(t)
        return t

    def handle_string(self, match):
        """Handle string token."""
        t = Token('String', match.group(1))
        self.ix += match.end()
        self.debug(t)
        return t

    def handle_float(self, match_float):
        """
        Handle floating-point token.

        Applesoft allows odd floats that must be converted:
          .     ==>  0.0
          .E    ==>  0.0
          123E  ==> 123.0
        """
        f = filter_text(match_float.group(0))
        if f.endswith('E'):
            f = f[:-1]
        if f == '.':
            f = '0.0'
        t = Token('Float', float(f))
        self.ix += match_float.end()
        self.debug(t)
        return t

    def handle_integer(self, match_int):
        """Handle integer token."""
        t = Token('Integer', int(filter_text(match_int.group(0))))
        self.ix += match_int.end()
        self.debug(t)
        return t

    def token(self):
        """
        Return a token to the parser.

        The Applesoft parser tokenizes keywords, identifies string
        literals, and leaves everything else as plain text (numbers,
        variables, and punctuation).  This parser treats the keywords
        the same way, but also creates special tokens for strings,
        integers, and floating-point numbers.

        Look for things in this order:
        - keywords
        - special keywords (ATN AT TO)
        - punctuation (literals)
        - strings
        - floats
        - integers
        - variables
        """
        m = space_re.match(self.s[self.ix:])
        if m:
            self.ix += m.end()

        if self.ix >= len(self.s):
            return None

        t, length = self.find_keyword()
        if t:
            self.ix += length
            self.debug(t)
            return t

        m = literals_re.match(self.s[self.ix:])
        if m:
            return self.handle_literal(m)

        m = string_re.match(self.s[self.ix:])
        if m:
            return self.handle_string(m)

        m_float = float_re.match(self.s[self.ix:])
        if m_float:
            return self.handle_float(m_float)

        m_int = integer_re.match(self.s[self.ix:])
        if m_int:
            return self.handle_integer(m_int)

        t = Token()
        m = variable_re.match(self.s[self.ix:])
        if m:
            t.type = 'Variable'

            #  If a keyword is found in any suffix of the variable
            #  name, exclude that suffix.
            for i in range(1, m.end()):
                token_ahead, length = self.find_keyword(i)
                if token_ahead:
                    t.value = filter_text(self.s[self.ix: self.ix + i])
                    assert t.value, 'Bad variable parse: {}'.format(
                        self.s[self.ix: self.ix + m.end()])
                    self.ix += i
                    self.debug(t)
                    return t

            t.value = filter_text(m.group(0))
            self.ix += m.end()
            self.debug(t)
            return t

        t.type = 'Char'
        t.value = self.s[self.ix]
        self.ix += 1
        self.debug(t)
        return t

--
Jerry    awanderin at gmail dot com