Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.compilers > #2453 > unrolled thread

Languages with optional spaces

Started byMaury Markowitz <maury.markowitz@gmail.com>
First post2020-02-19 07:35 -0800
Last post2020-05-05 13:05 -0700
Articles 20 on this page of 22 — 12 participants

Back to article view | Back to comp.compilers


Contents

  Languages with optional spaces Maury Markowitz <maury.markowitz@gmail.com> - 2020-02-19 07:35 -0800
    Re: Languages with optional spaces Jerry <awanderin@gmail.com> - 2020-02-20 23:38 -0700
      Re: Languages with optional spaces Maury Markowitz <maury.markowitz@gmail.com> - 2020-02-25 06:13 -0800
        Re: Languages with optional spaces awanderin <awanderin@gmail.com> - 2020-02-26 10:03 -0700
      Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-03-12 17:45 +0200
    Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-02-23 12:33 +0200
      Re: Languages with optional spaces Martin Ward <martin@gkc.org.uk> - 2020-02-25 17:00 +0000
        Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-02-28 13:34 +0200
      Re: Languages with optional spaces Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-02-29 11:48 +0200
        Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-02-29 21:38 +0200
          Re: Languages with optional spaces Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-03-01 10:07 +0200
          Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-03-01 19:41 +0200
            Re: Languages with optional spaces Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-03-02 08:33 +0200
              Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-03-02 20:04 +0200
        Re: Languages with optional spaces Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-03-01 00:28 +0100
    Re: Languages with optional spaces Maury Markowitz <maury.markowitz@gmail.com> - 2020-02-25 06:11 -0800
    Re: Languages with optional spaces Kaz Kylheku <493-878-3164@kylheku.com> - 2020-02-26 08:06 +0000
    Re: Languages with optional spaces and tools Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-02-28 20:16 +0100
    Re: Languages with optional spaces gah4@u.washington.edu - 2020-03-02 21:12 -0800
    Re: Languages with optional spaces Gene <gene.ressler@gmail.com> - 2020-04-14 10:08 -0700
    Re: Languages with optional spaces mertesthomas@gmail.com - 2020-04-19 04:04 -0700
    Re: Languages with optional spaces aston.goldsmith@gmail.com - 2020-05-05 13:05 -0700

Page 1 of 2  [1] 2  Next page →


#2453 — Languages with optional spaces

FromMaury Markowitz <maury.markowitz@gmail.com>
Date2020-02-19 07:35 -0800
SubjectLanguages with optional spaces
Message-ID<20-02-015@comp.compilers>
I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
of 101 BASIC Games (DEF FN is the last feature to add).

Then I got to Super Star Trek. To save memory, SST removes most spaces, so
lines look like this:

100FORI=1TO10

Here's my current patterns that match bits of this line:

FOR         { return FOR; }

[:,;()\^=+\-*/\<\>]     { return yytext[0]; }

[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
              yylval.d = atof(yytext);
              return NUMBER;
            }

"FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
              yylval.s = g_string_new(yytext);
              return IDENTIFIER;
            }

These correctly pick out some parts, numbers and = for instance, so it sees:

100 FORI = 1 TO 10

The problem is that FORI part. Some BASICs allow variable names with more than
two characters, so in theory, FORI could be a variable. These BASICs outlaw
that in their parsers; any string that starts with a keyword exits then, so
this would always parse as FOR. In lex, FORI is longer than FOR, so it returns
a variable token called FORI.

Is there a way to represent this in lex? Over on Stack Overflow the only
suggestion seemed to be to use trailing syntax on the keywords, but that
appears to require modifying every one of simple patterns for keywords with
some extra (and ugly) syntax. Likewise, one might modify the variable name
pattern, but I'm not sure how one says "everything that doesn't start with one
of these other 110 patterns".

Is there a canonical cure for this sort of problem that isn't worse than the
disease?
[Having written Fortran parsers, not that I've ever found.  I did a prepass
over each statement to figure out whether it was an assignment or something
else, then the lexing was straightforward if not pretty. -John]

[toc] | [next] | [standalone]


#2454

FromJerry <awanderin@gmail.com>
Date2020-02-20 23:38 -0700
Message-ID<20-02-016@comp.compilers>
In reply to#2453
Maury Markowitz <maury.markowitz@gmail.com> writes:

> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
> of 101 BASIC Games (DEF FN is the last feature to add).
>
> Then I got to Super Star Trek. To save memory, SST removes most spaces, so
> lines look like this:
>
> 100FORI=1TO10
>
> Here's my current patterns that match bits of this line:
>
> FOR         { return FOR; }
>
> [:,;()\^=+\-*/\<\>]     { return yytext[0]; }
>
> [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
>               yylval.d = atof(yytext);
>               return NUMBER;
>             }
>
> "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
>               yylval.s = g_string_new(yytext);
>               return IDENTIFIER;
>             }
>
> These correctly pick out some parts, numbers and = for instance, so it sees:
>
> 100 FORI = 1 TO 10
>
> The problem is that FORI part. Some BASICs allow variable names with more than
> two characters, so in theory, FORI could be a variable. These BASICs outlaw
> that in their parsers; any string that starts with a keyword exits then, so
> this would always parse as FOR. In lex, FORI is longer than FOR, so it returns
> a variable token called FORI.
>
> Is there a way to represent this in lex? Over on Stack Overflow the only
> suggestion seemed to be to use trailing syntax on the keywords, but that
> appears to require modifying every one of simple patterns for keywords with
> some extra (and ugly) syntax. Likewise, one might modify the variable name
> pattern, but I'm not sure how one says "everything that doesn't start with one
> of these other 110 patterns".
>
> Is there a canonical cure for this sort of problem that isn't worse than the
> disease?
> [Having written Fortran parsers, not that I've ever found.  I did a prepass
> over each statement to figure out whether it was an assignment or something
> else, then the lexing was straightforward if not pretty. -John]

I have an unfinished compiler for Applesoft BASIC (the Apple II BASIC
written by Microsoft) that I wrote in Python using PLY (a Yacc-like
tool).  Since, as you discovered, the BASICs of the 1970s and 1980s
often didn't use spaces to separate tokens, I tossed the PLY lexer
module out the window and wrote the lexer all myself to interface to the
parser generator.  I don't have the code up on any website so I'll paste
it here and if it provides you with any insight, then great.

Essentially, what it does is at every point in the tokenizer, it tries
to find a token.  This means removing spaces and searching the token
table.  For certain tokens ("ATN", "AT", "TO"), spaces _are_
significant, and the original parser (in 6502 code) had special case
code to handle them.  My parser mimics what the original did as well.
What my lexer does that the original did not do is convert numbers and
strings into lexemes.  The original interpreter rescanned these at
run-time everytime it saw them.  I wanted to hand the parser entire
numbers and strings.

Here is the code:

-----

#!/usr/bin/env python3
"""Applesoft compiler lexical analyzer."""

import re
import sys


def escape_chars(text):
    """Escape regular-expression special characters."""
    for char in '$^+(*=':
        text = text.replace(char, '\\' + char)
    return text


regular_keywords = [
    'END', 'FOR', 'NEXT', 'DATA', 'INPUT', 'DEL', 'DIM', 'READ', 'GR',
    'TEXT', 'PR#', 'IN#', 'CALL', 'PLOT', 'HLIN', 'VLIN', 'HGR2',
    'HGR', 'HCOLOR=', 'HPLOT', 'DRAW', 'XDRAW', 'HTAB', 'HOME', 'ROT=',
    'SCALE=', 'SHLOAD', 'TRACE', 'NOTRACE', 'NORMAL', 'INVERSE',
    'FLASH', 'COLOR=', 'POP', 'VTAB', 'HIMEM:', 'LOMEM:', 'ONERR',
    'RESUME', 'RECALL', 'STORE', 'SPEED=', 'LET', 'GOTO', 'RUN', 'IF',
    'RESTORE', '&', 'GOSUB', 'RETURN', 'REM', 'STOP', 'ON', 'WAIT',
    'LOAD', 'SAVE', 'DEF', 'POKE', 'PRINT', 'CONT', 'LIST', 'CLEAR',
    'GET', 'NEW', 'TAB(', 'TO', 'FN', 'SPC(', 'THEN', 'NOT', 'STEP',
    '+', '-', '*', '/', '^', 'AND', 'OR', '>', '=', '<', 'SGN', 'INT',
    'ABS', 'USR', 'FRE', 'SCRN(', 'PDL', 'POS', 'SQR', 'RND', 'LOG',
    'EXP', 'COS', 'SIN', 'TAN', 'PEEK', 'LEN', 'STR$', 'VAL',
    'ASC', 'CHR$', 'LEFT$', 'RIGHT$', 'MID$']

# AT keyword is handled specially
#  "ATN" ==> <ATN> but "AT N" ==> <AT> "N"
#  "ATO" ==> "A" <TO> but "AT O" ==> <AT> "O"

special_keywords_re = re.compile(
    r'(A *T(?![NO]))|(A *TN)|(A(?= *T *O))', re.IGNORECASE)

keywords_re = re.compile('({})'.format(
    '|'.join([' *'.join(escape_chars(char) for char in kw)
              for kw in regular_keywords])), re.IGNORECASE)

integer_re = re.compile(r'[0-9][0-9 ]*')
float_re = re.compile(r'[\d ]*\.[\d ]*(E *[-+]?[\d ]*)?', re.IGNORECASE)

data_re = re.compile(r'[^":\n]*("[^\n]*")?[^":\n]*(?=[:\n])')
remark_re = re.compile(r'[^\n]*')
variable_re = re.compile(r'([A-Za-z] *[A-Za-z0-9 ]*[\$%]?)')
string_re = re.compile(r'"([^"]*)" *')
space_re = re.compile(r' +')

keyword_map = {
    '&': 'AMPER',
    '*': 'Times',
    '+': 'Plus',
    '-': 'Minus',
    '/': 'Divide',
    '<': 'Less',
    '=': 'Equal',
    '>': 'Greater',
    'CHR$': 'CHR',
    'COLOR=': 'COLOR',
    'HCOLOR=': 'HCOLOR',
    'HIMEM:': 'HIMEM',
    'IN#': 'IN',
    'LEFT$': 'LEFT',
    'LOMEM:': 'LOMEM',
    'MID$': 'MID',
    'PR#': 'PR',
    'RIGHT$': 'RIGHT',
    'ROT=': 'ROT',
    'SCALE=': 'SCALE',
    'SCRN(': 'SCRN',
    'SPC(': 'SPC',
    'SPEED=': 'SPEED',
    'STR$': 'STR',
    'TAB(': 'TAB',
    '^': 'Power'}

tokens = ('END', 'FOR', 'NEXT', 'DATA', 'INPUT', 'DEL', 'DIM', 'READ', 'GR',
          'TEXT', 'PR', 'IN', 'CALL', 'PLOT', 'HLIN', 'VLIN', 'HGR2', 'HGR',
          'HCOLOR', 'HPLOT', 'DRAW', 'XDRAW', 'HTAB', 'HOME', 'ROT', 'SCALE',
          'SHLOAD', 'TRACE', 'NOTRACE', 'NORMAL', 'INVERSE', 'FLASH', 'COLOR',
          'POP', 'VTAB', 'HIMEM', 'LOMEM', 'ONERR', 'RESUME', 'RECALL',
          'STORE', 'SPEED', 'LET', 'GOTO', 'RUN', 'IF', 'RESTORE', 'AMPER',
          'GOSUB', 'RETURN', 'REM', 'STOP', 'ON', 'WAIT', 'LOAD', 'SAVE',
          'DEF', 'POKE', 'PRINT', 'CONT', 'LIST', 'CLEAR', 'GET', 'NEW',
          'TAB', 'TO', 'FN', 'SPC', 'THEN', 'AT', 'NOT', 'STEP', 'AND', 'OR',
          'SGN', 'INT', 'ABS', 'USR', 'FRE', 'SCRN', 'PDL', 'POS', 'SQR',
          'RND', 'LOG', 'EXP', 'COS', 'SIN', 'TAN', 'ATN', 'PEEK', 'LEN',
          'STR', 'VAL', 'ASC', 'CHR', 'LEFT', 'RIGHT', 'MID',
          'Greater', 'Equal', 'Less', 'Plus', 'Minus', 'Times', 'Divide',
          'Power', 'LParen', 'RParen', 'String', 'Newline', 'Variable',
          'Colon', 'Comma', 'Semi', 'Float', 'Integer',)

literals_re = re.compile(r'([:;,\(\)\n])')
literals_map = {
    ':': 'Colon',
    ';': 'Semi',
    ',': 'Comma',
    '(': 'LParen',
    ')': 'RParen',
    '\n': 'Newline'}


class Token:
    """PLY Token: expected lexer return value for parser."""

    def __init__(self, token_type=None, value=None):
        self.type = token_type
        self.value = value
        self.lineno = None
        self.lexpos = None

    def __repr__(self):
        if self.value is not None:
            return '%s[%s]' % (self.type, repr(self.value))
        else:
            return self.type

    def __eq__(self, other):
        return self.type == other.type and self.value == other.value


def filter_text(text):
    """Upper-case and filter out spaces from text."""
    return text.replace(' ', '').upper()


class Lexer:
    """
    Lexer emulates quirky Applesoft tokenizer as closely as possible.

    Thus, it diverges from what a normal lexer would do.  In
    particular, spaces are completely insignificant, except within
    strings, except if it finds "AT" followed by a space and then an
    "N" appears.  That is parsed as "AT N", not "ATN".  Fun? ;)
    """

    def __init__(self, debug=False):
        self.s = ''         # input source string
        self.ix = 0
        self.DEBUG = debug

    def input(self, input_str):
        """Feed the lexical analyzer."""
        self.s = input_str
        self.ix = 0

    def debug(self, token):
        """Generate debug output."""
        if self.DEBUG:
            print(token)

    def find_keyword(self, offset=0):
        """
        Find a keyword.

        Handle the AT keyword in the same way as Applesoft:
         - "ATN" ==> ATN
         - "AT N" ==> AT N
         - "ATO" ==> A TO    This special case returns Variable:A.
         - "AT O" ==> AT O

        Return Token and input-size consumed on success.
        Return None, 0 if no token found.
        """
        t = Token()
        m_keyword = keywords_re.match(self.s[self.ix + offset:])

        if m_keyword:
            t.type = filter_text(m_keyword.group(0))
            length = m_keyword.end()

            if t.type in keyword_map:
                t.value = t.type = keyword_map[t.type]
            elif t.type == 'DATA':
                m_data = data_re.match(self.s[self.ix + offset + length:])
                assert m_data, 'DATA statement is munged'
                t.value = m_data.group(0)
                length += m_data.end()
            elif t.type == 'REM':
                m_remark = remark_re.match(self.s[self.ix + offset + length:])
                assert m_remark, 'REM statement is munged'
                t.value = m_remark.group(0)
                length += m_remark.end()
            return t, length

        m_special_keyword = special_keywords_re.match(
            self.s[self.ix + offset:])
        if m_special_keyword:
            if m_special_keyword.group(1):  # AT
                t.type = 'AT'
            elif m_special_keyword.group(2):  # ATN
                t.type = 'ATN'
            elif m_special_keyword.group(3):  # ATO ==> A (TO is parsed later)
                t.type = 'Variable'
                t.value = 'A'
            else:
                assert False, 'broken special_keyword: {}'.format(
                    m_special_keyword.groups())
            length = m_special_keyword.end()
            return t, length

        return None, 0

    def handle_literal(self, match):
        """Handle literal token."""
        t = Token(literals_map[match.group(1)])
        self.ix += match.end()
        self.debug(t)
        return t

    def handle_string(self, match):
        """Handle string token."""
        t = Token('String', match.group(1))
        self.ix += match.end()
        self.debug(t)
        return t

    def handle_float(self, match_float):
        """
        Handle floating-point token.

        Applesoft allows odd floats that must be converted:
          .     ==>  0.0
          .E    ==>  0.0
          123E  ==> 123.0
        """
        f = filter_text(match_float.group(0))
        if f.endswith('E'):
            f = f[:-1]
        if f == '.':
            f = '0.0'
        t = Token('Float', float(f))
        self.ix += match_float.end()
        self.debug(t)
        return t

    def handle_integer(self, match_int):
        """Handle integer token."""
        t = Token('Integer', int(filter_text(match_int.group(0))))
        self.ix += match_int.end()
        self.debug(t)
        return t

    def token(self):
        """
        Return a token to the parser.

        The Applesoft parser tokenizes keywords, identifies string
        literals, and leaves everything else as plain text (numbers,
        variables, and punctuation).  This parser treats the keywords
        the same way, but also creates special tokens for strings,
        integers, and floating-point numbers.

        Look for things in this order:
        - keywords
        - special keywords (ATN AT TO)
        - punctuation (literals)
        - strings
        - floats
        - integers
        - variables
        """
        m = space_re.match(self.s[self.ix:])
        if m:
            self.ix += m.end()

        if self.ix >= len(self.s):
            return None

        t, length = self.find_keyword()
        if t:
            self.ix += length
            self.debug(t)
            return t

        m = literals_re.match(self.s[self.ix:])
        if m:
            return self.handle_literal(m)

        m = string_re.match(self.s[self.ix:])
        if m:
            return self.handle_string(m)

        m_float = float_re.match(self.s[self.ix:])
        if m_float:
            return self.handle_float(m_float)

        m_int = integer_re.match(self.s[self.ix:])
        if m_int:
            return self.handle_integer(m_int)

        t = Token()
        m = variable_re.match(self.s[self.ix:])
        if m:
            t.type = 'Variable'

            #  If a keyword is found in any suffix of the variable
            #  name, exclude that suffix.
            for i in range(1, m.end()):
                token_ahead, length = self.find_keyword(i)
                if token_ahead:
                    t.value = filter_text(self.s[self.ix: self.ix + i])
                    assert t.value, 'Bad variable parse: {}'.format(
                        self.s[self.ix: self.ix + m.end()])
                    self.ix += i
                    self.debug(t)
                    return t

            t.value = filter_text(m.group(0))
            self.ix += m.end()
            self.debug(t)
            return t

        t.type = 'Char'
        t.value = self.s[self.ix]
        self.ix += 1
        self.debug(t)
        return t

--
Jerry    awanderin at gmail dot com

[toc] | [prev] | [next] | [standalone]


#2457

FromMaury Markowitz <maury.markowitz@gmail.com>
Date2020-02-25 06:13 -0800
Message-ID<20-02-019@comp.compilers>
In reply to#2454
Unless I'm mistaken, Integer BASIC is a "deep tokenizer", so is it not the case the spaces exist only at entry time? IE, if one types 'PRINT"HELLO"' that will actually be LISTed as 'PRINT "HELLO"?
[That is my recollection.  Storing tokens takes less space than storing the input text and on those tiny little
micros, every byte counted. -John]

[toc] | [prev] | [next] | [standalone]


#2460

Fromawanderin <awanderin@gmail.com>
Date2020-02-26 10:03 -0700
Message-ID<20-02-022@comp.compilers>
In reply to#2457
Maury Markowitz <maury.markowitz@gmail.com> writes:

> Unless I'm mistaken, Integer BASIC is a "deep tokenizer", so is it not
> the case the spaces exist only at entry time? IE, if one types
> 'PRINT"HELLO"' that will actually be LISTed as 'PRINT "HELLO"?
> [That is my recollection.  Storing tokens takes less space than
> storing the input text and on those tiny little
> micros, every byte counted. -John]

In the Apple II Microsoft BASIC, yes, the tokenizer removed spaces
except within strings.  On the Commodore machines, their version of
Microsoft BASIC did not purge the spaces.  This allowed the programmer
to "indent" the code (after the line number).  The tokens, like PRINT,
FOR, INPUT, etc, are stored as one byte in all the Microsoft BASICs.
The Apple II Integer BASIC also tokenizes to one byte and removes
spaces, but it did more syntax checking at entry time, so it was more a
lexer + parser than a lexer like Microsoft BASIC.

--
--
Jerry    awanderin at gmail dot com

[toc] | [prev] | [next] | [standalone]


#2483

From"Ev. Drikos" <drikosev@gmail.com>
Date2020-03-12 17:45 +0200
Message-ID<20-03-012@comp.compilers>
In reply to#2454
On 21/02/2020 08:38, Jerry wrote:
> class Lexer:
>      """
>      Lexer emulates quirky Applesoft tokenizer as closely as possible.
>
>      Thus, it diverges from what a normal lexer would do.  In
>      particular, spaces are completely insignificant, except within
>      strings, except if it finds "AT" followed by a space and then an
>      "N" appears.  That is parsed as "AT N", not "ATN".  Fun?;)
>      """

Hello,

Lacking any Python skills, I can't test the lexer but if this comment is
accurate, then any spaces in literals should normally be ignored.

The AppleSoft II manual in page 69 at www.classiccmp.org/cini/pdf/Apple/
implies that spaces are also important in literals after first position,
similar comment in the Basic dialect supported by the Compukit UK101 at
http://uk101.sourceforge.net/docs/index.html

Rather than critique, it's a detail I'm interested in, see next question.

Ev. Drikos

[toc] | [prev] | [next] | [standalone]


#2455

From"Ev. Drikos" <drikosev@gmail.com>
Date2020-02-23 12:33 +0200
Message-ID<20-02-017@comp.compilers>
In reply to#2453
On 19/02/2020 17:35, Maury Markowitz wrote:
> ... Likewise, one might modify the variable name
> pattern, but I'm not sure how one says "everything that doesn't start with one
> of these other 110 patterns".
>

This should be relatively simple with a scanner generator that supports
intersection and negation operators (or difference operators), but IMHO
such a rule would work ie if FORI wasn't a valid variable name. Still I
can't propose to you a specific tool though.

With a tool like flex, I'd try to see if different start states do work:
https://www.cs.virginia.edu/~cr4bd/flex-manual/Start-Conditions.html

Ev. Drikos
[You can try start states but in my experience if the tokenizing rules
are very context sensitive, it's easier to give up and hand-code the
lexer.  The lexical syntax of Basic isn't that big. -John]

[toc] | [prev] | [next] | [standalone]


#2458

FromMartin Ward <martin@gkc.org.uk>
Date2020-02-25 17:00 +0000
Message-ID<20-02-020@comp.compilers>
In reply to#2455
On 23/02/20 10:33, Ev. Drikos wrote:
> [You can try start states but in my experience if the tokenizing rules
> are very context sensitive, it's easier to give up and hand-code the
> lexer.  The lexical syntax of Basic isn't that big. -John]

The BASIC for the Compukit UK101 and Ohio Superboard stores
each line as a sequence of characters. As a line is entered
any keywords appearing anywhere on the line are replaced
by control characters. Keywords are not allowed anywhere
within a variable name. So any statement either starts
with a keyword (a control character), which determines
the statement type, or it is an assignment with the LET
keyword omitted.

Due to the small amount of memory available, variable names
are typically kept short and spaces are omitted. Short variable
names are less likely to include an embedded keyword.
Also, only the first two characters of a variable name are significant.
The program will also run faster with short variable names:
since the BASIC interpreter scans each line as it is interpreted.

More information and some sample BASIC software:

http://www.gkc.org.uk/martin/software/#UK101

A procedurally generated, open world, RTS (real time strategy) game
with destructable environments which runs in 8K of RAM:

http://www.gkc.org.uk/martin/software/startrek.html

--
			Martin

Dr Martin Ward | Email: martin@gkc.org.uk | http://www.gkc.org.uk
G.K.Chesterton site: http://www.gkc.org.uk/gkc | Erdos number: 4

[toc] | [prev] | [next] | [standalone]


#2467

From"Ev. Drikos" <drikosev@gmail.com>
Date2020-02-28 13:34 +0200
Message-ID<20-02-030@comp.compilers>
In reply to#2458
On 25/02/2020 19:00, Martin Ward wrote:
> ...So any statement either starts
> with a keyword (a control character), which determines
> the statement type, or it is an assignment with the LET
> keyword omitted...


A scanner generator that supports ie "difference" operators can build a
DFA that splits ie the text FORI to FOR & I without space overhead, if
the reserved keywords are also tokens of the lexer.

The lexical rules below result ie to a DFA with 10 states. Even if I
erase the difference operator "-=" till the end of the document, the
DFA built still has 10 states but the text FORI is returned as an "id".

So, the rules below ensure that an "id" doesn't start with FOR or PRINT,
not sure though if this is what the OP really wanted once he said that
in theory FORI could be a variable (perhaps it can't be just an LHS?).


Ev. Drikos

-----------------------------------------------------------------------
token ::=
            FOR
      |     PRINT
      |     id

FOR ::=
            F O R

PRINT ::=
            P R I N T

id ::=
            {A..Z [{$|A .. Z|0 .. 9}...]} -= {key[{$|A .. Z|0..9}...]}

key ::=
            FOR
      |     PRINT

[toc] | [prev] | [next] | [standalone]


#2470

FromChristopher F Clark <christopher.f.clark@compiler-resources.com>
Date2020-02-29 11:48 +0200
Message-ID<20-02-033@comp.compilers>
In reply to#2455
"Ev. Drikos" <drikosev@gmail.com> posted an interesting albeit partial solution
to the problem of keywords being part of identifiers in languages with
optional spaces.
I won't include it here.

The problem is that some keywords can appear at places other than the
beginning of an identifier.
In fact, in the worst case scenario, the language can be ambiguous.
Consider the following "BASIC" program extended with variables that
are more than one letter long
and spaces being optional.

10 LET ITO = 1
20 LET I = 2
30 LET JTOK = 3
40 LET K = 4
50 FOR N = ITOJTOK
60 REM AMBIGUOUS FOR N = I TO JTOK
70 REM OR FOR N = ITOJ TO K
80 PRINT N;
90 NEXT N
100 END

The problem with such solutions is one is tempted to "fix" them one by
one as they are encountered.

Maury Markowitz <maury.markowitz@gmail.com> mentioned this in his post
where ATO was considered.
It could be A TO or AT O (presuming that TO and AT are both keywords)
Note that this is even an issue with 1 letter variable names if one
has both keywords.

As one starts patching up these cases, the "grammar"
(or its recursive descent implementation most likely)
begins to become what I call "ad hack".

With a GLR parser (or something equivalent in power, e.g. an Earley
parser or CYK) and a lexer that returns all possible sets of
tokenizations one can find all the relevant parse trees and then see
if only 1 makes semantic sense.

In the above example, that won't help as both interpretations are
legal programs.
One prints 2 3, the other 1 2 3 4.

I cannot imagine a programmer being happy with the error message:
LINE 50 AMBIGUOUS STATEMENT.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------
[I get the impression that more often than not, whoever wrote the interpreter
didn't give it much thought so the grammar is whatever the 6502 code did thirty
years ago. Fortran was ugly but at least it wasn't ambiguous and at each
point the lexer knew what tokens were valid. -John]

[toc] | [prev] | [next] | [standalone]


#2471

From"Ev. Drikos" <drikosev@gmail.com>
Date2020-02-29 21:38 +0200
Message-ID<20-02-034@comp.compilers>
In reply to#2470
On 29/02/2020 11:48, Christopher F Clark wrote:
> "Ev. Drikos" <drikosev@gmail.com> posted an interesting albeit partial solution
> to the problem of keywords being part of identifiers in languages with
> optional spaces.

It's only an example for the beginning of statements, based on the
description I read in other postings. It couldn't be the solution of
an unspecified problem. The terms "usually" & "in theory" aren't specs.

In contrast, one who wants to code a parser for the particular dialect
of BASIC should know ie what the 6502 or the Compukit UK101 code did.

> The problem is that some keywords can appear at places other than the
> beginning of an identifier.
> In fact, in the worst case scenario, the language can be ambiguous.
>...
>
> With a GLR parser (or something equivalent in power, e.g. an Earley
> parser or CYK) and a lexer that returns all possible sets of
> tokenizations one can find all the relevant parse trees and then see
> if only 1 makes semantic sense.

Obviously, those who coded such BASIC parsers had some simpler rules,
ie the position of the first 'TO' might be used for the statement 50.
Such ruling may comply with the description given by Dr Martin Ward.

Ev. Drikos

[toc] | [prev] | [next] | [standalone]


#2473

FromChristopher F Clark <christopher.f.clark@compiler-resources.com>
Date2020-03-01 10:07 +0200
Message-ID<20-03-002@comp.compilers>
In reply to#2471
I'm sorry if my reply sounded like a criticism of Ev's post.  I didn't
intend it to be.  It was more pointing out that when one does things
by hand, one often makes things that are challenging to formalize.

I noticed this particular aspect as I was trying to imagine what would
need to add to the lexer generator in Yacc++ (you can use lex (or
flex) with it, but it also includes its own model which has extensions
so that you can write LR rules in your lexer too and a few other
things to minimize conflicts, integrate with a symbol table, etc).
One of our goals with writing it was to capture in a formalizable way
some of the important "hacks" one often used to solve problems (e.g.
how to do PL/I style keywords).

And this problem looked like one where if the hack could be distilled,
it would make sense to do so.  Not that I am a big fan of languages
with optional spaces, but that doesn't mean that they might not
represent something people need to handle.

But John, our moderator, and Dodi are right in a particular sense.
Certain things were done informally by hand "in the old days".  And
this actually points at another possible direction to a solution, PEGs
(parsing expression grammars).  It's a different way of attacking the
ambiguous language problem.  In PEGs, "or" constructs are not
un-ordered, but instead you take them in a specific order the grammar
writer chooses and the first one that applies is the correct one.  You
get somewhat the same effect with lex, where if two rules match you
use the one specified lexically first.  The only difference is that in
lex, the longest match rule overrides that, where in a PEG, the
ordering rule overrides (and you control the ordering at the caller
level, by how you write the "or" not by the order of the rules that
the "or" invokes).

In my ambiguous example, a PEG like construct where the first "TO" is
the rule that matches is probably not that hard to write.

And, PEG rules can actually be added to LL and LR grammars.  They came
from syntactic predicates that Terence Parr invented (and we added
those to Yacc++ by using the idea of "Tunnel Automata" pioneered by
Josef Gro"lsch).  If one goes down that path, it might be possible to
actually capture what people have done by hand.  I don't recall
whether we added them to the lexer portion of the generator.    So,
now it's on my to-do list.

On Sat, Feb 29, 2020 at 11:48 AM Christopher F Clark
<christopher.f.clark@compiler-resources.com> wrote:
>
> "Ev. Drikos" <drikosev@gmail.com> posted an interesting albeit partial solution
> to the problem of keywords being part of identifiers in languages with
> optional spaces.
> I won't include it here.
>
> The problem is that some keywords can appear at places other than the
> beginning of an identifier.
> In fact, in the worst case scenario, the language can be ambiguous.
> Consider the following "BASIC" program extended with variables that
> are more than one letter long
> and spaces being optional.
>
> 10 LET ITO = 1
> 20 LET I = 2
> 30 LET JTOK = 3
> 40 LET K = 4
> 50 FOR N = ITOJTOK
> 60 REM AMBIGUOUS FOR N = I TO JTOK
> 70 REM OR FOR N = ITOJ TO K
> 80 PRINT N;
> 90 NEXT N
> 100 END
>
> The problem with such solutions is one is tempted to "fix" them one by
> one as they are encountered.
>
> Maury Markowitz <maury.markowitz@gmail.com> mentioned this in his post
> where ATO was considered.
> It could be A TO or AT O (presuming that TO and AT are both keywords)
> Note that this is even an issue with 1 letter variable names if one
> has both keywords.
>
> As one starts patching up these cases, the "grammar"
> (or its recursive descent implementation most likely)
> begins to become what I call "ad hack".
>
> With a GLR parser (or something equivalent in power, e.g. an Earley
> parser or CYK) and
> a lexer that returns all possible sets of tokenizations one can find
> all the relevant parse
> trees and then see if only 1 makes semantic sense.
>
> In the above example, that won't help as both interpretations are
> legal programs.
> One prints 2 3, the other 1 2 3 4.
>
> I cannot imagine a programmer being happy with the error message:
> LINE 50 AMBIGUOUS STATEMENT.
--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------
[So how do you lex PL/I?  It's not ambiguous, and I think you never need more than
one token lookahead to decide whether a string of letters that looks like a
keyword is instead a variable name as in

IF IF = THEN; THEN ELSE = IF; ELSE THEN = IF;

-John]

[toc] | [prev] | [next] | [standalone]


#2474

From"Ev. Drikos" <drikosev@gmail.com>
Date2020-03-01 19:41 +0200
Message-ID<20-03-003@comp.compilers>
In reply to#2471
On 29/02/2020 21:38, Ev. Drikos wrote:
> On 29/02/2020 11:48, Christopher F Clark wrote:
> ...
> Obviously, those who coded such BASIC parsers had some simpler rules,
> ie the position of the first 'TO' might be used for the statement 50.
>

Dear Mr. Clark,

I'll elaborate a little if you don't mind. IMHO, one problem here is
that an identifier before a keyword is quite difficult for a scanner.
One could possibly let the parser recognize such obscure identifiers.

Let's assume that the first 'TO' found in a for-statement after a '='
is the keyword that separates the lower bound from the upper bound of
the loop index.

As a demo example, the simplified grammar at the end of this message
fails to parse only the last statement (999) below, which looks like
a known limitation of my Simulator. An actually generated C++ parser
would need some time and effort that currently I don't plan to spend:

10  let i=1
11  letleti=1
20  i=1
30  for forj=1 to n
40  FORFORJ=1TON
50  FOR N = ITOJTOK
60  FORFORJ=IIITONTOJ
70  FORFORJ=TO TO TO
80  FORFORJ=TTTON
999 FORFORJ=TOTON


Of course, this is far away from a working solution because one needs
to know a lot of details, ie if any spaces read are indeed important.

What would you suggest apart a hand coded lexer?

Ev. Drikos


----------------------------------------------------------------------
    SYNTAX RULES
----------------------------------------------------------------------

#sma <id>
#sma <idl>
#sma TO

<grm> ::=
            <statements>

<statements> ::=
            <statements> <statement>
      |     <statement>

<statement> ::=
            <for-stmt>
      |     <assignment>
      |     <empty>

<for-stmt> ::=
            <label> FOR <id> = <lbound> TO <ubound> <;>

<lbound> ::=
            <number>
      |     <id-p>

<id-p> ::=
            <id-p> <id-part>
      |     <id-start>

<id-part> ::=
            <letter>
      |     <digit>
      |     $

<id-start> ::=
            <letter>

<ubound> ::=
            <rhs>

<rhs> ::=
            <id>
      |     <number>

<assignment> ::=
            <label> LET <id> = <rhs> <;>
      |     <label> <idl> = <rhs> <;>

<empty> ::=
            <;>



----------------------------------------------------------------------
  LEXICAL CONVENTIONS
----------------------------------------------------------------------

#ignore spaces

#sma <id>
#sma <idl>

token ::=
            spaces
      |     END
      |     FOR
      |     LET
      |     NEXT
      |     PRINT
      |     TO
      |     <digit>
      |     <label>
      |     <number>
      |     <letter>
      |     <idl>
      |     <id>
      |     $
      |     =
      |     <;>

spaces ::=
            { \t | \s }...

END ::=
            E N D

FOR ::=
            F O R

LET ::=
            L E T

NEXT ::=
            N E X T

PRINT ::=
            P R I N T

TO ::=
            T O

<digit> ::=
            0 .. 9

<label> ::=
            { 0 .. 9 }...

<number> ::=
            { 0 .. 9 }...

<letter> ::=
            A .. Z

<idl> ::=
            { A .. Z [{$|A .. Z|0..9}...]} -= {key[{$|A..Z|0..9}...]}

key ::=
            END
      |     FOR
      |     LET
      |     NEXT
      |     PRINT

<id> ::=
            A .. Z [ { $ | A .. Z | 0 .. 9 }... ]

<;> ::=
            ;
      |     \n

[toc] | [prev] | [next] | [standalone]


#2476

FromChristopher F Clark <christopher.f.clark@compiler-resources.com>
Date2020-03-02 08:33 +0200
Message-ID<20-03-005@comp.compilers>
In reply to#2474
"Ev. Drikos" <drikosev@gmail.com> asked some very good questions to
which I don't know if I have good answers.  But, because he is
seriously trying to address the question, I am going to try and help.
I leave it up to him and you (other readers) whether my help is
actually helpful or not. So, these are just my thoughts.

And let's focus on the "TO" question.  I think his partial solution
works fine for the BASIC keywords at the start of statements because
they always need to start the token, but "TO", "THEN", and "ELSE" are
a different problem, They occur mid-statement.

As far as I can tell (and I haven't thought seriously about BASIC for
years), there are roughly two cases.

1) We have already identified the lower bound in "lower bound TO upper
bound".  This can happen if we saw a number, or maybe an identifier
without the word "TO" in it (followed by a space?).  This will not be
true, if we just saw an operator such as "+" as we will be expecting
another identifier to continue the expression.  So, knowing what the
parser expects is definitely important here.

In this case, the "TO" must be the next thing we see.  So, we are back
to the beginning of the statement problem, and we can use the solution
already proposed as is.

2) We need an identifier to complete the expression "lower bound TO
upper bound".  In this case, we need at least one letter (assuming
that we aren't allowed to start identifiers with digits) before the
word "TO" and the "TO" must be followed by a space, a letter or a
digit, but not an operator character.  So, you need a regular
expression that looks something like this /[A-Za-z0-9]+TO[A-Za-z0-9 ]+/
note the extra space in the regex class following the "TO".
Hmm, you might need a left paren in that regex.  The problem here is
that you are basically expecting the lexer to do part of the parser's
job, which is why such things tend to be very messy and likely fraught
with errors or corner cases that are surprising and don't do what you
want.

Now, if you manage to get an appropriate regex, then you need to break
that down into tokens.  If you used a PCRE like regex package to do
so, you might be able to use captures to do that for you.

Note, if you are using the lex/yacc interface where the parser calls
the lexer and expects it to return one token each time, you need to
insert a buffer so that you have a place to push tokens that you want
to return on subsequent calls.  Coroutines are a better solution, but
they have a slightly different API.   We use a coroutine model
(implemented with a buffer) in Yacc++ between our lexer and parser, to
specifically deal with this (and to allow our lexer/parser to work in
event driven systems where the flow of control is inverted).

-----

So, what would I do?  I would do it in 3 parts.  A lexer, a "fixer",
and a parser.  The lexer would just do the easy cases, returning
identifiers separated by spaces (and punctuation), but possibly
pulling off initial keywords at the start of statements (and probably
only then).

The "fixer" would be run on tokens in the middle of the statement and
try to find these hard cases.  It might need to know what the parser
expects to get them right.  If so, I would use a PEG parser (like
packrat?) to get parsing capacity.  If not, or if I can somehow direct
the fixer with info from the parser (or use something like lexer
states), I might use a PCRE scanner to pull things apart.

Curiously, I suspect the fixer for all three special keywords looks
the same.  They all have similar context information.  I don't know if
that makes writing the relevant regex easier.

After the fixer, the parser could probably be simple again.

----

Worth noting is once-upon-a-time, Terence Parr (the PCCTS/ANTLR guy)
proposed and perhaps implemented a structure like this, where there
were "channels" and "filters" that could go between the lexer and the
parser.  (I copied him on this reply in case, he no longer reads this
newsgroup.)

You see a similar problem in implementing a C preprocessor.  You have
tokens for a lexer, but you need to re-process them into different
tokens after doing macro substitution.  So, this is not a one-off
problem.  It reoccurs in different contexts.

However, I think a general solution is likely to be hard to formalize.
That doesn't mean we should stop trying.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]


#2477

From"Ev. Drikos" <drikosev@gmail.com>
Date2020-03-02 20:04 +0200
Message-ID<20-03-006@comp.compilers>
In reply to#2476
On 02/03/2020 08:33, Christopher F Clark wrote:
> ...
>
> As far as I can tell (and I haven't thought seriously about BASIC for
> years), there are roughly two cases.
>
   Neither I have thought seriously about any BASIC Dialect so far. The
example grammar in my previous message took me less than half an hour.

> 1) We have already identified the lower bound in "lower bound TO upper
> bound".  This can happen if we saw a number, or maybe an identifier
> without the word "TO" in it (followed by a space?).  This will not be
> true, if we just saw an operator such as "+" as we will be expecting
> another identifier to continue the expression...

One is supposed to progressively add syntax as needed. I just found such
a case at www.gkc.org.uk/martin/software/UK101-tapes.zip:

5120 FOR J=K+1TON4:A(K,J)=A(K,J)/P:NEXT J:I=1

Here is a more complete rule that parses ie "512 FOR J=LOWER+1TOUPPER"

<lbound> ::=
            <lbound> <rel-op> <lbound> ~: <rel-op>
      |     <lbound> <add-op> <lbound>
      |     <lbound> <mult-op> <lbound>
      |     <add-op> <lbound> -: { <mult-op> }
      |     <number>
      |     <id-p>

IMHO, the most difficult problems come with statements like "NEXT i,j"
that imply a shared end statement of two FOR loops and BASIC doesn't
seem to be as easy as Fortran where this problem has a simple solution:
https://github.com/drikosev/Fortran/blob/master/OMP_Legacy_Parser.txt

> ...
> So, what would I do?  I would do it in 3 parts.  A lexer, a "fixer",
> and a parser...

So, the suggestion is a fixer between the parser and the lexer. Ok, I
appreciate your opinion.

Ev. Drikos

[toc] | [prev] | [next] | [standalone]


#2472

FromHans-Peter Diettrich <DrDiettrich1@netscape.net>
Date2020-03-01 00:28 +0100
Message-ID<20-03-001@comp.compilers>
In reply to#2470
Am 29.02.2020 um 10:48 schrieb Christopher F Clark:

> With a GLR parser (or something equivalent in power, e.g. an Earley
> parser or CYK) and a lexer that returns all possible sets of
> tokenizations one can find all the relevant parse trees and then see
> if only 1 makes semantic sense.

I wonder how those old coders could fit all that into about 4 KB ROM for
their BASIC interpreter ;-)

Isn't there a better solution nowadays than taking a sledgehammer to
crack a nut? The reconstruction of the proper original syntax requires
to disassemble the original code, so nothings hinders one from
implementing just that code on new hardware.

DoDi
[I suppose it depends on whether you want to understand what the old 6502
code did, or just run it.  I'm also not sure I want to emulate whatever
passed for an arithmetic package on those machines. -John]

[toc] | [prev] | [next] | [standalone]


#2456

FromMaury Markowitz <maury.markowitz@gmail.com>
Date2020-02-25 06:11 -0800
Message-ID<20-02-018@comp.compilers>
In reply to#2453
So it seems the most basic answer is "no, it won't do that". Perhaps this should be a feature? "Tokens are immediate terminals" or somesuch.

I originally chose flex/bison because it is widely available on almost any platform (although on Windows it is sub-optimal) and I found a reasonable BASIC starting point online.

In any event, it does seem the practical solution is to make my own lexer. Luckily there are any number that exist already in cross-platform form.

Perhaps you all have some starting points I should look at? I'm most interested in plain C although I am definitely going to look at Jerry's code.
[Writing a lexer for Basic shouldn't be very hard since the token structure is pretty straightforward.  The hardest part
is being sure you detect and deal with invalid input in all the places it can occur.  C lexers tend to be loops with
big switch statements on the input characters, and you need to remember to have a default in all of them. -John]

[toc] | [prev] | [next] | [standalone]


#2459

FromKaz Kylheku <493-878-3164@kylheku.com>
Date2020-02-26 08:06 +0000
Message-ID<20-02-021@comp.compilers>
In reply to#2453
On 2020-02-19, Maury Markowitz <maury.markowitz@gmail.com> wrote:
> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
> of 101 BASIC Games (DEF FN is the last feature to add).
>
> Then I got to Super Star Trek. To save memory, SST removes most spaces, so
> lines look like this:
>
> 100FORI=1TO10
>
> Here's my current patterns that match bits of this line:
>
> FOR         { return FOR; }
>
> [:,;()\^=+\-*/\<\>]     { return yytext[0]; }
>
> [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
>               yylval.d = atof(yytext);
>               return NUMBER;
>             }
>
> "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
>               yylval.s = g_string_new(yytext);
>               return IDENTIFIER;
>             }
>
> These correctly pick out some parts, numbers and = for instance, so it sees:
>
> 100 FORI = 1 TO 10
>
> The problem is that FORI part. Some BASICs allow variable names with more than
> two characters, so in theory, FORI could be a variable. These BASICs outlaw
> that in their parsers; any string that starts with a keyword exits then, so
> this would always parse as FOR. In lex, FORI is longer than FOR, so it returns
> a variable token called FORI.
>
> Is there a way to represent this in lex? Over on Stack Overflow the only
> suggestion seemed to be to use trailing syntax on the keywords, but that
> appears to require modifying every one of simple patterns for keywords with
> some extra (and ugly) syntax. Likewise, one might modify the variable name
> pattern, but I'm not sure how one says "everything that doesn't start with one
> of these other 110 patterns".

Two ideas:

1. Just forget recognizing variable names in the lexer. Instead,
recognize only the constituent letter of a variable name in the lexer.
Then in the parser, have a grammar production which converts
the letters of a variable into a variable.

   variable : VARCHAR
            | variable VARCHAR
            ;

2. Use regex patterns in the lexer to recognize just the keywords,
as a above.  Then, recognition of variable names is handled by
matching just one letter A-Z, whose lex action performs ad-hoc
lexical analysis using C logic. At that point you know that you do not
have a keyword, because no keyword rule matched. You can read
characters using YYIN and accumulate a variable name.

A variant of technique (2) is used for scanning C comments,
as an alternative to an ugly regular expression:

  "/*"  {
          int c;

          while ((c = yyinput()) != 0)
          {
            if (c == '\n') {
              /* increment line number or something */
            }
            else if (c == '*')
            {
              if ((c = yyinput()) == '/')
                break;
              else
                unput(c);
            }
          }
        }

The above is an adaptation of something from an old Flex manual.
IIRC the Dragon Book has a similar example of ad-hoc logic
in a lex rule for handling C comments.

You can see that it's a similar idea. We use a regex to partially match
the comment, just the /* opening. Then we take over from there.

I have a hunch this would work for fetching variables like FORI, when
there is no match on a keyword like FOR.

--
TXR Programming Lanuage: http://nongnu.org/txr
Music DIY Mailing List:  http://www.kylheku.com/diy
ADA MP-1 Mailing List:   http://www.kylheku.com/mp1

[toc] | [prev] | [next] | [standalone]


#2469 — Re: Languages with optional spaces and tools

FromHans-Peter Diettrich <DrDiettrich1@netscape.net>
Date2020-02-28 20:16 +0100
SubjectRe: Languages with optional spaces and tools
Message-ID<20-02-032@comp.compilers>
In reply to#2453
Am 19.02.2020 um 16:35 schrieb Maury Markowitz:

After reading the entire tree, I feel a need for adding my $0.02:

> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc.

If all you have is a hammer, everything looks like a nail :-(


> 100 FORI = 1 TO 10
>
> The problem is that FORI part. Some BASICs allow variable names with more than
> two characters, so in theory, FORI could be a variable. These BASICs outlaw
> that in their parsers; any string that starts with a keyword exits then, so
> this would always parse as FOR. In lex, FORI is longer than FOR, so it returns
> a variable token called FORI.

Your problem may be the use of lex/yacc! Some older languages are not so
regular and context free and have their own rules for disambiguation.
Such old lexers also can be non-greedy or follow other non-lex rules. So
put aside your hammer and try other tools or your mere fingers...


> [Having written Fortran parsers, not that I've ever found.  I did a prepass
> over each statement to figure out whether it was an assignment or something
> else, then the lexing was straightforward if not pretty. -John]

Right, the FORI problem was observed first with FORTRAN, and also was
solved at that time.

When I learned computer science, in the early 70's, the use of parser
generators was almost restricted to the scientific area, for their own
continued development, not for use in compiler construction. At least
the development of a correct formal grammar for those old languages
often; if possible at all; took more time than handcrafting a compiler
front-end.

Also most languages came with a rather informal description of their
terminals, with a more or less complete formal grammar restricted to
their parser.

DoDi

[toc] | [prev] | [next] | [standalone]


#2478

Fromgah4@u.washington.edu
Date2020-03-02 21:12 -0800
Message-ID<20-03-007@comp.compilers>
In reply to#2453
On Wednesday, February 19, 2020 at 8:24:02 AM UTC-8, Maury Markowitz wrote:
> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc.

> I have it mostly working for a good chunk
> of 101 BASIC Games (DEF FN is the last feature to add).

> Then I got to Super Star Trek. To save memory, SST removes most spaces, so
> lines look like this:

> 100FORI=1TO10

The first BASIC systems I ever used were the HP TSB2000 systems
that were very popular in the 1970's.  These systems tokenize each
line on entry, and also syntax check it.  If a line fails, it isn't
even stored. Extra spaces or none, the line is stored the same way.
(And this is the system that many of the original Star Trek games
were written on.) Also, numeric constants are converted to their
internal representation.  The LIST command converts the tokenized form
back to text form, and adds appropriate blank space.

Note also that the TSB2000 systems only allow the traditional
single letter, or number-digit, form for variable names.

The microcomputer BASIC systems that I remember tokenize lines, but
don't do any more checking on them. As noted, blanks are stored and
used when the LIST command displays the program.

Looking at the GW-BASIC manual:

http://www.divonasperi.it/divona/tam/tecnologia/dow-all/GW%20Basic%20(inglese).pdf

and the PC-BASIC manual:

https://robhagemans.github.io/pcbasic/doc/1.2/#memory

(the latter intended to be bug-for-bug compatible with the MS version.)

I don't see in either manual mention of the effects, or lack thereof,
of blanks in statements.  There is explanation of reserved words,
and their possible use in variable names. It seems to me that
the treatment of missing blanks is an accident of the parser design.

Bug-for-bug emulation, then, needs to find all those accidents
and implement them.

[toc] | [prev] | [next] | [standalone]


#2510

FromGene <gene.ressler@gmail.com>
Date2020-04-14 10:08 -0700
Message-ID<20-04-007@comp.compilers>
In reply to#2453
On Wednesday, February 19, 2020 at 11:24:02 AM UTC-5, Maury Markowitz wrote:
> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc. ...
>
> The problem is that FORI part. Some BASICs allow variable names with more than
> two characters, so in theory, FORI could be a variable. ...

> Is there a canonical cure for this sort of problem that isn't worse than the
> disease?

Flex regexes alone don't have enough power. But you could add a conditional
REJECT when a normal ID match has a prefix that's a keyword. (This could be
made fast with a trie of that's an issue.) REJECT causes fallback to the "next
best" rule, which seems to be what you want.

[toc] | [prev] | [next] | [standalone]


Page 1 of 2  [1] 2  Next page →

Back to top | Article view | comp.compilers


csiph-web