Path: csiph.com!xmission!news.snarked.org!news.linkpendium.com!news.linkpendium.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Jerry Newsgroups: comp.compilers Subject: Re: Languages with optional spaces Date: Thu, 20 Feb 2020 23:38:51 -0700 Organization: A noiseless patient Spider Lines: 394 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <20-02-016@comp.compilers> References: <20-02-015@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="41015"; mail-complaints-to="abuse@iecc.com" Keywords: lex Posted-Date: 21 Feb 2020 11:33:07 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2454 Maury Markowitz writes: > I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs > like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk > of 101 BASIC Games (DEF FN is the last feature to add). > > Then I got to Super Star Trek. To save memory, SST removes most spaces, so > lines look like this: > > 100FORI=1TO10 > > Here's my current patterns that match bits of this line: > > FOR { return FOR; } > > [:,;()\^=+\-*/\<\>] { return yytext[0]; } > > [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? { > yylval.d = atof(yytext); > return NUMBER; > } > > "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? { > yylval.s = g_string_new(yytext); > return IDENTIFIER; > } > > These correctly pick out some parts, numbers and = for instance, so it sees: > > 100 FORI = 1 TO 10 > > The problem is that FORI part. Some BASICs allow variable names with more than > two characters, so in theory, FORI could be a variable. These BASICs outlaw > that in their parsers; any string that starts with a keyword exits then, so > this would always parse as FOR. In lex, FORI is longer than FOR, so it returns > a variable token called FORI. > > Is there a way to represent this in lex? Over on Stack Overflow the only > suggestion seemed to be to use trailing syntax on the keywords, but that > appears to require modifying every one of simple patterns for keywords with > some extra (and ugly) syntax. Likewise, one might modify the variable name > pattern, but I'm not sure how one says "everything that doesn't start with one > of these other 110 patterns". > > Is there a canonical cure for this sort of problem that isn't worse than the > disease? > [Having written Fortran parsers, not that I've ever found. I did a prepass > over each statement to figure out whether it was an assignment or something > else, then the lexing was straightforward if not pretty. -John] I have an unfinished compiler for Applesoft BASIC (the Apple II BASIC written by Microsoft) that I wrote in Python using PLY (a Yacc-like tool). Since, as you discovered, the BASICs of the 1970s and 1980s often didn't use spaces to separate tokens, I tossed the PLY lexer module out the window and wrote the lexer all myself to interface to the parser generator. I don't have the code up on any website so I'll paste it here and if it provides you with any insight, then great. Essentially, what it does is at every point in the tokenizer, it tries to find a token. This means removing spaces and searching the token table. For certain tokens ("ATN", "AT", "TO"), spaces _are_ significant, and the original parser (in 6502 code) had special case code to handle them. My parser mimics what the original did as well. What my lexer does that the original did not do is convert numbers and strings into lexemes. The original interpreter rescanned these at run-time everytime it saw them. I wanted to hand the parser entire numbers and strings. Here is the code: ----- #!/usr/bin/env python3 """Applesoft compiler lexical analyzer.""" import re import sys def escape_chars(text): """Escape regular-expression special characters.""" for char in '$^+(*=': text = text.replace(char, '\\' + char) return text regular_keywords = [ 'END', 'FOR', 'NEXT', 'DATA', 'INPUT', 'DEL', 'DIM', 'READ', 'GR', 'TEXT', 'PR#', 'IN#', 'CALL', 'PLOT', 'HLIN', 'VLIN', 'HGR2', 'HGR', 'HCOLOR=', 'HPLOT', 'DRAW', 'XDRAW', 'HTAB', 'HOME', 'ROT=', 'SCALE=', 'SHLOAD', 'TRACE', 'NOTRACE', 'NORMAL', 'INVERSE', 'FLASH', 'COLOR=', 'POP', 'VTAB', 'HIMEM:', 'LOMEM:', 'ONERR', 'RESUME', 'RECALL', 'STORE', 'SPEED=', 'LET', 'GOTO', 'RUN', 'IF', 'RESTORE', '&', 'GOSUB', 'RETURN', 'REM', 'STOP', 'ON', 'WAIT', 'LOAD', 'SAVE', 'DEF', 'POKE', 'PRINT', 'CONT', 'LIST', 'CLEAR', 'GET', 'NEW', 'TAB(', 'TO', 'FN', 'SPC(', 'THEN', 'NOT', 'STEP', '+', '-', '*', '/', '^', 'AND', 'OR', '>', '=', '<', 'SGN', 'INT', 'ABS', 'USR', 'FRE', 'SCRN(', 'PDL', 'POS', 'SQR', 'RND', 'LOG', 'EXP', 'COS', 'SIN', 'TAN', 'PEEK', 'LEN', 'STR$', 'VAL', 'ASC', 'CHR$', 'LEFT$', 'RIGHT$', 'MID$'] # AT keyword is handled specially # "ATN" ==> but "AT N" ==> "N" # "ATO" ==> "A" but "AT O" ==> "O" special_keywords_re = re.compile( r'(A *T(?![NO]))|(A *TN)|(A(?= *T *O))', re.IGNORECASE) keywords_re = re.compile('({})'.format( '|'.join([' *'.join(escape_chars(char) for char in kw) for kw in regular_keywords])), re.IGNORECASE) integer_re = re.compile(r'[0-9][0-9 ]*') float_re = re.compile(r'[\d ]*\.[\d ]*(E *[-+]?[\d ]*)?', re.IGNORECASE) data_re = re.compile(r'[^":\n]*("[^\n]*")?[^":\n]*(?=[:\n])') remark_re = re.compile(r'[^\n]*') variable_re = re.compile(r'([A-Za-z] *[A-Za-z0-9 ]*[\$%]?)') string_re = re.compile(r'"([^"]*)" *') space_re = re.compile(r' +') keyword_map = { '&': 'AMPER', '*': 'Times', '+': 'Plus', '-': 'Minus', '/': 'Divide', '<': 'Less', '=': 'Equal', '>': 'Greater', 'CHR$': 'CHR', 'COLOR=': 'COLOR', 'HCOLOR=': 'HCOLOR', 'HIMEM:': 'HIMEM', 'IN#': 'IN', 'LEFT$': 'LEFT', 'LOMEM:': 'LOMEM', 'MID$': 'MID', 'PR#': 'PR', 'RIGHT$': 'RIGHT', 'ROT=': 'ROT', 'SCALE=': 'SCALE', 'SCRN(': 'SCRN', 'SPC(': 'SPC', 'SPEED=': 'SPEED', 'STR$': 'STR', 'TAB(': 'TAB', '^': 'Power'} tokens = ('END', 'FOR', 'NEXT', 'DATA', 'INPUT', 'DEL', 'DIM', 'READ', 'GR', 'TEXT', 'PR', 'IN', 'CALL', 'PLOT', 'HLIN', 'VLIN', 'HGR2', 'HGR', 'HCOLOR', 'HPLOT', 'DRAW', 'XDRAW', 'HTAB', 'HOME', 'ROT', 'SCALE', 'SHLOAD', 'TRACE', 'NOTRACE', 'NORMAL', 'INVERSE', 'FLASH', 'COLOR', 'POP', 'VTAB', 'HIMEM', 'LOMEM', 'ONERR', 'RESUME', 'RECALL', 'STORE', 'SPEED', 'LET', 'GOTO', 'RUN', 'IF', 'RESTORE', 'AMPER', 'GOSUB', 'RETURN', 'REM', 'STOP', 'ON', 'WAIT', 'LOAD', 'SAVE', 'DEF', 'POKE', 'PRINT', 'CONT', 'LIST', 'CLEAR', 'GET', 'NEW', 'TAB', 'TO', 'FN', 'SPC', 'THEN', 'AT', 'NOT', 'STEP', 'AND', 'OR', 'SGN', 'INT', 'ABS', 'USR', 'FRE', 'SCRN', 'PDL', 'POS', 'SQR', 'RND', 'LOG', 'EXP', 'COS', 'SIN', 'TAN', 'ATN', 'PEEK', 'LEN', 'STR', 'VAL', 'ASC', 'CHR', 'LEFT', 'RIGHT', 'MID', 'Greater', 'Equal', 'Less', 'Plus', 'Minus', 'Times', 'Divide', 'Power', 'LParen', 'RParen', 'String', 'Newline', 'Variable', 'Colon', 'Comma', 'Semi', 'Float', 'Integer',) literals_re = re.compile(r'([:;,\(\)\n])') literals_map = { ':': 'Colon', ';': 'Semi', ',': 'Comma', '(': 'LParen', ')': 'RParen', '\n': 'Newline'} class Token: """PLY Token: expected lexer return value for parser.""" def __init__(self, token_type=None, value=None): self.type = token_type self.value = value self.lineno = None self.lexpos = None def __repr__(self): if self.value is not None: return '%s[%s]' % (self.type, repr(self.value)) else: return self.type def __eq__(self, other): return self.type == other.type and self.value == other.value def filter_text(text): """Upper-case and filter out spaces from text.""" return text.replace(' ', '').upper() class Lexer: """ Lexer emulates quirky Applesoft tokenizer as closely as possible. Thus, it diverges from what a normal lexer would do. In particular, spaces are completely insignificant, except within strings, except if it finds "AT" followed by a space and then an "N" appears. That is parsed as "AT N", not "ATN". Fun? ;) """ def __init__(self, debug=False): self.s = '' # input source string self.ix = 0 self.DEBUG = debug def input(self, input_str): """Feed the lexical analyzer.""" self.s = input_str self.ix = 0 def debug(self, token): """Generate debug output.""" if self.DEBUG: print(token) def find_keyword(self, offset=0): """ Find a keyword. Handle the AT keyword in the same way as Applesoft: - "ATN" ==> ATN - "AT N" ==> AT N - "ATO" ==> A TO This special case returns Variable:A. - "AT O" ==> AT O Return Token and input-size consumed on success. Return None, 0 if no token found. """ t = Token() m_keyword = keywords_re.match(self.s[self.ix + offset:]) if m_keyword: t.type = filter_text(m_keyword.group(0)) length = m_keyword.end() if t.type in keyword_map: t.value = t.type = keyword_map[t.type] elif t.type == 'DATA': m_data = data_re.match(self.s[self.ix + offset + length:]) assert m_data, 'DATA statement is munged' t.value = m_data.group(0) length += m_data.end() elif t.type == 'REM': m_remark = remark_re.match(self.s[self.ix + offset + length:]) assert m_remark, 'REM statement is munged' t.value = m_remark.group(0) length += m_remark.end() return t, length m_special_keyword = special_keywords_re.match( self.s[self.ix + offset:]) if m_special_keyword: if m_special_keyword.group(1): # AT t.type = 'AT' elif m_special_keyword.group(2): # ATN t.type = 'ATN' elif m_special_keyword.group(3): # ATO ==> A (TO is parsed later) t.type = 'Variable' t.value = 'A' else: assert False, 'broken special_keyword: {}'.format( m_special_keyword.groups()) length = m_special_keyword.end() return t, length return None, 0 def handle_literal(self, match): """Handle literal token.""" t = Token(literals_map[match.group(1)]) self.ix += match.end() self.debug(t) return t def handle_string(self, match): """Handle string token.""" t = Token('String', match.group(1)) self.ix += match.end() self.debug(t) return t def handle_float(self, match_float): """ Handle floating-point token. Applesoft allows odd floats that must be converted: . ==> 0.0 .E ==> 0.0 123E ==> 123.0 """ f = filter_text(match_float.group(0)) if f.endswith('E'): f = f[:-1] if f == '.': f = '0.0' t = Token('Float', float(f)) self.ix += match_float.end() self.debug(t) return t def handle_integer(self, match_int): """Handle integer token.""" t = Token('Integer', int(filter_text(match_int.group(0)))) self.ix += match_int.end() self.debug(t) return t def token(self): """ Return a token to the parser. The Applesoft parser tokenizes keywords, identifies string literals, and leaves everything else as plain text (numbers, variables, and punctuation). This parser treats the keywords the same way, but also creates special tokens for strings, integers, and floating-point numbers. Look for things in this order: - keywords - special keywords (ATN AT TO) - punctuation (literals) - strings - floats - integers - variables """ m = space_re.match(self.s[self.ix:]) if m: self.ix += m.end() if self.ix >= len(self.s): return None t, length = self.find_keyword() if t: self.ix += length self.debug(t) return t m = literals_re.match(self.s[self.ix:]) if m: return self.handle_literal(m) m = string_re.match(self.s[self.ix:]) if m: return self.handle_string(m) m_float = float_re.match(self.s[self.ix:]) if m_float: return self.handle_float(m_float) m_int = integer_re.match(self.s[self.ix:]) if m_int: return self.handle_integer(m_int) t = Token() m = variable_re.match(self.s[self.ix:]) if m: t.type = 'Variable' # If a keyword is found in any suffix of the variable # name, exclude that suffix. for i in range(1, m.end()): token_ahead, length = self.find_keyword(i) if token_ahead: t.value = filter_text(self.s[self.ix: self.ix + i]) assert t.value, 'Bad variable parse: {}'.format( self.s[self.ix: self.ix + m.end()]) self.ix += i self.debug(t) return t t.value = filter_text(m.group(0)) self.ix += m.end() self.debug(t) return t t.type = 'Char' t.value = self.s[self.ix] self.ix += 1 self.debug(t) return t -- Jerry awanderin at gmail dot com