Re: Languages with optional spaces

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups	comp.compilers
Subject	Re: Languages with optional spaces
Date	2020-03-02 08:33 +0200
Organization	Compilers Central
Message-ID	<20-03-005@comp.compilers> (permalink)
References	<20-02-015@comp.compilers> <20-02-033@comp.compilers> <20-02-034@comp.compilers> <20-03-003@comp.compilers>

Show all headers | View raw

"Ev. Drikos" <drikosev@gmail.com> asked some very good questions to
which I don't know if I have good answers.  But, because he is
seriously trying to address the question, I am going to try and help.
I leave it up to him and you (other readers) whether my help is
actually helpful or not. So, these are just my thoughts.

And let's focus on the "TO" question.  I think his partial solution
works fine for the BASIC keywords at the start of statements because
they always need to start the token, but "TO", "THEN", and "ELSE" are
a different problem, They occur mid-statement.

As far as I can tell (and I haven't thought seriously about BASIC for
years), there are roughly two cases.

1) We have already identified the lower bound in "lower bound TO upper
bound".  This can happen if we saw a number, or maybe an identifier
without the word "TO" in it (followed by a space?).  This will not be
true, if we just saw an operator such as "+" as we will be expecting
another identifier to continue the expression.  So, knowing what the
parser expects is definitely important here.

In this case, the "TO" must be the next thing we see.  So, we are back
to the beginning of the statement problem, and we can use the solution
already proposed as is.

2) We need an identifier to complete the expression "lower bound TO
upper bound".  In this case, we need at least one letter (assuming
that we aren't allowed to start identifiers with digits) before the
word "TO" and the "TO" must be followed by a space, a letter or a
digit, but not an operator character.  So, you need a regular
expression that looks something like this /[A-Za-z0-9]+TO[A-Za-z0-9 ]+/
note the extra space in the regex class following the "TO".
Hmm, you might need a left paren in that regex.  The problem here is
that you are basically expecting the lexer to do part of the parser's
job, which is why such things tend to be very messy and likely fraught
with errors or corner cases that are surprising and don't do what you
want.

Now, if you manage to get an appropriate regex, then you need to break
that down into tokens.  If you used a PCRE like regex package to do
so, you might be able to use captures to do that for you.

Note, if you are using the lex/yacc interface where the parser calls
the lexer and expects it to return one token each time, you need to
insert a buffer so that you have a place to push tokens that you want
to return on subsequent calls.  Coroutines are a better solution, but
they have a slightly different API.   We use a coroutine model
(implemented with a buffer) in Yacc++ between our lexer and parser, to
specifically deal with this (and to allow our lexer/parser to work in
event driven systems where the flow of control is inverted).

-----

So, what would I do?  I would do it in 3 parts.  A lexer, a "fixer",
and a parser.  The lexer would just do the easy cases, returning
identifiers separated by spaces (and punctuation), but possibly
pulling off initial keywords at the start of statements (and probably
only then).

The "fixer" would be run on tokens in the middle of the statement and
try to find these hard cases.  It might need to know what the parser
expects to get them right.  If so, I would use a PEG parser (like
packrat?) to get parsing capacity.  If not, or if I can somehow direct
the fixer with info from the parser (or use something like lexer
states), I might use a PCRE scanner to pull things apart.

Curiously, I suspect the fixer for all three special keywords looks
the same.  They all have similar context information.  I don't know if
that makes writing the relevant regex easier.

After the fixer, the parser could probably be simple again.

----

Worth noting is once-upon-a-time, Terence Parr (the PCCTS/ANTLR guy)
proposed and perhaps implemented a structure like this, where there
were "channels" and "filters" that could go between the lexer and the
parser.  (I copied him on this reply in case, he no longer reads this
newsgroup.)

You see a similar problem in implementing a C preprocessor.  You have
tokens for a lexer, but you need to re-process them into different
tokens after doing macro substitution.  So, this is not a one-off
problem.  It reoccurs in different contexts.

However, I think a general solution is likely to be hard to formalize.
That doesn't mean we should stop trying.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

Languages with optional spaces Maury Markowitz <maury.markowitz@gmail.com> - 2020-02-19 07:35 -0800
  Re: Languages with optional spaces Jerry <awanderin@gmail.com> - 2020-02-20 23:38 -0700
    Re: Languages with optional spaces Maury Markowitz <maury.markowitz@gmail.com> - 2020-02-25 06:13 -0800
      Re: Languages with optional spaces awanderin <awanderin@gmail.com> - 2020-02-26 10:03 -0700
    Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-03-12 17:45 +0200
  Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-02-23 12:33 +0200
    Re: Languages with optional spaces Martin Ward <martin@gkc.org.uk> - 2020-02-25 17:00 +0000
      Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-02-28 13:34 +0200
    Re: Languages with optional spaces Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-02-29 11:48 +0200
      Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-02-29 21:38 +0200
        Re: Languages with optional spaces Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-03-01 10:07 +0200
        Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-03-01 19:41 +0200
          Re: Languages with optional spaces Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-03-02 08:33 +0200
            Re: Languages with optional spaces "Ev. Drikos" <drikosev@gmail.com> - 2020-03-02 20:04 +0200
      Re: Languages with optional spaces Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-03-01 00:28 +0100
  Re: Languages with optional spaces Maury Markowitz <maury.markowitz@gmail.com> - 2020-02-25 06:11 -0800
  Re: Languages with optional spaces Kaz Kylheku <493-878-3164@kylheku.com> - 2020-02-26 08:06 +0000
  Re: Languages with optional spaces and tools Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-02-28 20:16 +0100
  Re: Languages with optional spaces gah4@u.washington.edu - 2020-03-02 21:12 -0800
  Re: Languages with optional spaces Gene <gene.ressler@gmail.com> - 2020-04-14 10:08 -0700
  Re: Languages with optional spaces mertesthomas@gmail.com - 2020-04-19 04:04 -0700
  Re: Languages with optional spaces aston.goldsmith@gmail.com - 2020-05-05 13:05 -0700

csiph-web