Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: luser droog <luser.droog@gmail.com>
Newsgroups: comp.compilers
Subject: Re: Why does the lexer convert text integer lexemes to binary integers? I thought that lexers should be simple?
Date: Thu, 21 Jul 2022 14:16:02 -0700 (PDT)
Organization: Compilers Central
Lines: 47
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-044@comp.compilers>
References: <22-07-011@comp.compilers> <22-07-030@comp.compilers> <22-07-036@comp.compilers> <22-07-040@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="78207"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, parse, design
Posted-Date: 22 Jul 2022 13:01:14 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3133

On Wednesday, July 20, 2022 at 7:45:07 PM UTC-5, gah4 wrote:
> On Monday, July 18, 2022 at 9:30:51 AM UTC-7, gah4 wrote:
>
> (snip, or moderator wrote)
> > [In my experience separating the lexer from the parser makes it a lot easier
> > to deal with common lexical situations like skipping white space and comments.
> > You could certainly do that in a combined scheme but I'm not sure it would end
> > up any simpler. -John]
> Interesting. As I previously noted, STEP mostly doesn't do a separate lexical analysis.
>
> It does, however, do three things before the macros see the input: convert multiple
> blanks to a single blank, pass apostrophed strings through whole, and remove
> comments delimited by double quotes.
>
> Apostrophed strings are slightly more interesting. Internal double apostrophes
> are converted to single apostrophes, and the delimiting apostrophes are
> converted to a special character that isn't an input character.
>
> One of my projects 45 years ago, was to write macros to recognize the
> syntax of IBM OS/360 Fortran IV. Direct access I/O statements use
> a single apostrophe to delimit the record number:
>
> READ(1'N) X,Y,Z
>
> There is no way to write macros for that syntax after the previous processing.
>
> Much fun figuring out all the strange things done in programming language
> syntax over the years.

This approach appears to offer a very nice simplification for most Algol-style
languages. But discarding the white space entirely makes it harder (or impossible)
to parse languages like Python and Haskell, which use the "offside rule":
indentation is significant and delimits multi-line constructs.
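
To make the problem concrete, here is a small Python sketch. It assumes a
preprocessing step that collapses every run of white space to one blank
(the helper name collapse_blanks is made up for this illustration). Two
programs that Python parses differently become identical after that step.

```python
import ast
import re

def collapse_blanks(text):
    # Hypothetical preprocessing step: every run of white space
    # (including newlines and indentation) becomes one blank.
    return re.sub(r"\s+", " ", text).strip()

# Same words, different indentation on the last line.
inside = "if x:\n    y = 1\n    z = 2\n"   # z = 2 belongs to the if
outside = "if x:\n    y = 1\nz = 2\n"      # z = 2 follows the if

# After collapsing, nothing distinguishes the two programs...
same_after_collapse = collapse_blanks(inside) == collapse_blanks(outside)

# ...although Python itself builds different syntax trees for them.
same_tree = ast.dump(ast.parse(inside)) == ast.dump(ast.parse(outside))

print(same_after_collapse, same_tree)
```

The indentation is the only thing that says where the "if" body ends, so
a front end that throws it away cannot recover the structure later.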

I haven't solved this completely, but I've been building my parser combinators
with an eye towards supporting significant white space in the syntax analysis,
while letting most of the grammar ignore it.

It's parsers all the way down, but the parsers are designed to operate over lists,
so the infrastructure is agnostic as to the actual type of the elements of the list.
So, you can build the lexical analysis layer as a graph of parsers that work on
lists of integers (characters) and produce Symbol objects. The syntax analysis
layer can then be built as a graph of parsers that work on lists of Symbols.
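
As a rough illustration of the layering (a Python sketch, not the actual
implementation; every name here, such as satisfy, seq, alt, many and the
Symbol fields, is invented for the example), the same few combinators can
be used over a list of integers and then over a list of Symbols:

```python
from dataclasses import dataclass

# A parser is a function (items, pos) -> (value, new_pos), or None on failure.
# Nothing below depends on what kind of elements the list holds.

def satisfy(pred):
    def parse(items, pos):
        if pos < len(items) and pred(items[pos]):
            return items[pos], pos + 1
        return None
    return parse

def seq(*parsers):
    def parse(items, pos):
        values = []
        for p in parsers:
            r = p(items, pos)
            if r is None:
                return None
            value, pos = r
            values.append(value)
        return values, pos
    return parse

def alt(*parsers):
    def parse(items, pos):
        for p in parsers:
            r = p(items, pos)
            if r is not None:
                return r
        return None
    return parse

def many(p):
    def parse(items, pos):
        values = []
        while True:
            r = p(items, pos)
            if r is None or r[1] == pos:   # stop on failure or no progress
                return values, pos
            value, pos = r
            values.append(value)
    return parse

def fmap(p, f):
    def parse(items, pos):
        r = p(items, pos)
        return None if r is None else (f(r[0]), r[1])
    return parse

def some(p):
    # One or more repetitions, flattened into a single list.
    return fmap(seq(p, many(p)), lambda r: [r[0]] + r[1])

@dataclass
class Symbol:
    kind: str
    text: str

def symbol_of(kind):
    return lambda codes: Symbol(kind, "".join(map(chr, codes)))

# Lexical layer: parsers over integers (character codes) producing Symbols.
digit = satisfy(lambda c: ord("0") <= c <= ord("9"))
letter = satisfy(lambda c: chr(c).isalpha())
number = fmap(some(digit), symbol_of("num"))
name = fmap(some(letter), symbol_of("name"))
plus = fmap(satisfy(lambda c: c == ord("+")), lambda c: Symbol("op", "+"))
lexer = many(alt(number, name, plus))

# Syntax layer: the same combinators, now over a list of Symbols.
def kind(k):
    return satisfy(lambda s: s.kind == k)

operand = alt(kind("num"), kind("name"))
sum_expr = seq(operand, many(seq(kind("op"), operand)))

symbols, _ = lexer([ord(c) for c in "x+42"], 0)
tree, end = sum_expr(symbols, 0)
```

Here lexer turns the character codes of "x+42" into three Symbols, and
sum_expr consumes all three. The combinators never inspect the elements
themselves; only the predicates handed to satisfy do.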

Each Symbol object has an extra slot for stashing additional data, so the white
space can be captured and then hidden from the syntax analysis (unless some
handler or predicate function wants to peek in there).
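
A minimal Python sketch of that idea follows. The field name extra, the
helpers starts_line and indent_of, and the use of a regular expression
for the lexing step (chosen only to keep the example short) are all
assumptions made for illustration, not the actual code.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Symbol:
    kind: str
    text: str
    # Hypothetical extra slot: the white space that preceded this symbol.
    # compare=False keeps it out of equality tests, so the syntax layer
    # can match symbols without ever looking at it.
    extra: str = field(default="", compare=False)

TOKEN = re.compile(
    r"(?P<ws>\s*)(?:(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[^\s\w]))"
)

def lex(text):
    symbols = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:   # trailing white space or unlexable input: stop
            break
        kind = next(k for k in ("num", "name", "op") if m.group(k) is not None)
        symbols.append(Symbol(kind, m.group(kind), m.group("ws")))
        pos = m.end()
    return symbols

# Predicates that peek into the slot when the grammar cares about layout.
def starts_line(sym):
    # True if a newline came before the symbol. The very first symbol of
    # the input is not treated specially in this sketch.
    return "\n" in sym.extra

def indent_of(sym):
    # Width of the white space after the last newline; only meaningful
    # when starts_line(sym) is true.
    return len(sym.extra.rsplit("\n", 1)[-1])

symbols = lex("if x:\n    y = 1\nz")
line_starts = [(s.text, indent_of(s)) for s in symbols if starts_line(s)]
```

A parser that ignores layout just compares kinds and texts, so
symbols[3] == Symbol("name", "y") holds even though the slot contains a
newline and four blanks. A parser implementing the offside rule can call
indent_of on the same object.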