Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Newsgroups: comp.compilers
Subject: Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Fri, 10 Jun 2022 12:26:38 +0200
Organization: Compilers Central
Lines: 28
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-032@comp.compilers>
References: <22-06-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="89359"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, parse, design, comment
Posted-Date: 10 Jun 2022 11:43:47 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-023@comp.compilers>
Xref: csiph.com comp.compilers:3065

On 6/9/22 4:52 PM, Roger L Costello wrote:

> Those seem like compelling reasons for separating the lexical analysis from
> parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?

I cannot answer the ANTLR question but want to point out why IMO
(traditional) languages based on tokens should have a tokenizer in the
first place.

We encountered a problem with scannerless parsers and traditional
programming languages in Meta§. A tokenizer can use a couple of
whitespace or control characters or tokens as token separators. Without
such a rule it's very hard to flag all places in a grammar where
whitespace is *required* as a token separator.

As an example: in an "else if" sequence whitespace is required between
"else" and "if" while in other context e.g. "else{" whitespace is not
required. This problem may be solved by a regex, but why should a
grammar be inflated by adding a token termination clause to *each* keyword?

Fortran is another example where keyword recognition requires
backtracking or similar means due to ignorance of spaces e.g. in the
well known "DO10I = 1," snippet.

DoDi
[A separate lexer also makes it a lot easier to skip comments.  For Fortran
you had to do a prepass to see whether a statement was an assignment or something
else but after that you could tokenize without backtracking. -John]