From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Sat, 11 Jun 2022 23:45:26 +0300
Organization: Compilers Central
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-037@comp.compilers>
References: <22-06-023@comp.compilers>
Keywords: lex, parse, design
Posted-Date: 11 Jun 2022 19:26:13 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

Since Terence borrowed that idea from our version of Yacc++, I feel
qualified to answer.  (And we in turn borrowed plenty of ideas from
ANTLR, so it's all fair.)

First, they aren't really merged.  The scannerless-parsing people often
merge the two, but ANTLR still keeps them separate: it generates a
lexer and a parser as two distinct entities, and there are slight
differences between them (as there should be).  You can even put them
in separate files (and compile them separately) if you want.

All that happens is that the two parts use roughly the same notation
(and if you don't need the lexer-specific features, the lexer notation
is essentially a subset of the parsing language, and vice versa).  So,
you learn that notation once and you know it for both parts.

Moreover, by combining the two parts in one file, you know that the
parts "go together" and you have fewer problems with mismatches,
especially the kind where you update one part but still have an "old"
version of the other that doesn't quite match.  It also allows you to
introduce "tokens" (especially "keywords") in the parser.  (Note that
you lose this if you compile the parts separately.)
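
To make the "keywords introduced in the parser" point concrete, here is
a toy sketch in Python.  The grammar notation, the rule names, and the
helpers build_token_table and tokenize are all made up for
illustration; this is not how ANTLR or Yacc++ actually does it, just
the general idea that literals quoted in parser rules can be collected
into the lexer's token set when both live in one file.

```python
import re

# Hypothetical one-file grammar: parser rules (lowercase names) and
# lexer rules (uppercase names) side by side, in a made-up notation.
GRAMMAR = {
    "stmt": "'if' expr 'then' stmt | ID '=' expr",
    "expr": "ID | NUMBER",
    "ID": r"[A-Za-z_]\w*",
    "NUMBER": r"\d+",
}

def build_token_table(grammar):
    """Return the explicit lexer rules plus every literal quoted in a parser rule."""
    explicit = {name: pat for name, pat in grammar.items() if name.isupper()}
    literals = set()
    for name, body in grammar.items():
        if not name.isupper():
            literals.update(re.findall(r"'([^']+)'", body))
    return explicit, literals

def tokenize(text, explicit, literals):
    """Longest match wins; on a tie, a parser literal beats a lexer rule."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None  # (length, is_literal, kind, lexeme)
        for lit in literals:
            if text.startswith(lit, pos):
                cand = (len(lit), 1, repr(lit), lit)
                best = cand if best is None else max(best, cand)
        for kind, pat in explicit.items():
            m = re.compile(pat).match(text, pos)
            if m and m.end() > pos:
                cand = (m.end() - pos, 0, kind, m.group())
                best = cand if best is None else max(best, cand)
        if best is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best[2], best[3]))
        pos += best[0]
    return tokens

explicit, literals = build_token_table(GRAMMAR)
print(tokenize("if x then y = 42", explicit, literals))
```

Nobody had to write a separate IF or THEN rule in the lexer section;
the keywords exist because the parser rules mention them.  With two
separately compiled specifications, the lexer has no way to see those
literals.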

A slightly more advanced version of that allows you to have tokens
that are only matched in certain contexts.  ANTLR has some of this
implemented, so if you have a place in your parser where you want to
match ">" but not worry about ">>", you can use the literal token in
the grammar and it will override the longest match rule, provided it
doesn't introduce a conflict.  (You also lose this feature with
separate compilation.)
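
As a rough illustration of context-dependent tokens, the Python sketch
below lets the caller pass the set of literals that are acceptable at
the current point in the parse.  The operator list, the next_token
function, and its "allowed" parameter are assumptions of mine for the
example; I am not claiming this is the mechanism either tool uses
internally.

```python
import re

# Hypothetical operator set for a tiny language with nested generics.
OPERATORS = [">>", ">", "<", ","]
WORD = re.compile(r"\w+")

def next_token(text, pos, allowed=None):
    """Return (lexeme, new_pos) for the token starting at pos.

    allowed is a hypothetical set of operator literals the parser can
    accept in its current context; None means plain longest match.
    """
    candidates = [op for op in OPERATORS if text.startswith(op, pos)]
    if allowed is not None:
        candidates = [op for op in candidates if op in allowed]
    if candidates:
        lexeme = max(candidates, key=len)  # longest match among what is left
        return lexeme, pos + len(lexeme)
    m = WORD.match(text, pos)
    if m:
        return m.group(), m.end()
    raise SyntaxError(f"no acceptable token at position {pos}")

source = "List<List<int>>"
at = source.index(">>")
print(next_token(source, at))                        # plain longest match
print(next_token(source, at, allowed={">", ","}))    # inside a type argument list
```

In the unrestricted call the lexer takes both characters as a shift
operator; in the restricted call it stops after one ">", which is what
you want while closing nested type arguments.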

So, note that by keeping the implementations separate (they really are
two phases), you have kept the dragon book's item 1 (simplicity of
design).  Your parser never sees whitespace and comments, unless you
want it to.  You can still do 2 (specialized techniques for lexer
efficiency) with this implementation; I don't know whether the
ANTLR-generated lexer does so.  Since ANTLR is Unicode-based, 3
(portability) is not an issue.
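
The whitespace-and-comments point is easy to show with a small
hand-written lexer in Python.  The token names, the "//" comment
syntax, and the keep_trivia flag are all choices I made for the
sketch, not anything taken from ANTLR's generated code.

```python
import re

# Hypothetical token set; "//" line comments are an assumption.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<COMMENT>//[^\n]*)
  | (?P<NUMBER>\d+)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<OP>[-+*/=()])
""", re.VERBOSE)

SKIP = {"WS", "COMMENT"}

def lex(text, keep_trivia=False):
    """Yield (kind, lexeme) pairs; trivia is dropped unless keep_trivia is set."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if keep_trivia or m.lastgroup not in SKIP:
            yield m.lastgroup, m.group()
        pos = m.end()

program = "x = 1 // set x\ny = x + 2"
for token in lex(program):
    print(token)
```

The parser that consumes lex(program) only ever sees identifiers,
numbers, and operators.  A tool that does care about trivia (a
pretty-printer, say) can call lex(program, keep_trivia=True) and get
everything.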

All of this is why we merged the two files into one in Yacc++ and
ANTLR started doing the same thing.  We also merged them because our
original target was going to be building a syntax-directed editor
version of Emacs and these ideas made sense in that regard.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------