From: Christopher F Clark
Newsgroups: comp.compilers
Subject: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Sat, 11 Jun 2022 23:45:26 +0300
Organization: Compilers Central
Message-ID: <22-06-037@comp.compilers>
References: <22-06-023@comp.compilers>
Keywords: lex, parse, design

Since Terence borrowed that idea from our version of Yacc++, I feel
qualified to answer. (And we in turn borrowed plenty of ideas from
ANTLR, so it's all fair.)

First, they aren't really merged. The scannerless parser people often
do merge the two, but ANTLR still keeps them separate: it generates a
lexer and a parser as two distinct entities, and there are slight
differences between them (as there should be). You can even write them
as separate files (and compile them separately) if you want.

All that happens is that the two parts use roughly the same notation
(and if you don't need the lexer-specific features, it is essentially a
subset of the parsing language, and vice versa). So you learn that
notation once and you know it for both parts.

Moreover, by combining the two parts in one file, you know that the
parts "go together" and you have fewer problems with mismatches,
especially the kind where you update one part but are left with an
"old" version of the other that doesn't quite match.
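As a rough illustration of "one specification, two phases", here is a
minimal Python sketch. It is not how ANTLR or Yacc++ is implemented; the
names TOKEN_RULES, SKIPPED, build_lexer and build_parser are hypothetical
and exist only for this example. A single table of token rules feeds both
a lexer and a parser, which still run as separate steps.

```python
import re

# Hypothetical shared specification: both phases are built from this one table,
# so the token names cannot drift apart between lexer and parser.
TOKEN_RULES = [("NUMBER", r"\d+"), ("PLUS", r"\+"), ("WS", r"\s+")]
SKIPPED = {"WS"}  # tokens the lexer drops before the parser sees them


def build_lexer(rules, skipped):
    """Return a lex(text) function that yields (kind, text) pairs."""
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in rules))

    def lex(text):
        pos, tokens = 0, []
        while pos < len(text):
            m = pattern.match(text, pos)
            if m is None:
                raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
            if m.lastgroup not in skipped:
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        return tokens

    return lex


def build_parser(rules):
    """Return a parse(tokens) function for: expr : NUMBER (PLUS NUMBER)* ;"""
    known = {name for name, _ in rules}
    # Checked at build time: the parser may only use tokens the spec defines.
    for needed in ("NUMBER", "PLUS"):
        if needed not in known:
            raise ValueError(f"grammar uses undefined token {needed}")

    def parse(tokens):
        def expect(i, kind):
            if i >= len(tokens) or tokens[i][0] != kind:
                raise SyntaxError(f"expected {kind} at token {i}")
            return tokens[i][1]

        total = int(expect(0, "NUMBER"))
        i = 1
        while i < len(tokens):
            expect(i, "PLUS")
            total += int(expect(i + 1, "NUMBER"))
            i += 2
        return total

    return parse


lex = build_lexer(TOKEN_RULES, SKIPPED)   # phase 1: characters -> tokens
parse = build_parser(TOKEN_RULES)         # phase 2: tokens -> result
```

Calling parse(lex("1 + 20+3")) runs the two phases in sequence and returns
24. Removing a rule from TOKEN_RULES makes build_parser fail immediately,
which is the kind of mismatch a single combined file is meant to catch
early.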

Combining the two parts in one file also allows you to introduce
"tokens" (especially "keywords") in the parser. (Note that you lose this
if you compile the parts separately.)

A slightly more advanced version of that allows you to have tokens that
are only matched in certain contexts. ANTLR has some of this
implemented, so if there is a place in your parser where you want to
match ">" without worrying about ">>", you can use the literal token in
the grammar and it will override the longest-match rule, provided it
doesn't introduce a conflict. (You also lose this feature with separate
compilation.)

So, note that by keeping the implementations separate (they really are
two phases), you have kept item 1 (simplicity of design). Your parser
never sees whitespace and comments, unless you want it to.

You can still get item 2 (efficiency) with this implementation, though I
don't know whether the ANTLR-generated lexer actually does so.

Since ANTLR is Unicode based, item 3 (portability) is not an issue.

All of this is why we merged the two files into one in Yacc++, and ANTLR
started doing the same thing. We also merged them because our original
target was a syntax-directed editor version of Emacs, and these ideas
made sense in that regard.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------