From: Hans-Peter Diettrich
Newsgroups: comp.compilers
Subject: Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Fri, 10 Jun 2022 12:26:38 +0200
Organization: Compilers Central
Message-ID: <22-06-032@comp.compilers>
References: <22-06-023@comp.compilers>
Keywords: lex, parse, design, comment

On 6/9/22 4:52 PM, Roger L Costello wrote:
> Those seem like compelling reasons for separating the lexical analysis from
> parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?

I cannot answer the ANTLR question, but I want to point out why, IMO, (traditional) token-based languages should have a tokenizer in the first place.

We encountered a problem with scannerless parsers and traditional programming languages in Meta§. A tokenizer can treat whitespace, control characters, or other tokens as token separators. Without such a rule it is very hard to flag every place in a grammar where whitespace is *required* as a token separator. For example, in an "else if" sequence whitespace is required between "else" and "if", while in other contexts, e.g. "else{", it is not. This problem can be solved with a regex, but why should a grammar be inflated by adding a token termination clause to *each* keyword?

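To make the "else if" versus "else{" point concrete, here is a minimal sketch of a separate tokenizer in Python. The keyword set, the token kinds, and the function name `tokenize` are made up for illustration and are not taken from ANTLR or Meta§. The longest-match rule for words does the job that a per-keyword termination clause would otherwise have to do in a scannerless grammar.

```python
import re

# Hypothetical keyword set, for illustration only.
KEYWORDS = {"if", "else", "while"}

# One alternative per token class; a word is matched greedily (longest match).
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<punct>[{}();])
""", re.VERBOSE)

def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "ws":
            continue  # whitespace only separates tokens
        lexeme = m.group()
        if kind == "word":
            # Keywords are classified after the whole word has been read,
            # so "elseif" never splits into "else" + "if".
            kind = "keyword" if lexeme in KEYWORDS else "ident"
        tokens.append((kind, lexeme))
    return tokens
```

With this sketch, "else if" yields two keyword tokens, "else{" yields a keyword followed by a punctuation token, and "elseif" yields a single identifier, all without the grammar mentioning whitespace anywhere.
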
Fortran is another example, where keyword recognition requires backtracking or similar means because spaces are not significant, e.g. in the well-known "DO10I = 1," snippet.

DoDi

[A separate lexer also makes it a lot easier to skip comments. For Fortran you had to do a prepass to see whether a statement was an assignment or something else, but after that you could tokenize without backtracking. -John]
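
The kind of prepass mentioned in the moderator's note might look roughly like the following Python sketch. It is only an illustration under simplifying assumptions: the function name `is_do_statement` is hypothetical, only the classic labelled DO form is considered, and character or Hollerith constants (which could hide an "=" or ",") are ignored. The idea is that a statement starting with DO is a loop only if a comma appears outside parentheses after the "=".

```python
def is_do_statement(stmt):
    """Hypothetical prepass: True if a fixed-form statement is a DO loop.

    Simplification: character and Hollerith constants are not handled.
    """
    # Spaces are insignificant in fixed-form Fortran, so drop them first.
    s = stmt.replace(" ", "").upper()
    if not s.startswith("DO"):
        return False
    depth = 0            # current parenthesis nesting level
    seen_equals = False  # have we passed a top-level "="?
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0:
            if ch == "=":
                seen_equals = True
            elif ch == "," and seen_equals:
                # Top-level comma after "=": loop bounds, so a DO statement.
                return True
    # No top-level comma after "=": an assignment to a variable like DO10I.
    return False
```

Under these assumptions, "DO10I = 1,5" is classified as a DO loop and "DO10I = 1.5" as an assignment to the variable DO10I. Once that decision is made, the statement can be tokenized left to right without backtracking.
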