Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: RE: What stage should entities be resolved? Date: Thu, 10 Mar 2022 12:54:01 +0200 Organization: Compilers Central Lines: 84 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-03-026@comp.compilers> References: <22-03-019@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="38032"; mail-complaints-to="abuse@iecc.com" Keywords: design Posted-Date: 11 Mar 2022 14:49:25 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2930 There is a simple rule of thumb that answers your question. The lexer deals with all issues related to characters. The parser deals with all issues related to tokens. It should rarely look at the spelling (characters) of a token. The semantic analysis phase should deal with all issues related to the AST. It should rarely look at individual tokens. At each phase, the preceding phase should have resolved the issues at the lower level and at worse converted them into a unique number associated with the output units. So, for example from lexer to parser: That is a character string can be distinguished from an integer from an identifier from a keyword after the lexer and each character string and integer and identifier should have a unique id number, but the value of the integer should rarely be looked at by the parser (except in some very odd and rare cases, e.g. only an integer zero might be allowed in certain contexts, so zero can be distinguished from 42 but the parser doesn't need to know the value of 42, just that it is a different value than 43. So, 42 and 43 should be the same kind of token, just with different unique ids. And if 42.0 and 42.000 are the same, then they should be the same kind of tokens (and may or may not need to have the same unique id). You may or may not care about that distinction until much later, e.g. in the optimizer or code generator. Similarly, from parser to semantic analysis. The trees built should be identified such that the semantic analysis doesn't generally look at the specific tokens, just the trees themselves. Is this an expression or a statement? Does the "type" of the expression map the "type" of variable that is being assigned to, or do we need to insert a conversion or signal a semantic error? Each phase has its own complexities. You don't want to be carrying the complexities of one phase over to the next. Sometimes you have no choice but to do so, but you want to make that rare. That said, you also want to keep each phase as simple as possible. Notice how in my previous posting I mentioned splitting the scanner and the screener into separate parts. So, it might be that resolving special character constructs (e.g. &) might be done outside the scanner per se, but still part of lexing). Figure out what makes the code easiest to read. And, with that implemented and having a working system, then do the measurements that tell you whether you have a bottleneck somewhere that you need to fix (possibly by putting code together that was originally distinct). But having a working and understandable first cut should be your first priority. Then, you have something you can use to validate your refactorings against and make certain you haven't taken an invalid shortcut that breaks something. Moreover, most of the time, the simplest code is also the fastest. You will likely spend more time in the scanner than nearly any other phase, because it has to touch each character. So, the less is has to do with the majority of tokens the better. Simple code that gets the text divided into tokens and roughly categorizes them will be fast at doing so. Then if you have tokens like character strings that need special processing, recognizing "escape sequences", you can fix them up in the code that deals only with character strings, and before you hand those strings off to the parser. Just as you can map identifiers into keywords after figuring out what the identifier is. This is the point of having a "screener" after the scanner, but part of the lexer. ------- One final comment, related to a different post but relevant here. There are times, you might push the result later. The "go" "to" versus "goto" example might be such a case. You can easily make the parser reconize both a two keyword sequence "go" "to" and a one keyword sequence "goto" are synonymous and trying to patch the tokens back in the lexer can be more complex, e.g. some bizarre dialect of BASIC: if i < 5 go to 100; versus for i = go to end { a[i] =2*i } - ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------