Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Kaz Kylheku <480-992-1380@kylheku.com> Newsgroups: comp.compilers Subject: Re: What stage should entities be resolved? Date: Fri, 18 Mar 2022 17:50:22 -0000 (UTC) Organization: A noiseless patient Spider Lines: 194 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-03-037@comp.compilers> References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="29574"; mail-complaints-to="abuse@iecc.com" Keywords: design Posted-Date: 18 Mar 2022 14:00:37 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2939 On 2022-03-15, Roger L Costello wrote: > But if PI is inside a quoted string: > > "Today is PI day" > > then the preprocessor does not replace PI. This was not historically true, though. I know for a fact that ancient versions of the C preprocessor would have replaced PI in character constants, as in 'PI'. Possibly, some versions of it also didn't care about quotes. > 1. How much knowledge of the language should the preprocessor stage have? A preprocessor doesn't have to have any knowledge of the language being preprocessed. If the preprocessor is universal, like m4, that is practically a given. If a preprocessor is dedicated to a language, like the C one, it's a good idea for it to work at the token level: tokenize its input the same way, or at least very similarly, to the target language. In C, a preprocessor token is a kind of superset of a proper token. There are pp-number tokens that are not number tokens, for instance, but not vice versa. > 2. How much knowledge of the language should the lexical analysis stage have? > > 3. How much knowledge of the language should the syntax analysis stage have? The traditional lexical/syntax division comes from a blueprint for programming language processing laid down in the work of computer scientists done in the 1960's and 1970's. It is not necessary; a context-free grammar can handle the entire syntax down to the token level, like a decimal integer being a sequence of digits and whatnot. The division into lexical analysis and parsing comes from several pragmatic observations: 1. There are dedicated algorithms that are good for tokenizing. Tokens exhibit regular syntax: they can be recognized using finite automata (regular expressions). 2. There are efficient parsing techniques like LALR(1) that assume that there is one symbol of lookahead. If you use LALR(1) with a scanner, you get one *token* of lookahead. If you use LALR(1) without a scanner (token == character) then you get one *character* of lookahead, which is going to be crippling. Thus, separating lexical analysis effectively amplifies LALR(1) into LALR(k), for some k related to maximum token character length. E.g. if you design a C parser without a scanner, over the character level syntax, when that parser sees the character "do", it has no idea whether it's looking at the "do" keyword, the start of the "double" keyword, or something else like "dongle". One character of lookahead will confirm or eliminate "do", but not resolve "double" versus "dongle". So to answer the questions, if you're assuming that you're going to be using the traditional framework, with a regular-expression-driven lexer and a LR parser with 1 token of lookahead, the way you divide the work, roughly, is by identifying the token-like elements in the language that have regular syntax. Anything that exhibits nesting, requiring rule recursion, will be farmed off to the parser. Your decision could sometimes also be informed by the lookahead concern; you could choose to clump together several items that might plausibly be individual tokens into a "supertoken" if there is some parsing advantage in it, or other simplification. Other kinds of decisions interact with the language definition. For instance, what is -1234? In Common Lisp, and most other Lisp dialect, I suspect, that is a token denoting a negative integer. It may not be written - 1234. In C, and languages imitating its syntax, -1234 is a unary expression: the unary - operator applied to the integer 1234, and so there are two tokens. They may be separated by whitespace, or a comment. This strikes at the language definition, becuase what it implies is that C does not have negative constants. And that causes a subtle problems. Say the int type is 32 bits wide. The most negative 32 bit two's complement integer is -2147483648. But in C this is actually the negation of 2147483648 which does not fit into the type int. The resulting constant expression will not have type int. Defining the INT_MIN constant in such that the expression has type int, and doesn't contain any casts requires something like: #define INT_MIN (-2147483647 - 1) > 4. How much knowledge of the language should the semantic analysis stage have? All knowledge that the previous stages do not have, minus knowledge that only the application domain has. :) > > To make the questions concrete, consider this XML: > > > > Should the lexical analysis stage know that the foo in is a > start tag (STAG) and the foo in is an end tag (ETAG)? XML has structure inside tags, namely attributes. Various implementation choices present themselves. One possibility is to recognize the special cases , and as a special token categories, e.g ELEM_START_TRIVIAL ELEM_END ELEM_COMPLETE All three token categories will carry a semantic attribute on the token which gives the identity of the element, e.g "abc". Then the case like The " could produce ELEM_START_OPEN followed immediately by ELEM_START_CLOSE. > Or should the lexical analysis stage simply identify > the foo in as a name (NAME) and the foo in as a name > (NAME)? That would mean the lexical analysis stage has lesser > knowledge of the XML language. How much knowledge of the XML language > should the lexical analysis stage have? I would say that because the attributes have special syntax with quoting like foo="bar", it would be better for the lexer to be aware of the tag structure, so that it can switch to a mode for precisely handling attributes. Consider foo="bar". Do you want to tokenize the attribute foo="bar" in the same way as the interior of the element? Chances are you want to turn the entire content of the element, into a single text datum, if it doesn't contain any sub-elements. The quotes and equal sign are not special in any way. Also, attributes can have angle brackets; so that is to say, I think is valid. If the tokenizer is oblivious to all this, it will lead to ugly complexity in the parser. As a final general remark, languages are often defined at multiple description levels which can be identified as lexical and syntactic. If you're implementing, it behooves you to follow those separations in the spec, except when your own experience tells you to adjust the interpretation here and there; like you recognize that some syntactic aspect is nicely handled by your scanner or vice versa. The W3C's definition of XML uses one big grammar, without regard for separation into lexing and parsing. However, you can recognize regular elements in it. Consider this grammar rule: Comment ::= '' where: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. * Note that this is entirely regular: we can convert the Comment match entirely into the rules for a lexical analyzer, and therefore Comment can be a token. So that would be the main principle here: scan the tree structur of the XML grammar and look for the tallest subtrees that are entirely regular, considering those to be token candidates. Or something like that. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal