From: "matt.ti...@gmail.com"
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Sun, 20 Mar 2022 07:32:14 -0700 (PDT)
Organization: Compilers Central
Message-ID: <22-03-044@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers>
Keywords: C, syntax

On Thursday, 17 March 2022 at 14:41:47 UTC-4, Roger L Costello wrote:
> For instance, as I understand it a C preprocessor goes through a C
> program and replaces macros. With this:

The C preprocessor is a completely different language. It compiles (many
would say "transpiles" these days) a C-with-macros file into a
C-without-macros file. C, therefore, is a composition of two languages.

I think this turned out to be a *very bad idea*, except that we didn't know
enough about what the preprocessor would be used for to put these features
in the real language.

Either way, you should not do anything like this, and you should not think
of what the lexer does, or what the parser does, or what semantic analysis
does, as any sort of "text replacement". The output of the lexer is a token
stream, and the output of the parser is an AST, or some other intermediate
representation that is richer than a token stream.
Text needs to be parsed, so if any of these stages produces intermediate
text, you have to start again at lexing.

> Similarly, in XML if &amp; is embedded inside a CDATA section:
>
>
>
> then a preprocessor must not replace &amp; with &. That is, the
> preprocessor must have knowledge about the language: If an XML entity is
> within a CDATA section, then don’t replace it.

Yeah, a preprocessing stage (which would require its own lexer and parser)
is not useful. Don't do it. A lexer can produce a rich token like
ENTITY_REF("&amp;"), which has enough information in it to be useful in
either context.

> 2. How much knowledge of the language should the lexical analysis stage have?
> 3. How much knowledge of the language should the syntax analysis stage have?

As much as you need. Don't worry about it. All of these things are made
with deep knowledge of the language being parsed.

The lexer is simple and fast and divides the text into labelled atomic
units. Use whatever division is convenient. The only real restriction is
that the lexer should not produce anything that isn't atomic (that is,
anything that may need to be subdivided later), or that requires a rich
internal structure (because you don't want to run token text through
another parser), or that can't be recognized with a regular language
(because that's how lexers work).

> Should the lexical analysis stage know that the foo in <foo> is a
> start tag (STAG) and the foo in </foo> is an end tag (ETAG)? That
> would mean the lexical analysis stage has considerable knowledge of
> the XML language. Or should the lexical analysis stage simply identify
> the foo in <foo> as a name (NAME) and the foo in </foo> as a name
> (NAME)?

Neither. Start tags can have a rich internal structure like
<Foo att1="a">. To provide access to that structure, the lexer must produce
smaller tokens. Usually a lexer would translate "<Foo>" into something like
"STAGO, NAME(Foo), TAGC".
When you have attributes, you get something like "STAGO, NAME(Foo),
NAME(att1), EQ, STRING(a), TAGC".

End tags *could* be recognized by the lexer, but they aren't. Programmers
would handle them in a way similar to start tags, just for consistency.
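
To make the token streams above concrete, here is a minimal sketch in Python of a lexer that emits them. The names STAGO, TAGC, NAME, EQ, STRING and ENTITY_REF come from the discussion above; ETAGO, CDATA, TEXT, the Token type, the lex function and the two-mode flag (inside a tag or outside one) are my own assumptions for illustration, not anything a real XML library defines. It handles only a tiny subset of XML.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str = ""


# Patterns tried in order while scanning character data (outside a tag).
CONTENT = re.compile(r"""
    (?P<CDATA><!\[CDATA\[.*?\]\]>)   # whole CDATA section, contents untouched
  | (?P<ETAGO></)                    # opens an end tag
  | (?P<STAGO><)                     # opens a start tag
  | (?P<ENTITY_REF>&\w+;)            # entity reference, kept unresolved
  | (?P<TEXT>[^<&]+)                 # plain character data
""", re.VERBOSE | re.DOTALL)

# Patterns tried in order while scanning inside a tag.
TAG = re.compile(r"""
    (?P<WS>\s+)                      # whitespace between tag parts, skipped
  | (?P<TAGC>>)                      # closes the tag
  | (?P<EQ>=)
  | (?P<STRING>"[^"]*")              # double-quoted attribute value
  | (?P<NAME>[A-Za-z_][\w.-]*)
""", re.VERBOSE)


def lex(src: str) -> Iterator[Token]:
    """Yield tokens for a small XML subset; raise SyntaxError on bad input."""
    pos, in_tag = 0, False
    while pos < len(src):
        m = (TAG if in_tag else CONTENT).match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected {src[pos]!r} at offset {pos}")
        kind, text, pos = m.lastgroup, m.group(), m.end()
        if kind == "WS":
            continue
        if kind in ("STAGO", "ETAGO"):
            in_tag = True            # switch to the in-tag patterns
            yield Token(kind)
        elif kind == "TAGC":
            in_tag = False           # back to character data
            yield Token(kind)
        elif kind == "EQ":
            yield Token(kind)
        elif kind == "STRING":
            yield Token(kind, text[1:-1])   # drop the surrounding quotes
        elif kind == "CDATA":
            yield Token(kind, text[9:-3])   # drop <![CDATA[ and ]]>
        else:                        # NAME, TEXT, ENTITY_REF
            yield Token(kind, text)
```

With this sketch, `[t.kind for t in lex('<Foo att1="a">')]` gives `['STAGO', 'NAME', 'NAME', 'EQ', 'STRING', 'TAGC']`. An &amp; inside a CDATA section stays part of the single CDATA token, while the same characters in ordinary content come out as an ENTITY_REF token, so a later stage can decide what to do with each without ever going back to text.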