From: Hans-Peter Diettrich
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?
Date: Thu, 10 Mar 2022 09:48:48 +0100
Organization: Compilers Central
Message-ID: <22-03-025@comp.compilers>
References: <22-03-019@comp.compilers>
Keywords: design

On 3/9/22 6:22 PM, Roger L Costello wrote:

> Okay, back to XML. Consider this non-well-formed XML:
> Harper&amp;Row
> (The end-tag is misspelled)
> The &amp; is called an "XML entity." An XML parser will convert it to &. The
> other XML entities are: &lt; ... &gt; ... &quot; ... &apos;
> What stage should the entity &amp; be converted to &?

In other languages digraphs and trigraphs are used as replacements for
special characters. All such character replacements are handled at the
beginning of the character input stage (lexer). In XML it could also be
handled by a preprocessor, to extend your stages:

0. Preprocessor
> 1. Lexical analysis stage
> 2. Syntax analysis stage
> 3. Semantic analysis stage

I prefer to describe/clarify the stages by their inputs and outputs:

A preprocessor inputs and outputs a stream of characters.
A Lexer reads a character stream and outputs a stream of terminal tokens.
A Parser accepts a stream of terminals, adds non-terminals from the grammar,
and outputs e.g. a tree structure.
Semantic analysis can be done during syntax analysis or later.

> What stage should detect that the start-tag does not have a
> matching end-tag?

As appropriate. What should be the consequence of that mismatch? It may
be a quite harmless typo that can be fixed by auto-correction. Or it may
indicate a missing closing tag, if it matches some previous opening tag.
Where in your implementation can you know enough about the possible
reasons for the mismatch?

Error handling and helpful error messages are a wide and stony field.
IMO it's up to the compiler writer to match the expectations of his
audience with such problems: warning, error, re-sync or abort
processing? Or you leave the handling to some user-controlled compiler
flags.

Don't take too seriously what you read about the one and only way to
classify or handle something. For XML (HTML...) you have a choice of DOM
or SAX parsing. Feel free to do it your way, after you have studied the
various approaches and pitfalls, and as long as you can be sure that the
results are correct and acceptable to your boss or users.

DoDi
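
To make the "inputs and outputs" view of the stages concrete, here is a
minimal sketch in Python. It is only one possible arrangement, not the
one and only way: entity replacement is done in the lexer, and the
start-tag/end-tag check is done in the parser. The names lex, parse,
Node and the element name Publisher are made up for illustration, and
the tiny tag syntax (no attributes, comments, or numeric character
references) is an assumption, far short of real XML.

```python
import re
from dataclasses import dataclass, field

# The five predefined XML entities and their replacement characters.
ENTITIES = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&apos;": "'"}

# Hypothetical mini-syntax: <name>, </name>, or a run of text without "<".
TOKEN_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>|([^<]+)")


def lex(chars):
    """Lexer: reads a character stream, yields (kind, value) terminal tokens."""
    pos = 0
    while pos < len(chars):
        m = TOKEN_RE.match(chars, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.group(2):
            yield ("END" if m.group(1) else "START", m.group(2))
        else:
            # Entities are replaced after the tag/text split, so a "<"
            # produced from &lt; stays text and is never seen as markup.
            # Unknown entity names are left untouched in this sketch.
            text = re.sub(r"&\w+;",
                          lambda e: ENTITIES.get(e.group(), e.group()),
                          m.group(3))
            yield ("TEXT", text)


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def parse(tokens):
    """Parser: accepts a token stream, returns a tree of Node objects."""
    root = Node("#document")
    stack = [root]  # currently open elements, innermost last
    for kind, value in tokens:
        if kind == "START":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "END":
            # The mismatch is detected here; what to do about it
            # (warn, auto-correct, re-sync, abort) is a policy decision.
            # This sketch simply aborts.
            if len(stack) == 1 or stack[-1].name != value:
                raise SyntaxError(f"end-tag </{value}> does not match "
                                  f"the open element {stack[-1].name!r}")
            stack.pop()
        else:
            stack[-1].children.append(value)
    if len(stack) > 1:
        raise SyntaxError(f"missing end-tag for <{stack[-1].name}>")
    return root
```

With this split, parse(lex("<Publisher>Harper&amp;Row</Publisher>"))
gives a Publisher node holding the text Harper&Row, while a misspelled
end-tag such as </Publishr> raises SyntaxError in the parser. A more
forgiving tool could replace that raise with a warning or a re-sync
step, depending on what its users expect.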