From: matt.timmermans@gmail.com
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?
Date: Sat, 12 Mar 2022 05:12:25 -0800 (PST)
Organization: Compilers Central
Message-ID: <22-03-029@comp.compilers>
References: <22-03-019@comp.compilers>
Keywords: parse, design
Posted-Date: 14 Mar 2022 11:37:39 EDT

On Wednesday, 9 March 2022 at 15:21:40 UTC-5, Roger L Costello wrote:
> [...]
> Okay, back to XML. Consider this non-well-formed XML:
> Harper&amp;Row
> (The end-tag is misspelled)
> The &amp; is called an "XML entity." An XML parser will convert it to &. The
> other XML entities are: &lt; ... &gt; ... &quot; ... &apos;
> What stage should the entity &amp; be converted to &?
>
> 1. Lexical analysis stage
> 2. Syntax analysis stage
> 3. Semantic analysis stage
> What stage should detect that the start-tag does not have a
> matching end-tag?

Other answers provide a discussion of how you make this decision in
general. Specifically for XML, though, these are practical questions.

Re Entities:

- you can't really recognize them in lexical analysis, because they
  aren't valid everywhere. [...] is not a valid tag, and has no entities
  in it. It can be convenient, though, for the lexer to capture them as
  ENTITY_REFERENCE tokens with their original text (like strings).
  Where they occur in CDATA sections, phase 3 can convert them back into
  their original text. Otherwise, the lexer should produce tokens like
  ENTITY_START, WORD_CHARS, ENTITY_END.

- Regardless of what the lexer produces, the syntax analysis phase
  ensures that entities only occur in valid locations, and produces a
  parse tree with enough information to determine how they're handled.
  This is where [...] is rejected as invalid.

- During semantic processing, entities are converted to whatever their
  appropriate final form is. They will be converted into the indicated
  characters in strings or element content, or replaced with their
  original text in CDATA sections.

Re: Tag Matching:

If you include tag matching in syntax, then the syntax is not
context-free and cannot be described with a context-free grammar... so
don't do that. Practically, only semantic analysis can match the end tag
with the preceding start tag. Unfortunately, that means that your parse
tree cannot have the element hierarchy embedded in it. Your grammar
cannot have an Element non-terminal.
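
To make the division of labour concrete, here is a rough Python sketch of
the approach described above. It is only an illustration under stated
assumptions, not a real XML parser: the token names other than
ENTITY_REFERENCE, the functions lex and analyze, and the element name
Publisher are all made up for the example, attributes are not handled,
and only the five predefined entities are known. The lexer emits a flat
stream of tokens that keep their original text; the semantic pass then
resolves entities (or keeps the raw text inside CDATA) and matches end
tags against a stack of open start tags.

```python
import re

# The five predefined XML entities.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Hypothetical token kinds. Every token keeps its original text.
TOKEN_RE = re.compile(
    r"(?P<CDATA_START><!\[CDATA\[)"
    r"|(?P<CDATA_END>\]\]>)"
    r"|(?P<END_TAG></[A-Za-z_][\w.-]*\s*>)"
    r"|(?P<START_TAG><[A-Za-z_][\w.-]*\s*>)"
    r"|(?P<ENTITY_REFERENCE>&[A-Za-z]+;)"
    r"|(?P<TEXT>[^<&\]]+)"
    r"|(?P<CHAR>.)",  # any leftover single character, e.g. a stray '<'
    re.DOTALL,
)


def lex(source):
    """Lexical phase: yield (kind, original_text) pairs, no nesting."""
    for match in TOKEN_RE.finditer(source):
        yield match.lastgroup, match.group()


def analyze(tokens):
    """Semantic phase: resolve entities and match end tags to start tags."""
    open_tags = []  # names of elements that are still open
    events = []     # ("open", name), ("text", string) or ("close", name)
    in_cdata = False
    for kind, raw in tokens:
        if in_cdata:
            if kind == "CDATA_END":
                in_cdata = False
            else:
                # Inside CDATA every token goes back to its original text.
                events.append(("text", raw))
        elif kind == "CDATA_START":
            in_cdata = True
        elif kind == "START_TAG":
            name = raw[1:-1].strip()
            open_tags.append(name)
            events.append(("open", name))
        elif kind == "END_TAG":
            name = raw[2:-1].strip()
            if not open_tags or open_tags[-1] != name:
                expected = open_tags[-1] if open_tags else None
                raise ValueError(
                    f"end tag </{name}> does not match open element {expected!r}"
                )
            open_tags.pop()
            events.append(("close", name))
        elif kind == "ENTITY_REFERENCE":
            name = raw[1:-1]
            if name not in PREDEFINED:
                raise ValueError(f"unknown entity {raw}")
            events.append(("text", PREDEFINED[name]))
        elif kind == "TEXT" or raw == "]":
            events.append(("text", raw))
        else:
            raise ValueError(f"unexpected {raw!r} outside CDATA")
    if in_cdata:
        raise ValueError("unterminated CDATA section")
    if open_tags:
        raise ValueError(f"unclosed element {open_tags[-1]!r}")
    return events


def text_of(events):
    """Concatenate the character data of an event list."""
    return "".join(value for kind, value in events if kind == "text")


if __name__ == "__main__":
    print(analyze(lex("<Publisher>Harper&amp;Row</Publisher>")))
```

Note that the lexer never rejects anything and never looks at nesting.
A misspelled end tag such as </Publishr> lexes as an ordinary END_TAG
token and is only caught when analyze compares it with the top of the
stack, which is the point made under Tag Matching.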