From: Roger L Costello
Newsgroups: comp.compilers
Subject: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?
Date: Wed, 9 Mar 2022 17:22:00 +0000
Organization: Compilers Central
Message-ID: <22-03-019@comp.compilers>
Keywords: parse, question

Hello Compiler Experts!

For learning purposes (and for fun) I want to build an XML parser. While XML is not a programming language and an XML parser is not a compiler, I think that an XML parser performs the same steps as the front end of a compiler.

I am reading a compiler book [1] and it says this:

---------------------------------------------------
The front end can be divided into lexical analyzer, syntax analyzer, and semantic analyzer.

The lexical analyzer, sometimes also called the scanner, carries out the simplest level of structural analysis. It will group the individual symbols of the source program text into their logical entities. Thus the sequence of characters 'W', 'H', 'I', 'L', and 'E' would be identified as the word 'WHILE' and the sequence of characters '1', '.', and '0' would be identified as the floating-point number 1.0.
The syntax analyzer, often also called the parser, analyzes the overall structure of the whole program, grouping the simple entities identified by the scanner into the larger constructs, such as statements, loops, and routines, that make up the complete program.

Once the structure of the program has been determined we can then analyze its meaning (or semantics). We can determine which variables are to hold integers, and which to hold floating point numbers, we can check that the size of all arrays is defined and so on.
---------------------------------------------------

Okay, back to XML. Consider this non-well-formed XML:

    Harper&amp;Row

(The end-tag is misspelled)

The &amp; is called an "XML entity." An XML parser will convert it to &. The other XML entities are:

    &lt; ... &gt; ... &quot; ... &apos;

At what stage should the entity &amp; be converted to &?

1. Lexical analysis stage
2. Syntax analysis stage
3. Semantic analysis stage

At what stage should the parser detect that the start-tag does not have a matching end-tag?

1. Lexical analysis stage
2. Syntax analysis stage
3. Semantic analysis stage

Some background information: The Flex manual shows an example [2] of a lexer that scans a string which is enclosed in quotes. For this input:

    "Hello\040World"

the lexical analyzer generates this token:

    Hello World

Notice that the octal escape ( \040 ) has been resolved to its character (the space character). That example leads me to conclude that a lexical analyzer is responsible for converting XML entities, e.g.,

    The lexical analyzer converts &amp; to &

However, the Flex manual showed that a lexer "could" resolve an octal escape, but the manual didn't say that the lexer "should" resolve such escapes, so I don't know if it is appropriate for the lexer to convert XML entities.

What are your thoughts on this?

/Roger

[1] "Introduction to Compiling Techniques" by J.P. Bennett
[2] See page 24, https://epaperpress.com/lexandyacc/download/flex.pdf
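
To make the two questions concrete, here is a minimal sketch of one possible division of labour, written in Python rather than Flex. It assumes the lexer resolves the five predefined entities while building text tokens (much as the Flex string example resolves \040), and that a later syntax-level pass checks that each end-tag matches its start-tag. The names tokenize, resolve, and check_nesting, the token kinds, and the simplified tag pattern (no attributes, comments, or CDATA) are all made up for illustration. This is not a claim about how a conforming XML parser must be structured.

```python
import re

# The five predefined XML entities and their replacement characters.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Simplified token pattern (an assumption): a bare start-tag or end-tag,
# or a run of character data containing no '<'.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)\s*>|([^<]+)")
ENTITY_RE = re.compile(r"&(\w+);")


def resolve(text):
    """Replace predefined entity references; leave unknown ones untouched."""
    return ENTITY_RE.sub(lambda m: PREDEFINED.get(m.group(1), m.group(0)), text)


def tokenize(source):
    """Lexical stage: yield (kind, value) tokens, resolving entities in text."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        slash, name, text = m.groups()
        if text is not None:
            yield ("TEXT", resolve(text))
        elif slash:
            yield ("END", name)
        else:
            yield ("START", name)
        pos = m.end()


def check_nesting(tokens):
    """Syntax stage: verify that every end-tag matches the open start-tag."""
    stack = []
    for kind, value in tokens:
        if kind == "START":
            stack.append(value)
        elif kind == "END":
            if not stack or stack.pop() != value:
                raise ValueError(f"end-tag </{value}> has no matching start-tag")
    if stack:
        raise ValueError(f"start-tag <{stack[-1]}> is never closed")
```

With a hypothetical element named a, list(tokenize("<a>Harper&amp;Row</a>")) produces a START token, a TEXT token whose value is Harper&Row, and an END token. Passing the tokens of "<a>Harper&amp;Row</b>" to check_nesting raises ValueError. In this sketch the lexer only ever sees one tag at a time, so the mismatch cannot be detected until the pass that tracks open elements.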