From: Roger L Costello <costello@mitre.org>
Newsgroups: comp.compilers
Subject: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?
Date: Wed, 9 Mar 2022 17:22:00 +0000
Organization: Compilers Central
Message-ID: <22-03-019@comp.compilers>
Keywords: parse, question
Posted-Date: 09 Mar 2022 15:21:37 EST

Hello Compiler Experts!
For learning purposes (and for fun) I want to build an XML parser.
While XML is not a programming language and an XML parser is not a
compiler, I think that an XML parser performs the same steps as the front end
of a compiler.
I am reading a compiler book [1] and it says this:

---------------------------------------------------
The front end can be divided into lexical analyzer, syntax analyzer, and
semantic analyzer. The lexical analyzer, sometimes also called the scanner,
carries out the simplest level of structural analysis. It will group the
individual symbols of the source program text into their logical entities.
Thus the sequence of characters 'W', 'H', 'I', 'L', and 'E' would be
identified as the word 'WHILE' and the sequence of characters '1', '.', and
'0' would be identified as the floating-point number 1.0.
The syntax analyzer, often also called the parser, analyzes the overall
structure of the whole program, grouping the simple entities identified by the
scanner into the larger constructs, such as statements, loops, and routines,
that make up the complete program.
Once the structure of the program has been determined we can then analyze its
meaning (or semantics). We can determine which variables are to hold integers,
and which to hold floating point numbers, we can check that the size of all
arrays is defined and so on.
---------------------------------------------------

Okay, back to XML. Consider this XML, which is not well-formed:
    <Publisher>Harper&amp;Row</Publsher>
(The end-tag is misspelled.)
The &amp; is a reference to one of XML's predefined entities. An XML parser
will convert it to &. The other predefined entity references are &lt;, &gt;,
&quot;, and &apos;.
At what stage should the entity reference &amp; be converted to &?

  1.  Lexical analysis stage
  2.  Syntax analysis stage
  3.  Semantic analysis stage

Which stage should detect that the <Publisher> start-tag does not have a
matching end-tag?

  1.  Lexical analysis stage
  2.  Syntax analysis stage
  3.  Semantic analysis stage
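
To make the second question concrete, here is a minimal Python sketch of the check itself, leaving open which stage should own it. It is only an illustration under simplifying assumptions: the names TAG and check_tags are made up for this example, and tags are assumed to have no attributes, comments, or self-closing forms. The check keeps a stack of open start-tags, so it needs memory of earlier input rather than just the current lexeme.

```python
import re

# Hypothetical pattern: matches <Name> and </Name> only.
# Attributes, comments, and self-closing tags are deliberately ignored.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")

def check_tags(text):
    """Return None if the tags nest properly, otherwise an error message."""
    stack = []  # names of start-tags still waiting for their end-tag
    for match in TAG.finditer(text):
        is_end = match.group(1) == "/"
        name = match.group(2)
        if not is_end:
            stack.append(name)
        elif not stack:
            return f"end-tag </{name}> has no start-tag"
        elif stack[-1] != name:
            return f"expected </{stack[-1]}> but found </{name}>"
        else:
            stack.pop()
    if stack:
        return f"start-tag <{stack[-1]}> is never closed"
    return None

print(check_tags("<Publisher>Harper&amp;Row</Publsher>"))
```

For the example above, the function reports that </Publisher> was expected but </Publsher> was found.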

Some background information: The Flex manual shows an example [2] of a lexer
that scans a string which is enclosed in quotes. For this input:
    "Hello\040World"
the lexical analyzer generates this token:
    Hello World
Notice that the octal escape sequence (\040) has been resolved to the
character it denotes (the space character). That example leads me to conclude
that a lexical analyzer is responsible for converting XML entities, e.g.,
    The lexical analyzer converts &amp; to &
However, the Flex manual shows only that a lexer "could" resolve an octal
escape; it doesn't say that a lexer "should" resolve such sequences, so I
don't know if it is appropriate for the lexer to convert XML entities. What
are your thoughts on this?
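
To show what option 1 would look like in practice, here is a minimal Python sketch of a scanner that resolves entity references while it builds tokens, in the same spirit as the Flex string example. It is one possible arrangement, not a claim about how XML parsers must be structured. The names PREDEFINED, TOKEN, and scan are made up for this illustration, and only the five predefined entities are handled (no character references, CDATA sections, or comments).

```python
import re

# The five predefined XML entities and the characters they stand for.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Hypothetical token pattern: a tag, an entity reference, or a run of plain text.
TOKEN = re.compile(r"(?P<tag><[^<>]*>)|&(?P<ent>\w+);|(?P<text>[^<&]+)")

def scan(source):
    """Return a list of (kind, value) tokens with entity references resolved."""
    tokens = []
    text = []  # pieces of the character-data token currently being built
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup == "tag":
            if text:  # a tag ends any pending run of character data
                tokens.append(("TEXT", "".join(text)))
                text = []
            tokens.append(("TAG", m.group("tag")))
        elif m.lastgroup == "ent":
            name = m.group("ent")
            if name not in PREDEFINED:
                raise ValueError(f"unknown entity &{name};")
            text.append(PREDEFINED[name])  # the conversion happens here
        else:
            text.append(m.group("text"))
        pos = m.end()
    if text:
        tokens.append(("TEXT", "".join(text)))
    return tokens

print(scan("<Publisher>Harper&amp;Row</Publsher>"))
```

With this arrangement the later stages see a single TEXT token whose value is Harper&Row. Note that the scanner happily tokenizes the misspelled end-tag; it has no notion of which start-tag is open.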
/Roger
[1] "Introduction to Compiling Techniques" by J.P. Bennett
[2] See page 24, https://epaperpress.com/lexandyacc/download/flex.pdf