From: Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?
Date: Thu, 10 Mar 2022 09:48:48 +0100
Organization: Compilers Central
Lines: 50
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-025@comp.compilers>
References: <22-03-019@comp.compilers>
Keywords: design
Posted-Date: 11 Mar 2022 14:48:00 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-019@comp.compilers>
Xref: csiph.com comp.compilers:2929

On 3/9/22 6:22 PM, Roger L Costello wrote:

> Okay, back to XML. Consider this non-well-formed XML:
> <Publisher>Harper&amp;Row</Publsher>
> (The end-tag is misspelled)
> The &amp; is called an "XML entity." An XML parser will convert it to &. The
> other XML entities are: &lt; ... &gt; ... &quot; ... &apos;
> What stage should the entity &amp; be converted to &?

In other languages digraphs and trigraphs are used as replacements for
special characters. All such character replacements are handled at the
beginning of the character input stage (lexer). In XML this could also
be handled by a preprocessor, which extends your stages:

      0.  Preprocessor
>    1.  Lexical analysis stage
>    2.  Syntax analysis stage
>    3.  Semantic analysis stage

I prefer to describe/clarify the stages by their inputs and outputs:

A preprocessor inputs and outputs a stream of characters.
A Lexer reads a character stream and outputs a stream of terminal tokens.
A Parser accepts a stream of terminals, adds non-terminals from the
grammar, and outputs e.g. a tree structure.
Semantic analysis can be done during syntax analysis or later.
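
A minimal sketch of the lexer stage described above: characters in,
terminal tokens out. It assumes the entity replacement is done inside the
lexer on text tokens only, so that a decoded "<" is never re-scanned as
markup. The names Token, lex and decode_entities are hypothetical, and
attributes, comments, CDATA and numeric character references are left out.

```python
import re
from typing import Iterator, NamedTuple

# The five predefined XML entities listed in the quoted question.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}


class Token(NamedTuple):
    kind: str   # "START", "END" or "TEXT"
    value: str  # element name, or decoded character data


# Either a bare start/end tag (no attributes in this sketch) or a text run.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>|([^<]+)")


def decode_entities(text: str) -> str:
    """Replace predefined entities in a single pass.

    Unknown names are left untouched here; a real parser would report them.
    """
    return re.sub(r"&(\w+);",
                  lambda m: ENTITIES.get(m.group(1), m.group(0)),
                  text)


def lex(chars: str) -> Iterator[Token]:
    """Turn a character stream into a stream of terminal tokens."""
    pos = 0
    while pos < len(chars):
        m = TOKEN_RE.match(chars, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.group(3) is not None:
            # Entities are resolved here, after the markup has been recognised.
            yield Token("TEXT", decode_entities(m.group(3)))
        else:
            yield Token("END" if m.group(1) else "START", m.group(2))
        pos = m.end()


tokens = list(lex("<Publisher>Harper&amp;Row</Publsher>"))
```

Note that the lexer happily emits an END token for the misspelled tag;
it has no idea which element is open, so the mismatch is left to a
later stage.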

> What stage should detect that the <Publisher> start-tag does not have a
> matching end-tag?

As appropriate <g>. What should be the consequence of that mismatch?
It may be a quite harmless typo that can be fixed by auto-correction.
Or it may indicate a missing closing tag, if it matches some previous
opening tag.
Where in your implementation can you know enough about possible reasons
for the mismatch? Error handling and helpful error messages are a wide
and stony field <sigh>.
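
One possible way to tell these cases apart, sketched under my own
assumptions: keep a stack of open element names in the syntax stage and
classify each bad end-tag when it arrives. The function name
check_nesting, the message texts and the 0.8 similarity threshold are
all hypothetical choices; tokens are plain (kind, name) pairs.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # assumed cut-off for "looks like a typo"


def check_nesting(tokens):
    """Return diagnostics for a sequence of (kind, name) tokens."""
    open_tags = []    # stack of currently open element names
    diagnostics = []
    for kind, name in tokens:
        if kind == "START":
            open_tags.append(name)
        elif kind == "END":
            if open_tags and open_tags[-1] == name:
                open_tags.pop()                      # properly nested
            elif name in open_tags:
                # Matches an earlier start-tag: the elements opened
                # after it were never closed.
                while open_tags[-1] != name:
                    diagnostics.append(
                        f"missing end-tag for <{open_tags.pop()}>")
                open_tags.pop()
            elif open_tags and SequenceMatcher(
                    None, open_tags[-1], name).ratio() >= SIMILARITY_THRESHOLD:
                # Close enough to the innermost open tag: treat as a typo
                # and re-sync by closing that element.
                diagnostics.append(
                    f"</{name}> looks like a typo for </{open_tags.pop()}>")
            else:
                diagnostics.append(f"unexpected end-tag </{name}>")
        # TEXT tokens do not affect nesting and are ignored here.
    # Anything still open at end of input was never closed.
    diagnostics.extend(
        f"missing end-tag for <{n}>" for n in reversed(open_tags))
    return diagnostics


report = check_nesting([
    ("START", "Publisher"),
    ("TEXT", "Harper&Row"),
    ("END", "Publsher"),
])
```

Whether such a diagnostic becomes a warning, an error or a silent fix
is a separate decision from detecting it.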

IMO it's up to the compiler writer to match the expectations of his
audience for such problems: warning, error, re-sync, or abort processing?
Or you can leave the handling to some user-controlled compiler flags.


Don't take too seriously what you read about the one and only way to
classify or handle something. For XML (HTML...) you have a choice of DOM
or SAX parsing. Feel free to do it your way, after you have studied the
various approaches and pitfalls, and as long as you can be sure that the
results are correct and acceptable to your boss or users.
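
To make the DOM/SAX choice concrete, here is a small sketch using the
Python standard library (xml.dom.minidom and xml.sax) on a well-formed
version of the example. The class name TextCollector is my own; SAX may
deliver character data in several chunks, so the handler joins them.

```python
import xml.sax
from xml.dom import minidom

DOC = "<Publisher>Harper&amp;Row</Publisher>"

# DOM style: the whole document is parsed into a tree, which is then
# inspected after parsing has finished.
root = minidom.parseString(DOC).documentElement
dom_text = "".join(node.data for node in root.childNodes)


# SAX style: the parser calls back for each event while it reads the
# input; no tree is built unless the handler builds one itself.
class TextCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []   # ("start" | "end", element name)
        self.chunks = []   # pieces of character data, entities resolved

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.chunks.append(content)


handler = TextCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_text = "".join(handler.chunks)
```

In both styles the application only ever sees "Harper&Row"; the entity
has already been resolved by the time the text is handed over. With the
misspelled end-tag from the question, both parsers refuse the document
with a parse error rather than attempting a repair.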

DoDi