Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: "matt.ti...@gmail.com" <matt.timmermans@gmail.com>
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Sun, 20 Mar 2022 07:32:14 -0700 (PDT)
Organization: Compilers Central
Lines: 62
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-044@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="70947"; mail-complaints-to="abuse@iecc.com"
Keywords: C, syntax
Posted-Date: 20 Mar 2022 15:10:23 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-032@comp.compilers>
Xref: csiph.com comp.compilers:2946

On Thursday, 17 March 2022 at 14:41:47 UTC-4, Roger L Costello wrote:
> For instance, as I understand it a C preprocessor goes through a C program
> and replaces macros. With this:

The C preprocessor is a completely different language.  It compiles (many
would say "transpiles" these days) a C-with-macros file into a
C-without-macros file.

C, therefore, is a composition of two languages.

I think this turned out to be a *very bad idea*, although at the time we
didn't know enough about what the preprocessor would be used for to put these
features in the real language.  Either way, you should not do anything like
this, and you should not think of what the lexer does, or what the parser
does, or what semantic analysis does, as any sort of "text replacement".

The output of the lexer is a token stream, and the output of the parser is an
AST, or some other intermediate representation that is richer than a token
stream.  Text needs to be parsed, so if any of these stages produces
intermediate text, you have to start again at lexing.

> Similarly, in XML if &amp; is embedded inside a CDATA section:
>
> <![CDATA[&amp;]]>
>
> then a preprocessor must not replace &amp; with &. That is, the preprocessor
> must have knowledge about the language: If an XML entity is within a CDATA
> section, then don’t replace it.

Yeah, a preprocessing stage (which would require its own lexer and parser) is
not useful.  Don't do it.  A lexer can produce a rich token like
ENTITY_REF("&amp;"), which has enough information in it to be useful in either
context.
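
To make that concrete, here is a minimal sketch in Python.  The names (Token,
lex, render) and the two-way "literal" switch are my own assumptions for
illustration, not part of any XML library, and only the five predefined XML
entities are handled.  The point is that the token keeps its source text, so
the consumer decides whether to resolve it.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str  # "ENTITY_REF" or "TEXT" (hypothetical token kinds)
    text: str  # the exact source text of the token


# An entity reference, or a run of other characters, or a stray "&".
_TOKEN_RE = re.compile(r"(?P<ENTITY_REF>&[A-Za-z][A-Za-z0-9]*;)|(?P<TEXT>[^&]+|&)")

# Assumption: only the predefined entities; a real processor also reads the DTD.
_PREDEFINED = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&apos;": "'"}


def lex(text: str) -> List[Token]:
    """Split character data into TEXT and ENTITY_REF tokens."""
    return [Token(m.lastgroup, m.group()) for m in _TOKEN_RE.finditer(text)]


def render(tokens: List[Token], literal: bool) -> str:
    """Resolve entity references, unless the context wants them literally."""
    out = []
    for tok in tokens:
        if tok.kind == "ENTITY_REF" and not literal:
            # Unknown entities are passed through unchanged in this sketch.
            out.append(_PREDEFINED.get(tok.text, tok.text))
        else:
            out.append(tok.text)
    return "".join(out)
```

With this, render(lex("a &amp; b"), literal=False) resolves the reference,
while literal=True gives back the original text untouched.  No stage ever
rewrites the input text and re-lexes it.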

> 2. How much knowledge of the language should the lexical analysis stage
> have?
> 3. How much knowledge of the language should the syntax analysis stage have?

As much as you need. Don't worry about it.  All of these things are made with
deep knowledge of the language being parsed.  The lexer is simple and fast and
divides the text into labelled atomic units.  Use whatever division is
convenient.  The only real restriction is that the lexer should not produce
a token that isn't atomic (one that may need to be subdivided later), or that
requires a rich internal structure (because you don't want to run token text
through another parser), or that can't be recognized with a regular language
(because that's how lexers work).

> Should the lexical analysis stage know that the foo in <foo> is a
> start tag (STAG) and the foo in </foo> is an end tag (ETAG)? That
> would mean the lexical analysis stage has considerable knowledge of
> the XML language. Or should the lexical analysis stage simply identify
> the foo in <foo> as a name (NAME) and the foo in </foo> as a name
> (NAME)?

Neither.  Start tags can have a rich internal structure like <foo att1="a"
att2="b">.  To provide access to that structure, the lexer must produce
smaller tokens.  Usually a lexer would translate "<Foo>" into something like
"STAGO, NAME(Foo), TAGC".  When you have attributes, you get something like
"STAGO, NAME(Foo), NAME(att1), EQ, STRING(a), TAGC".
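
A rough sketch of such a lexer in Python, for the inside of a tag only.  The
function name lex_tag and the exact regular expressions are assumptions made
for illustration (real XML names and attribute values allow more than this),
and ETAGO for "</" is named by analogy with STAGO.

```python
import re
from typing import Iterator, Tuple

Token = Tuple[str, str]  # (kind, text)

# Every token is a small regular expression; "</" is tried before "<".
_TAG_TOKEN_RE = re.compile(r"""
    (?P<ETAGO>   </ )
  | (?P<STAGO>   <  )
  | (?P<TAGC>    >  )
  | (?P<EQ>      =  )
  | (?P<STRING>  "[^"<]*" | '[^'<]*' )
  | (?P<NAME>    [A-Za-z_:][A-Za-z0-9_:.\-]* )
  | (?P<WS>      \s+ )
""", re.VERBOSE)


def lex_tag(text: str) -> Iterator[Token]:
    """Yield (kind, text) tokens for one tag, skipping whitespace."""
    pos = 0
    while pos < len(text):
        m = _TAG_TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "WS":
            continue  # whitespace separates tokens but is not one itself
        value = m.group()
        if kind == "STRING":
            value = value[1:-1]  # drop the surrounding quotes
        yield kind, value
```

Running list(lex_tag('<Foo att1="a">')) gives the STAGO, NAME, NAME, EQ,
STRING, TAGC sequence described above, with each token carrying its text.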

End tags *could* be recognized by the lexer as single tokens, but usually
they aren't.  Programmers handle them in a way similar to start tags, just
for consistency.
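
For illustration, here is how a parser might consume both kinds of tag from
the same small tokens.  This is only a sketch under my own assumptions: the
function parse_tag, the ETAGO token for "</", and the tuple results are all
hypothetical, and the token lists are written out by hand rather than
produced by a real lexer.

```python
from typing import Dict, List, Tuple, Union

Token = Tuple[str, str]                      # (kind, text)
StartTag = Tuple[str, str, Dict[str, str]]   # ("start", name, attributes)
EndTag = Tuple[str, str]                     # ("end", name)

_EOF: Token = ("EOF", "")


def parse_tag(tokens: List[Token]) -> Union[StartTag, EndTag]:
    """Parse the tokens of one tag; start and end tags share the same path."""
    it = iter(tokens)

    def expect(kind: str) -> str:
        got, text = next(it, _EOF)
        if got != kind:
            raise SyntaxError(f"expected {kind}, got {got}")
        return text

    opener, _ = next(it, _EOF)
    if opener not in ("STAGO", "ETAGO"):
        raise SyntaxError(f"expected STAGO or ETAGO, got {opener}")
    name = expect("NAME")

    if opener == "ETAGO":
        expect("TAGC")           # an end tag is just a name and a close
        return ("end", name)

    attrs: Dict[str, str] = {}
    while True:
        kind, text = next(it, _EOF)
        if kind == "TAGC":
            return ("start", name, attrs)
        if kind != "NAME":
            raise SyntaxError(f"expected NAME or TAGC, got {kind}")
        expect("EQ")
        attrs[text] = expect("STRING")
```

Given [("ETAGO", "</"), ("NAME", "Foo"), ("TAGC", ">")] this returns
("end", "Foo"), and the start-tag case differs only in the attribute loop.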