Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Fri, 18 Mar 2022 17:50:22 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 194
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-037@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="29574"; mail-complaints-to="abuse@iecc.com"
Keywords: design
Posted-Date: 18 Mar 2022 14:00:37 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2939

On 2022-03-15, Roger L Costello <costello@mitre.org> wrote:
> But if PI is inside a quoted string:
>
> "Today is PI day"
>
> then the preprocessor does not replace PI.

This was not historically true, though. I know for a fact that ancient
versions of the C preprocessor would have replaced PI in character
constants, as in 'PI'. Possibly, some versions of it also didn't
care about quotes.

> 1. How much knowledge of the language should the preprocessor stage have?

A preprocessor doesn't have to have any knowledge of the language
being preprocessed.

If the preprocessor is universal, like m4, that is practically a given.

If a preprocessor is dedicated to a language, like the C one,
it's a good idea for it to work at the token level: tokenize its
input the same way, or at least very similarly, to the target language.

In C, a preprocessor token is a kind of superset of a proper token.
There are pp-number tokens that are not number tokens, for instance,
but not vice versa.

> 2. How much knowledge of the language should the lexical analysis stage have?
>
> 3. How much knowledge of the language should the syntax analysis stage have?

The traditional lexical/syntax division comes from a blueprint for
programming language processing laid down in the work of computer
scientists done in the 1960's and 1970's.

It is not necessary; a context-free grammar can handle the entire syntax
down to the token level, like a decimal integer being a sequence of
digits and whatnot.

The division into lexical analysis and parsing comes from several
pragmatic observations:

1.  There are dedicated algorithms that are good for tokenizing.
    Tokens exhibit regular syntax: they can be recognized using
    finite automata (regular expressions).

2.  There are efficient parsing techniques like LALR(1) that assume
    that there is one symbol of lookahead. If you use LALR(1)
    with a scanner, you get one *token* of lookahead.
    If you use LALR(1) without a scanner (token == character)
    then you get one *character* of lookahead, which is going to
    be crippling. Thus, separating lexical analysis effectively
    amplifies LALR(1) into LALR(k), for some k related to maximum
    token character length.

E.g. if you design a C parser without a scanner, over the character
level syntax, when that parser sees the character "do", it
has no idea whether it's looking at the "do" keyword, the
start of the "double" keyword, or something else like "dongle".
One character of lookahead will confirm or eliminate "do",
but not resolve "double" versus "dongle".

So to answer the questions, if you're assuming that you're going to be
using the traditional framework, with a regular-expression-driven
lexer and a LR parser with 1 token of lookahead, the way you divide
the work, roughly, is by identifying the token-like elements in the
language that have regular syntax. Anything that exhibits nesting,
requiring rule recursion, will be farmed off to the parser.  Your
decision could sometimes also be informed by the lookahead concern;
you could choose to clump together several items that might plausibly
be individual tokens into a "supertoken" if there is some parsing
advantage in it, or other simplification.

Other kinds of decisions interact with the language definition.
For instance, what is -1234?  In Common Lisp, and most other Lisp
dialect, I suspect, that is a token denoting a negative integer.
It may not be written - 1234.  In C, and languages imitating its syntax,
-1234 is a unary expression: the unary - operator applied to the
integer 1234, and so there are two tokens. They may be separated
by whitespace, or a comment.

This strikes at the language definition, becuase what it implies is
that C does not have negative constants.

And that causes a subtle problems.

Say the int type is 32 bits wide.

The most negative 32 bit two's complement integer is -2147483648.

But in C this is actually the negation of 2147483648 which does
not fit into the type int. The resulting constant expression will
not have type int.

Defining the INT_MIN constant in <limits.h> such that
the expression has type int, and doesn't contain any casts
requires something like:

  #define INT_MIN (-2147483647 - 1)

> 4. How much knowledge of the language should the semantic analysis stage have?

All knowledge that the previous stages do not have, minus
knowledge that only the application domain has. :)

>
> To make the questions concrete, consider this XML:
>
><foo>…</foo>
>
> Should the lexical analysis stage know that the foo in <foo> is a
> start tag (STAG) and the foo in </foo> is an end tag (ETAG)?

XML has structure inside tags, namely attributes. Various implementation
choices present themselves.

One possibility is to recognize the special cases <abc>, </abc>
and <abc/> as a special token categories, e.g

 ELEM_START_TRIVIAL
 ELEM_END
 ELEM_COMPLETE

All three token categories will carry a semantic attribute on the token
which gives the identity of the element, e.g "abc".

Then the case like <abc foo="bar">

The "<abc" part could produce a ELEM_START_OPEN token (distinct from
ELEM_START_TRIVIAL).  After this token, zero or or more attributes
are expected, followed by ELEM_START_CLOSE produced by the angle
bracket.

The parser has to recognize that ELEM_START_OPEN has been seen,
and then parse the attribute tokens, looking for ELEM_START_CLOSE.

Note that ELEM_START_TRIVIAL is then not necessary, becuase
<a> could produce ELEM_START_OPEN followed immediately by
ELEM_START_CLOSE.

> Or should the lexical analysis stage simply identify
> the foo in <foo> as a name (NAME) and the foo in </foo> as a name
> (NAME)? That would mean the lexical analysis stage has lesser
> knowledge of the XML language. How much knowledge of the XML language
> should the lexical analysis stage have?

I would say that because the attributes have special syntax with
quoting like foo="bar", it would be better for the lexer to be aware
of the tag structure, so that it can switch to a mode for precisely
handling attributes.

Consider <abc foo="bar">foo="bar"</a>.  Do you want to tokenize the
attribute foo="bar" in the same way as the interior of the element?

Chances are you want to turn the entire content of the <a> element,
into a single text datum, if it doesn't contain any sub-elements.
The quotes and equal sign are not special in any way.

Also, attributes can have angle brackets; so that is to say,
I think <abc foo="<abc>"> is valid.

If the tokenizer is oblivious to all this, it will lead to ugly
complexity in the parser.

As a final general remark, languages are often defined at multiple
description levels which can be identified as lexical and syntactic.
If you're implementing, it behooves you to follow those separations
in the spec, except when your own experience tells you to adjust the
interpretation here and there; like you recognize that some
syntactic aspect is nicely handled by your scanner or vice versa.

The W3C's definition of XML uses one big grammar, without regard
for separation into lexing and parsing. However, you can recognize
regular elements in it. Consider this grammar rule:

   Comment   ::=    '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

where:

   Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
                  [#x10000-#x10FFFF] /* any Unicode character,
                                        excluding the surrogate blocks,
                                        FFFE, and FFFF. *

Note that this is entirely regular: we can convert the Comment match
entirely into the rules for a lexical analyzer, and therefore
Comment can be a token. So that would be the main principle here:
scan the tree structur of the XML grammar and look for the tallest
subtrees that are entirely regular, considering those to be token
candidates. Or something like that.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal