Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: Learning only one lexer made me blind to its hidden assumptions
Date: Fri, 15 Jul 2022 14:16:59 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 137
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-022@comp.compilers>
References: <22-07-006@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="72062"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 15 Jul 2022 12:26:45 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3121

On 2022-07-07, Roger L Costello <costello@mitre.org> wrote:
> Hi Folks,
>
> For months I have been immersed in learning and using Flex. Great fun indeed.
>
> But recently I have been reading a book, Crafting a Compiler with C, and
> reading its chapter on lexers. The chapter describes two lexer-generators:
> ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden
> assumptions in Lex/Flex. Without learning ScanGen I would have continued to
> think that the way things are done in Lex/Flex way is the only way.

It seems hard to find information about ScanGen. I found someone's
slides claim that it's something written by a Gary Sevitsky in 1979,
and then worked on over the next couple of years by Robert Gray
and Charles Fischer.

By "learning ScanGen" do you mean that you have an implementation
and have created some scanners?

> Below I have documented some of the differences between Lex/Flex and ScanGen.
>
> Difference:
> - Flex allows overlapping regexes. It is up to Flex to use the 'correct'
> regex. Flex has rules for picking the correct one: longest match wins, regex
> listed first wins.
> - ScanGen does not allow overlapping regexes. Instead, you create one regex
> and then, if needed, you create "Except" clauses. E.g., the token is an
> Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'
>
> Difference:
> - Flex regexes use juxtaposition for specifying concatenation.
> - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls

That's verbose. Juxtaposition is used for catenation so that you can
simply use "foo" if you want to match the text "foo".

It behooves pattern matching notations to look like the problem space;
i.e. that expression which just match single terms rather than sets
just look like the terms they are matching.

> it 'catenation' not 'concatenation'

Careful speaker and writers of the English language avoid the
"concatenate" pleonasm.

"catena" is Latin for "chain" (e.g. "catenary curve: the curve formed by
a chain supported horizontally at its endpoints, under gravity"). To
catenate literally means to chain; the "con" semantics is implicit,
since chaining is always "together".

The Unix command is called "cat" not "concat"; the function is "strcat"
not "strconcat".

If you could think of a way to chain things apart, that could be
grounds for a functional prefix.  For instance, chaining a pair of
agrressive dogs to opposite walls of a yard so they cannot get at each
other could plausibly be called "discatenation". :)

> Difference:
> - Flex regexes use | for specifying alteration in regexes
> - ScanGen uses ',' to specify alternation

That's just gratuitously weird.  Commas are widely used for separating
groups in sequences: domain names, object member access,
date.month.year, integer.fraction, filename.suffix, ...

The | character is used in BNF.

I can see that in 1979, stringent character set limitations would still
have to be taken into account when developing something; there would
still be reasons to stick to a six bit character set.

> Difference:
> - With Flex, tossing out characters (e.g., toss out the quotes surrounding a
> string) may involve writing C code to reprocess the token
> - ScanGen has a 'Toss' command to toss out a character, e.g, Quote(Toss). No
> token reprocessing needed

That C code is just:

   char *interior = yytext + 1;
   yytext[yyleng - 1] = 0;

You could write a library a function like this, which you can reuse in
future lex jobs:

  /* chop "front" characters from front, "back" characters
     from back, of yytext, and return pointer to chopped
     lexeme. */

  char *yytrim(int front, int back)
  {
    assert (front >= 0);
    assert (back >= 0);
    assert (front <= yyleng);
    assert (back <= yyleng);

    {
      char *f = yytext + front;
      char *b = yytext + yyleng - back;
      if (b < f)
        b = f;
      *b = 0;
      return f;
    }
  }

If the string literal syntax supports the escaping of quotes, and other
features, stripping the quotes becomes insufficient.

There are plenty of situations in which it is handy to strip something
from either end of a lexeme, though; the above function would be useful
in the "-ll" lex library.

> Difference:
> - Flex deals with individual characters
> - ScanGen lumps characters into character classes and deals with classes. Use
> of character classes decreases (quite significantly) the size of the
> transition table

Obviously, Lex has explicit character classes, like [cxb]: the class
containing 'c', 'x' and 'b'.

Even without explicit classes in the syntax, the DFA subset constrution
implicitly identifies character classes. A regex like (c|x|b) should not
generate more states than [cxb].

Subset construction will squeeze out things like, say (adcdx|abcdy),
which should produce the same DFA as a[bd]cd[xy].

A state in a DFA can have many transitions leading out of it, triggered
by different input symbols: that constitutes a character class.
This class is essentially discovered by the algorithm.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal