Learning only one lexer made me blind to its hidden assumptions

From	Roger L Costello <costello@mitre.org>
Newsgroups	comp.compilers
Subject	Learning only one lexer made me blind to its hidden assumptions
Date	2022-07-07 17:49 +0000
Organization	Compilers Central
Message-ID	<22-07-006@comp.compilers> (permalink)

Show all headers | View raw

Hi Folks,

For months I have been immersed in learning and using Flex. Great fun indeed.

But recently I have been reading a book, Crafting a Compiler with C, and
reading its chapter on lexers. The chapter describes two lexer-generators:
ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden
assumptions in Lex/Flex. Without learning ScanGen I would have continued to
think that the way things are done in Lex/Flex way is the only way.

Below I have documented some of the differences between Lex/Flex and ScanGen.

Difference:
- Flex allows overlapping regexes. It is up to Flex to use the 'correct'
regex. Flex has rules for picking the correct one: longest match wins, regex
listed first wins.
- ScanGen does not allow overlapping regexes. Instead, you create one regex
and then, if needed, you create "Except" clauses. E.g., the token is an
Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'

Difference:
- Flex regexes use juxtaposition for specifying concatenation.
- ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls
it 'catenation' not 'concatenation'

Difference:
- Flex regexes use | for specifying alteration in regexes
- ScanGen uses ',' to specify alternation

Difference:
- With Flex, tossing out characters (e.g., toss out the quotes surrounding a
string) may involve writing C code to reprocess the token
- ScanGen has a 'Toss' command to toss out a character, e.g, Quote(Toss). No
token reprocessing needed

Difference:
Flex regexes use ^ for specifying 'not', e.g., [^ab] means any char except a
and b
ScanGen regexes uses 'Not', e.g., Not(Quote)

Difference:
- Flex deals with individual characters
- ScanGen lumps characters into character classes and deals with classes. Use
of character classes decreases (quite significantly) the size of the
transition table

Difference:
- Flex regexes use the ? meta-symbol
- ScanGen doesn't have that. Instead, it has 'Epsilon'

Difference:
- ScanGen has something called a Major number and a Minor number for each
token
- Flex doesn't have that concept
[For the same reason, I don't think it's a good idea to learn only one programming langage. -John]

Back to comp.compilers | Previous | Next — Next in thread | Find similar

Thread

Learning only one lexer made me blind to its hidden assumptions Roger L Costello <costello@mitre.org> - 2022-07-07 17:49 +0000
  Re: Learning only one lexer made me blind to its hidden assumptions luser droog <luser.droog@gmail.com> - 2022-07-12 19:49 -0700
    Re: Learning only one lexer made me blind to its hidden assumptions Juan Miguel Vilar Torres <jvilar@uji.es> - 2022-07-13 01:46 -0700
  Re: Learning only one lexer made me blind to its hidden assumptions "Ev. Drikos" <drikosev@gmail.com> - 2022-07-13 14:58 +0300
  Re: Learning only one lexer made me blind to its hidden assumptions antispam@math.uni.wroc.pl - 2022-07-13 19:52 +0000
    Re: Learning only one lexer made me blind to its hidden assumptions George Neuner <gneuner2@comcast.net> - 2022-07-14 16:46 -0400
      Re: Learning only one lexer made me blind to its hidden assumptions antispam@math.uni.wroc.pl - 2022-07-15 20:14 +0000
  Re: Learning only one lexer made me blind to its hidden assumptions Kaz Kylheku <480-992-1380@kylheku.com> - 2022-07-15 14:16 +0000

csiph-web