From: Roger L Costello
Newsgroups: comp.compilers
Subject: Learning only one lexer made me blind to its hidden assumptions
Date: Thu, 7 Jul 2022 17:49:44 +0000
Organization: Compilers Central
Message-ID: <22-07-006@comp.compilers>
Keywords: lex, question, comment

Hi Folks,

For months I have been immersed in learning and using Flex. Great fun indeed.

But recently I have been reading the chapter on lexers in the book Crafting a
Compiler with C. The chapter describes two lexer-generators: ScanGen and Lex.

Oh my!

Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without
learning ScanGen I would have continued to think that the way things are done
in Lex/Flex is the only way.

Below I have documented some of the differences between Lex/Flex and ScanGen.

Difference:
- Flex allows overlapping regexes. It is up to Flex to use the 'correct'
  regex. Flex has rules for picking the correct one: longest match wins, and
  among matches of equal length the regex listed first wins.
- ScanGen does not allow overlapping regexes. Instead, you create one regex
  and then, if needed, you create "Except" clauses. E.g., the token is an
  Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'.

Difference:
- Flex regexes use juxtaposition for specifying concatenation.
- ScanGen uses '.' to specify concatenation.
And oh by the way, ScanGen calls it 'catenation', not 'concatenation'.

Difference:
- Flex regexes use | for specifying alternation.
- ScanGen uses ',' to specify alternation.

Difference:
- With Flex, tossing out characters (e.g., tossing out the quotes surrounding
  a string) may involve writing C code to reprocess the token.
- ScanGen has a 'Toss' command to toss out a character, e.g., Quote(Toss). No
  token reprocessing is needed.

Difference:
- Flex regexes use ^ for specifying 'not', e.g., [^ab] means any char except
  a and b.
- ScanGen regexes use 'Not', e.g., Not(Quote).

Difference:
- Flex deals with individual characters.
- ScanGen lumps characters into character classes and deals with classes. Use
  of character classes decreases (quite significantly) the size of the
  transition table.

Difference:
- Flex regexes use the ? meta-symbol.
- ScanGen doesn't have that. Instead, it has 'Epsilon'.

Difference:
- ScanGen has something called a Major number and a Minor number for each
  token.
- Flex doesn't have that concept.

[For the same reason, I don't think it's a good idea to learn only one
programming language. -John]