Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Kaz Kylheku <480-992-1380@kylheku.com> Newsgroups: comp.compilers Subject: Re: Learning only one lexer made me blind to its hidden assumptions Date: Fri, 15 Jul 2022 14:16:59 -0000 (UTC) Organization: A noiseless patient Spider Lines: 137 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-07-022@comp.compilers> References: <22-07-006@comp.compilers> Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="72062"; mail-complaints-to="abuse@iecc.com" Keywords: lex Posted-Date: 15 Jul 2022 12:26:45 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3121 On 2022-07-07, Roger L Costello wrote: > Hi Folks, > > For months I have been immersed in learning and using Flex. Great fun indeed. > > But recently I have been reading a book, Crafting a Compiler with C, and > reading its chapter on lexers. The chapter describes two lexer-generators: > ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden > assumptions in Lex/Flex. Without learning ScanGen I would have continued to > think that the way things are done in Lex/Flex way is the only way. It seems hard to find information about ScanGen. I found someone's slides claim that it's something written by a Gary Sevitsky in 1979, and then worked on over the next couple of years by Robert Gray and Charles Fischer. By "learning ScanGen" do you mean that you have an implementation and have created some scanners? > Below I have documented some of the differences between Lex/Flex and ScanGen. > > Difference: > - Flex allows overlapping regexes. It is up to Flex to use the 'correct' > regex. Flex has rules for picking the correct one: longest match wins, regex > listed first wins. > - ScanGen does not allow overlapping regexes. Instead, you create one regex > and then, if needed, you create "Except" clauses. E.g., the token is an > Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write' > > Difference: > - Flex regexes use juxtaposition for specifying concatenation. > - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls That's verbose. Juxtaposition is used for catenation so that you can simply use "foo" if you want to match the text "foo". It behooves pattern matching notations to look like the problem space; i.e. that expression which just match single terms rather than sets just look like the terms they are matching. > it 'catenation' not 'concatenation' Careful speaker and writers of the English language avoid the "concatenate" pleonasm. "catena" is Latin for "chain" (e.g. "catenary curve: the curve formed by a chain supported horizontally at its endpoints, under gravity"). To catenate literally means to chain; the "con" semantics is implicit, since chaining is always "together". The Unix command is called "cat" not "concat"; the function is "strcat" not "strconcat". If you could think of a way to chain things apart, that could be grounds for a functional prefix. For instance, chaining a pair of agrressive dogs to opposite walls of a yard so they cannot get at each other could plausibly be called "discatenation". :) > Difference: > - Flex regexes use | for specifying alteration in regexes > - ScanGen uses ',' to specify alternation That's just gratuitously weird. Commas are widely used for separating groups in sequences: domain names, object member access, date.month.year, integer.fraction, filename.suffix, ... The | character is used in BNF. I can see that in 1979, stringent character set limitations would still have to be taken into account when developing something; there would still be reasons to stick to a six bit character set. > Difference: > - With Flex, tossing out characters (e.g., toss out the quotes surrounding a > string) may involve writing C code to reprocess the token > - ScanGen has a 'Toss' command to toss out a character, e.g, Quote(Toss). No > token reprocessing needed That C code is just: char *interior = yytext + 1; yytext[yyleng - 1] = 0; You could write a library a function like this, which you can reuse in future lex jobs: /* chop "front" characters from front, "back" characters from back, of yytext, and return pointer to chopped lexeme. */ char *yytrim(int front, int back) { assert (front >= 0); assert (back >= 0); assert (front <= yyleng); assert (back <= yyleng); { char *f = yytext + front; char *b = yytext + yyleng - back; if (b < f) b = f; *b = 0; return f; } } If the string literal syntax supports the escaping of quotes, and other features, stripping the quotes becomes insufficient. There are plenty of situations in which it is handy to strip something from either end of a lexeme, though; the above function would be useful in the "-ll" lex library. > Difference: > - Flex deals with individual characters > - ScanGen lumps characters into character classes and deals with classes. Use > of character classes decreases (quite significantly) the size of the > transition table Obviously, Lex has explicit character classes, like [cxb]: the class containing 'c', 'x' and 'b'. Even without explicit classes in the syntax, the DFA subset constrution implicitly identifies character classes. A regex like (c|x|b) should not generate more states than [cxb]. Subset construction will squeeze out things like, say (adcdx|abcdy), which should produce the same DFA as a[bd]cd[xy]. A state in a DFA can have many transitions leading out of it, triggered by different input symbols: that constitutes a character class. This class is essentially discovered by the algorithm. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal