Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: State-of-the-art algorithms for lexical analysis? Date: Mon, 6 Jun 2022 21:16:26 +0300 Organization: Compilers Central Lines: 62 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-06-013@comp.compilers> References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="78500"; mail-complaints-to="abuse@iecc.com" Keywords: lex, performance Posted-Date: 06 Jun 2022 16:01:45 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3051 As our moderator wisely states: > Regular expressions have the advantage that once you've paid the one-time cost > of making a DFA, the matching is extremely fast. Yhe lexer is usually > one of the slowest parts of a compiler since it is the only part that has to > look at each character of the source program, so this is a place where speed > matters. And, for most cases they really are sufficient, and it really behooves one to stay within those limits. Why? Because when you get a syntax error at the lexical level, which is surprisingly frequent unless you never mistype closing quotes, you get whole sections of your code misparsed and rarely does the compiler error correction help much. Other single character errors , not . missing or extra ( { [ or ] } ) or ; have similar disastrous effects on program meaning, often not detected until much later. And, as I mentioned before, having the lexer be simply a scanner and putting any extra semantics into a separate screener (per Frank Deremer's recommendation) makes it all much simpler. You end up with small state machines with very few states that easily fit in even small machine caches or can be turned into circuitry, FPGAs or ASICs that use minimal numbers of gates. Those things can often run as fast as you can read the text in. And the screener being much less frequent can do more complex things without imposing a significant penalty. The screener is essentially running at parser speed and only looking at "long" tokens not single (or double) character ones. And sadly, you cannot go very much faster. Too often the transitions occur at single character boundaries. One is lucky when it is a two-character sequence and longer sequences terminating a token are rare enough to be in the measurement noise. I know because I tried to adapt the Boyer-Moore ideas once (skip and reverse) and found that they were essentially ineffective for tokenization. They might apply occasionally in parsing, but that's not as much of a performance hog. Unless you are interested in dealing with nested comments or something similar, you don't need a stack in your lexer and so no reason to do LL or LR parsing. (Yes, we extended our Yacc++ lexer to do LR parsing but with special casing so that the stack cost was only there if you had recursive productions and only tracked the start of the recursive production so that you were staying in DFA mode essentially all the time. And, while that helped us in a few cases, it isn't something I would say was important nor recommend.) The only place I might have found it interesting is if we made it recognize tokens inside of strings or comments for use in error correction to help with the missing close character cases. That might have made it worthwhile. But that would probably have needed to be done only in the presence of syntax errors with a string or comment in the recent context. In fact, there is only thing that I have not seen a DFA lexer do that I think is worth doing at the lexical level (and not via a screener). That is recognizing tokens the start with a length prefix, e.g. 10Habcdefhij. Such tokens are common in things like network protocols and they would be relatively easy to implement, but I've not seen it done. Beyond that it is my relatively firm belief that one should almost always have only simple regular expressions, e.g. that the one for floating point numbers should be one of the most complex ones. Otherwise you are trying to do too much in the scanner. And you are asking for trouble when you do. Kind regards, Chris