Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Re: Please provide a learning path for mastering lexical analysis languages Date: Sun, 22 May 2022 12:12:32 +0300 Organization: Compilers Central Lines: 169 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-05-043@comp.compilers> References: <22-05-010@comp.compilers> <22-05-023@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="48522"; mail-complaints-to="abuse@iecc.com" Keywords: lex, design Posted-Date: 22 May 2022 13:00:55 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3016 My apologies for this delayed reply. The startup I am working for is nearing the alpha release and I had things on my plate that I needed to give priority. Let me start my answer with two simple rules of thumb, in order of importance. 1) Keep your lexer simple. Follow Einstein's famous suggestion (on a slightly different topic). Make it as simple as possible, but no simpler. 2) Respect the lexer/parser boundary. Try to avoid mixing the two phases. That will help with item 1. And to that regard, don't look for a more powerful lexing mechanism when the job should be done in the parser. The exception is when it makes the combination simpler (i.e. you make your lexer just a little more complicated, but it makes the parser much simpler (or vice versa). So, I wouldn't go looking for lots of ways to make the lexer more powerful. Don't go looking for a flame thrower to light your barbecue grill. If you are seriously interested in regular expressions, I would recommend Friedl's book: Mastering Regular Expressions. That talks about nearly all the variations at a detailed level. But realize that the PCRE extensions (e.g. backreferences and zero-width assertions) are not generally available in lexical analysis tools. Those mechanisms are often overkill. In Snort, someone used them for matching quotes around expressions. Simpler would have been to write two rules, one for matched 's and one for matched "s. It probably would have run faster too. But someone saw that Snort rules allowed PCRE expressions and thought they would be clever. That goes to point 1. Write simple rules for each token. And, it is often better to have more rules that all generate the same token than to try and make one complicated regular expression that somehow combines them all. And, if you find yourself writing a complicated regular expression or a list of more than 5 different rules for the same token. Think twice about the problem you are solving. You might be doing the wrong thing at the wrong level. (Worth noting here that this point is true for language design. If your language design starts forcing you to invent new technologies or use complicated existing ones, ask if you are actually making the language simple to use, often it's an indication of a mistake in your language design. Now, if you are implementing an existing language, you may have no choice, but realize that the original designer may have made a mistake and you are just dealing with it.) By the way, don't be afraid to borrow or steal good ideas and reuse them. I have essentially written one lexer in Yacc++ and use it for all languages with only minor tweaks. Now, I'm not trying to parse Fortran 66 with it, but I have used it for Yacc++ itself, C/C++/C# dialects, Verilog, Pascal, Javascript, BASIC, and lots of tiny tools. I will likely use it for Rust with only the smallest of tweaks. I reuse that lexer for nearly everything. The structure of that lexer is quite simple. There are only a small set of rules that are actual regular expressions: identifiers, numbers (integers and floats distinguished), strings, whitespace, and comments. Punctuation is done by simple rules listing the legal "operators". Note, if I had operators like "&" I would make a regular expression for that in the first set. Note that keywords I handle separately. This is based upon advice Frank Deremer once gave. The lexer should be a scanner (regular expression part) and a screener (looking for keywords and the likes). Those are two separate tasks and splitting them often makes both simpler. (In fact, it's one of my few gripes about ANTLR4, as it doesn't make doing that easy. We built that model, split scanner/screener, into Yacc++ and it was one of our better choices. You can make a more powerful screener without making the scanner more complex and vice-versa. That goes back to the idea of making things simple by doing only one thing at a time.) -------- Now, since I know you are working on HTML or something similar, I will give one piece of advice where you want to use something a little more powerful. In markdown type languages, you often have two distinct portions, the markdown which is a little language that you want to tokenize and the rest of the text which you don't want to tokenize (or at least not the same way). That does suggest using lexer states (or some similar mechanism). You are really writing two lexers. One for inside markdown and another for the rest of the text. Don't try to mix them. The markdown is usually clearly marked e.g. between < and > or on lines that start with a special character (#) or something similer. So, markdown is one lexer state (or complete lexer) with one set of rules, while the rest of text is another lexer with a different set of rules, and the only tricky part is knowing when to switch between them (and in a good language, it is actually obvious). But, notice how that is a different implementation of separation of concerns. You have two lexers (or two lexer states). Each one doing one job and doing it simply. The hard part is the switching between them and that's not usually a complex regular expression, but the requirement to do something that is more "infrastructure" related. And, see how it answers your question about how to not lex "&" in text outside of markdown. You are simply using a different lexer [state]. It doesn't know about "&" and doesn't attempt to tokenize it. ------ Since, I mentioned screeners above, I will talk a bit about them. The simplest model is a function you can call upon matching an identifer (or another token you want to screen, e.g. maybe you have comments that can be "Javadocs" or decorations and you want to process them). This function can look the token up (or lex and parse it) and decide to change the token type (or even insert multiple tokens) before handing the token from the lexer to the parser. Note, if the function is complicated, you might be essentially implementing a preprocessor. (So, "#.*\n" could be recognized as a token, but the screener for it is the "C preprocessor"). And, the point here is that it is sometimes useful to insert something between the lexer and the parser (and maybe only for specific tokens.) Another use of a screener is the C "typedef hack" where you look your identifier up in the symbol table and see if it is a typedef and return a different token for it in that case. That allows your screen to take feedback from the parser without exposing it down to the scanner. Still, notice the separation of concerns to keep each part simpler. ----- Now, once you have a base simple lexer you are often done (or at least mostly done). Then, you can worry about the "mistakes". Comment processing is often a more complicated expression than one would like. It's worse if you allow nested comments that need balanced delimiters, e.g. "/*" and "*/". Balanced delimiters are a Dyck language and take you out of the regular expression world. You can do them with lexer states (if you can stack them), but it is easier if you simply have a CFG language (LL, LR, or PEG) to do so. Captures and backreferences can sometimes do it, provided they have a stack like quality. We had a similar issue in Yacc++, we handle nested braces as a single token "code". That makes the parser simpler. That token has matched brackets, strings, and comments inside. It is much more complicated than one really wants. It is essentially a mini-parser inside our lexers. And, it means we can't really use braces as tokens themselves. It's a sign that as language designers we make a mistake. It probably would have been better to use a lexical state model, see a brace "{" and go into a separate lexer to solve it. Sometimes having a flame thrower is not the right solution. ------ A slightly different mistake was made in the original yacc design. They make semicolons optional at the end of the rule. This means you [sometimes] need "identifier:" to recognize the start of a rule. That makes the lexer more complex. Other languages have played with treating newline as a semicolon which has its own issue, but these are often at the parser level not the lexer. I'm sure I have more to say on this topic, but also I have said more than enough.... Kind regards, Chris ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------