Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: George Neuner Newsgroups: comp.compilers Subject: Re: Please provide a learning path for mastering lexical analysis languages Date: Sun, 08 May 2022 12:52:14 -0400 Organization: A noiseless patient Spider Lines: 121 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-05-023@comp.compilers> References: <22-05-010@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="39045"; mail-complaints-to="abuse@iecc.com" Keywords: lex Posted-Date: 08 May 2022 14:45:37 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3002 On Fri, 6 May 2022 11:59:17 +0000, Roger L Costello wrote: >I want to master lexical analysis. > >In Chris Christopher's last post he said: > >> [Flex] is not even as powerful as some other lexical analysis languages >> and even exploiting its power often requires things that cannot be >> expressed in Flex alone nor can they be done in ways that are simple >> and easy to reason about. > >I am currently in the process of mastering Flex. I feel that mastering Flex >will give me a foundation for learning other more advanced lexical analysis >languages. "This XYZ lexical analysis language does it this way, Flex does it >this other way. Ah, yes, I can see how XYZ's way is simpler and more >powerful." > >From Chris's post I see that there is much to learn beyond Flex. Thank you >Chris. > >Can you provide a learning path, please? A learning path for mastering lexical >analysis languages. It's not the grammars (languages) used by the tools that you need to learn - it's the methods used the tools that are important. >After I master Flex, what lexical analysis language should I then master? And >after mastering that, what is the next lexical analysis language that I should >master? > >/Roger Lexical analysis is not terribly interesting ... it really just is about tokenizing the input. There are 2 main techniques: regular expressions, and recursive descent code. In practice, these may be combined. Flex essentially is regex plus a pushdown stack. The stack offers some limited ability to recognize simple 'inclusion' languages that regex can't handle alone, but in the Chomsky sense it does not add very much. You can look at ANTLR for an example of RD lexing. There are others. The main difference between regex and RD is that regex is an unguided method which simply changes state in the match automaton in response to incoming characters. In contrast, the incoming character in RD guides the path taken through the matching code: e.g., if ('h' == ch) if ('e' == ch) if ('l' == ch) if ('p' == ch) { // recognized 'help' } else if ('l' == ch) : In practice real RD code often matches whole words or word prefixes rather than individual characters as shown here, but the principle is the same. And sometimes regex are used to perform the word matches. Grammatical syntax analysis - i.e. parsing - is where things get interesting. There are 3 typical methods of parsing: LL, LR, and PEG. LL and PEG typically are associated with RD code. LR typically is associated with table driven pushdown automata. [I do recall seeing a paper in which the LR automata was /implemented/ using RD code in CPS, but naturally I can't find that paper just now. It really just proves that there are more ways to implement FA than many people realize 8-). I have not similarly seen an LL parser implemented with table or graph FA/PDA, but I suspect that could be done as well. Chris Clark probably knows better.] You also may see references also to 'combinator' parsing, which some people consider to be a unique method, but which more typically is associated with PEG. IMO it is simply an implemenation technique that is equally applicable to LL, but MMV on this point because - although their implementation looks similar - there are some not-insignificant technical differences between LL and PEG. Bison is an LR based tool associated strongly with Flex. Bison offers a few different variants of LR (LALR,SLR,GLR) that offer different limitations on the recognized language and (where they overlap) differences in the sizes of the generated parsers. Parser size was much more important in the past - nowadays it's only relevant to particular programming niches such as embedded. ANTLR is an LL based tool that generates RD code for both parsers and lexers. Since v3, ANTLR is LL(*) - prior to v3, ANTLR was LL(k). Both LL(k) and LL(*) are interesting in their own right and you may wish to look at both. [Many LL tools only are LL(1). LL(k) and LL(*) both deal with management of lookahead. LL(k) was created to deal with limitations of LL(1), and LL(*) relieves the programmer of having to determine the proper 'k' for the grammar.] The biggest difference is in how you deal with recursive structure in the language: in LL (or PEG) you need to right-factor your grammar, in LR you need to left-factor instead. The factoring determines the order in which recursive structures are recognized. I personally have little experience with existing PEG tools ... hopefully someone can give you some suggestions. Combinator parsers usually are hand written using a library of matching (or match creating) functions. Most often combinator parsers are seen in [higher order] functional languages where the functions easily can be strung together, but there are at least a few libraries available for C and C++. Hope this helps. George