From: Paul B Mann
Newsgroups: comp.compilers
Subject: Re: Please provide a learning path for mastering lexical analysis languages
Date: Sun, 8 May 2022 22:27:55 -0700 (PDT)
Organization: Compilers Central
Message-ID: <22-05-027@comp.compilers>
References: <22-05-010@comp.compilers> <22-05-023@comp.compilers>
Keywords: lex
Posted-Date: 13 May 2022 13:03:24 EDT

/* Token Rules */

             -> \z

             -> literal
             -> integer
             -> decimal
             -> real

             -> letter (letter|digit)*

integer      -> digit+

real         -> integer exp
             -> decimal exp

decimal      -> digit+ '.'
             -> '.' digit+
             -> digit+ '.' digit+

exp          -> 'e' digit+
             -> 'E' digit+
             -> 'e' '-' digit+
             -> 'E' '-' digit+
             -> 'e' '+' digit+
             -> 'E' '+' digit+

literal      -> ''' lchar '''

lchar        -> lany
             -> '\' '\'
             -> '\' '''
             -> '\' '"'
             -> '\' 'n'
             -> '\' 't'
             -> '\' 'a'
             -> '\' 'b'
             -> '\' 'f'
             -> '\' 'r'
             -> '\' 'v'
             -> '\' '0'

             -> '"' schar* '"'

schar        -> sany
             -> '\' '\'
             -> '\' '''
             -> '\' '"'
             -> '\' 'n'
             -> '\' 't'
             -> '\' 'a'
             -> '\' 'b'
             -> '\' 'f'
             -> '\' 'r'
             -> '\' 'v'
             -> '\' '0'

{whitespace}   -> whitechar+
{commentline}  -> '/' '/' neol*
{commentblock} -> '/' '*' na* '*'+ (nans na* '*'+)* '/'

/* Character Sets */

any        = 0..255 - \z
lany       = any - ''' - '\' - \n
sany       = any - '"' - '\' - \n
letter     = 'a'..'z' | 'A'..'Z' | '_'
digit      = '0'..'9'
whitechar  = \t | \n | \r | \f | \v | ' '
na         = any - '*'          // not asterisk
nans       = any - '*' - '/'    // not asterisk, not slash
neol       = any - \n           // not end of line

\t = 9     // tab
\n = 10    // newline
\v = 11    // vertical tab
\f = 12    // form feed
\r = 13    // return
\z = 26    // end of file
\b = 32    // blank/space

/* End */

The above lexical rules define C-language symbols. It's just a
lexical grammar, not too hard to figure out.

This is input to the DFA lexer generator, which is provided with the
LRSTAR parser generator on SourceForge.net. DFA creates lexers that
run 80% faster than "flex" lexers and are about the same size.

If you need more language power to define a lexer ... that's what
parsers are for.

BTW, LRSTAR creates parsers in C++ that were running 140 times faster
than those created by ANTLR, using the C++ target, the last time I did
a comparison, 2 years ago.
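
For anyone puzzling over the {commentblock} rule, here is a rough
hand-written sketch in Python of the state machine that the expression
'/' '*' na* '*'+ (nans na* '*'+)* '/' describes. This is only an
illustration, not the code the DFA generator emits; the function name
match_block_comment and the state numbering are made up for this
example.

```python
def match_block_comment(text, start=0):
    """Return the index just past a block comment beginning at `start`,
    or None if no complete comment begins there.

    Hand-written recognizer for:
        '/' '*' na* '*'+ (nans na* '*'+)* '/'
    (hypothetical helper, not generated by any tool)
    """
    # State 0: expect the opening '/'
    # State 1: expect the opening '*'
    # State 2: inside the body, last character was not '*'   (na*)
    # State 3: inside the body, just saw one or more '*'      ('*'+)
    state = 0
    i = start
    while i < len(text):
        c = text[i]
        if c == "\x1a":
            # \z (26) is excluded from 'any', so it can never be matched.
            return None
        if state == 0:
            if c != "/":
                return None
            state = 1
        elif state == 1:
            if c != "*":
                return None
            state = 2
        elif state == 2:
            if c == "*":
                state = 3
            # any other character (na) keeps us in the body
        else:  # state == 3
            if c == "/":
                return i + 1      # '*'+ followed by '/': comment is closed
            if c != "*":
                state = 2         # nans: back into the body
            # another '*' keeps us in state 3
        i += 1
    return None  # input ended before the closing "*/"


if __name__ == "__main__":
    for sample in ["/* hi */ x", "/***/", "/*/", "/* a * b */"]:
        print(repr(sample), "->", match_block_comment(sample))
```

The point of the nans set is visible in state 3: after a run of
asterisks, a slash ends the comment, another asterisk extends the run,
and anything else drops back into the body. That is why "/*/" is not
accepted as a complete comment while "/**/" is.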