Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!nx01.iad01.newshosting.com!newshosting.com!198.186.194.249.MISMATCH!transit3.readnews.com!news-out.readnews.com!news-xxxfer.readnews.com!news.misty.com!news.iecc.com!nerds-end From: "Karsten Nyblad" Newsgroups: comp.compilers Subject: coupling LALR with a scanner? Date: Thu, 07 Jul 2011 10:46:18 +0200 Organization: Compilers Central Lines: 50 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <11-07-016@comp.compilers> References: <11-07-013@comp.compilers> <11-07-015@comp.compilers> NNTP-Posting-Host: news.iecc.com X-Trace: gal.iecc.com 1310060923 53347 64.57.183.58 (7 Jul 2011 17:48:43 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Thu, 7 Jul 2011 17:48:43 +0000 (UTC) Keywords: lex, parse, design Posted-Date: 07 Jul 2011 13:48:43 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: x330-a1.tempe.blueboxinc.net comp.compilers:192 >I am exploring the possibility of choosing between several lexers >depending on the current state of an LALR parser (i.e. if a state can >accept productions such as A => A . 'a' b, B => . 'c', then the >selected lexer will accept (at least) 'a' and 'c', for rightmost >positions the lookahead symbols would be acceptable as well) The problem with using an LALR parser for that is that the state is not enough information to know which token are accepted. An example: A -> a B a A -> b B b B -> a ( A and B being non terminals and a and b being terminals.) In an LALR parser you will get a state with just one item [ B -> a . ] with a lookahead set of { a, b }. However, just one of a and b is acceptable. There are three ways to get around that problem: First you can use an LR parser instead. Some parser generators generate an LALR parser and use state splitting for solving conflicts that come from LALR parsers being less powerful than LR parser. Such generators can easily be modified to split states in this case too, in practise generating an LR parser. Secondly you can change the parser driver program, i.e., the program that is generated by the parser generator. If it looks up information in tables, it is easy to let the driver program, such that it finds out which tokens are acceptable before asking the scanner for a token. You test whether or not a token is acceptable by copying the parse stack, place the token being tested in the window, and then parse until the token is pushed on the stack or an error occurs. (Of course the first indicates that the token is acceptable, and the second that it is not.) Thirdly you can use a parser generator like Bison, that implements general LR parsing. General LR parser generators work on multiple stacks. You change the lexer such that it returns a set of tokens, a token for each way the next tokens may be lexed. Say the lexed returns three tokens. Then the driver program makes three copies of the stacks, and parses on with each token being applied to its own stack. At last: You might get better answers if you are more detailed with which problem you are trying to solve. I have been interested in such techniques for parsing languages like PL/I where keywords may be used as identifiers, Ada where a single quote may be a token itself or the start of a character constant, and BCPL where line ends may end statements. The best solution is not necessarily the same in all the cases.