Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!nx01.iad01.newshosting.com!newshosting.com!198.186.194.249.MISMATCH!transit3.readnews.com!news-out.readnews.com!news-xxxfer.readnews.com!news.misty.com!news.iecc.com!nerds-end
From: "Karsten Nyblad" <uu3kw29sb7@snkmail.com>
Newsgroups: comp.compilers
Subject: coupling LALR with a scanner?
Date: Thu, 07 Jul 2011 10:46:18 +0200
Organization: Compilers Central
Lines: 50
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <11-07-016@comp.compilers>
References: <11-07-013@comp.compilers> <11-07-015@comp.compilers>
NNTP-Posting-Host: news.iecc.com
X-Trace: gal.iecc.com 1310060923 53347 64.57.183.58 (7 Jul 2011 17:48:43 GMT)
X-Complaints-To: abuse@iecc.com
NNTP-Posting-Date: Thu, 7 Jul 2011 17:48:43 +0000 (UTC)
Keywords: lex, parse, design
Posted-Date: 07 Jul 2011 13:48:43 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: x330-a1.tempe.blueboxinc.net comp.compilers:192

>I am exploring the possibility of choosing between several lexers
>depending on the current state of an LALR parser (i.e. if a state can
>accept productions such as A => A . 'a' b, B => . 'c', then the
>selected lexer will accept (at least) 'a' and 'c', for rightmost
>positions the lookahead symbols would be acceptable as well)

The problem with using an LALR parser for that is that the state is not
enough information to know which token are accepted.  An example:

A -> a B a
A -> b B b
B -> a

( A and B being non terminals and a and b being terminals.)
In an LALR parser you will get a state with just one item [ B -> a . ]
with a lookahead set of { a, b }.  However, just one of a and b is
acceptable.

There are three ways to get around that problem:

First you can use an LR parser instead.  Some parser generators
generate an LALR parser and use state splitting for solving conflicts
that come from LALR parsers being less powerful than LR parser.  Such
generators can easily be modified to split states in this case too, in
practise generating an LR parser.

Secondly you can change the parser driver program, i.e., the program
that is generated by the parser generator.  If it looks up information
in tables, it is easy to let the driver program, such that it finds out
which tokens are acceptable before asking the scanner for a token.  You
test whether or not a token is acceptable by copying the parse stack,
place the token being tested in the window, and then parse until the
token is pushed on the stack or an error occurs.  (Of course the first
indicates that the token is acceptable, and the second that it is not.)

Thirdly you can use a parser generator like Bison, that implements
general LR parsing.  General LR parser generators work on multiple
stacks.  You change the lexer such that it returns a set of tokens, a
token for each way the next tokens may be lexed.  Say the lexed
returns three tokens.  Then the driver program makes three copies of
the stacks, and parses on with each token being applied to its own
stack.

At last: You might get better answers if you are more detailed with
which problem you are trying to solve.  I have been interested in such
techniques for parsing languages like PL/I where keywords may be used
as identifiers, Ada where a single quote may be a token itself or the
start of a character constant, and BCPL where line ends may end
statements.  The best solution is not necessarily the same in all the
cases.