Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.lightlink.com!news.iecc.com!lnews.iecc.com!nerds-end
From: Kaz Kylheku <kaz@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: shift/reduce problem
Date: Thu, 20 Oct 2011 05:47:26 +0000 (UTC)
Organization: A noiseless patient Spider
Lines: 92
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <11-10-013@comp.compilers>
References: <10-12-004@comp.compilers>
NNTP-Posting-Host: lnews.iecc.com
X-Trace: gal.iecc.com 1319105744 72711 64.57.183.34 (20 Oct 2011 10:15:44 GMT)
X-Complaints-To: abuse@iecc.com
NNTP-Posting-Date: Thu, 20 Oct 2011 10:15:44 +0000 (UTC)
Keywords: parse
Posted-Date: 20 Oct 2011 06:15:44 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: x330-a1.tempe.blueboxinc.net comp.compilers:295

On 2010-12-02, JohnSmith <csnews77@gmail.com> wrote:
> I have to parse the following string like this
>
> "(input i, j, k, input u, v, w)"

(In case anyone still cares about this nearly one year old thread. )

The crux of the problem here is caused by the comma.

The language in this parenthesized construct is actually a simple
regular language, as witnessed when we strip away the annoying commas,
and Lispify it a little bit:


  (:input i j k :input u v w)

Yacc chokes on the original because the comma is overloaded with two
meanings: variable separator and segment separator. Furthermore, the
comma occupies a precious lookahead position. You don't know which
comma you are dealing with, and you can't look beyond it at the other
tokens which would tell you.

Instead of the next token being either INPUT or a variable symbol,
there is a comma there. The state machine doesn't know what to do. Is
it the comma which should reduce "input i j k" to an input
nonterminal? Or is it a variable separating comma after which we can
expect another variable?

Here is a Yacc file for a similar grammar with a shift/reduce conflict:

  %{

  %}

  %token '(' ',' ')'
  %token INPUT VAR

  %%

  file: '(' list ')'

  list: input | list ',' input

  input: INPUT varlist

  varlist: VAR | varlist ',' VAR

The following, however, has no conflicts. To produce it, all I
did was simply remove all occurences of ',':

  %{

  %}

  %token '('  ')'
  %token INPUT VAR

  %%

  file: '(' list ')'

  list: input | list  input

  input: INPUT varlist

  varlist: VAR | varlist VAR


So a possible solution here is to simply have the lexical analyzer eat
these darn commas.

	Side note: a bigger issue is the social one: stop designing languages
	with superfluous punctuation in them that exists just for the sake of
	punctuation. Furthermore, stop caving in to colleagues who invent
	these things.

The lexical analyzer can have a state flag which tells it whether to
silently consume commas as if they were comments, or return them as
tokens (since there may be areas of the grammar where we need to rely
on the commas).  This can be hooked into the parser:


 file: '('           { lex_eat_commas(1); }
           list ')'  { lex_eat_commas(0); }

Our LALR(1) parser gets to "look ahead" past the comma simply because
the lexer ate it.

Simple. No backtracking, no GLR to blow up the state space.

Cut out the tumor, heal the patient, and prescribe a diet free of syntactic
sugar.