Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.lightlink.com!news.iecc.com!lnews.iecc.com!nerds-end From: Kaz Kylheku Newsgroups: comp.compilers Subject: Re: shift/reduce problem Date: Thu, 20 Oct 2011 05:47:26 +0000 (UTC) Organization: A noiseless patient Spider Lines: 92 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <11-10-013@comp.compilers> References: <10-12-004@comp.compilers> NNTP-Posting-Host: lnews.iecc.com X-Trace: gal.iecc.com 1319105744 72711 64.57.183.34 (20 Oct 2011 10:15:44 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Thu, 20 Oct 2011 10:15:44 +0000 (UTC) Keywords: parse Posted-Date: 20 Oct 2011 06:15:44 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: x330-a1.tempe.blueboxinc.net comp.compilers:295 On 2010-12-02, JohnSmith wrote: > I have to parse the following string like this > > "(input i, j, k, input u, v, w)" (In case anyone still cares about this nearly one year old thread. ) The crux of the problem here is caused by the comma. The language in this parenthesized construct is actually a simple regular language, as witnessed when we strip away the annoying commas, and Lispify it a little bit: (:input i j k :input u v w) Yacc chokes on the original because the comma is overloaded with two meanings: variable separator and segment separator. Furthermore, the comma occupies a precious lookahead position. You don't know which comma you are dealing with, and you can't look beyond it at the other tokens which would tell you. Instead of the next token being either INPUT or a variable symbol, there is a comma there. The state machine doesn't know what to do. Is it the comma which should reduce "input i j k" to an input nonterminal? Or is it a variable separating comma after which we can expect another variable? Here is a Yacc file for a similar grammar with a shift/reduce conflict: %{ %} %token '(' ',' ')' %token INPUT VAR %% file: '(' list ')' list: input | list ',' input input: INPUT varlist varlist: VAR | varlist ',' VAR The following, however, has no conflicts. To produce it, all I did was simply remove all occurences of ',': %{ %} %token '(' ')' %token INPUT VAR %% file: '(' list ')' list: input | list input input: INPUT varlist varlist: VAR | varlist VAR So a possible solution here is to simply have the lexical analyzer eat these darn commas. Side note: a bigger issue is the social one: stop designing languages with superfluous punctuation in them that exists just for the sake of punctuation. Furthermore, stop caving in to colleagues who invent these things. The lexical analyzer can have a state flag which tells it whether to silently consume commas as if they were comments, or return them as tokens (since there may be areas of the grammar where we need to rely on the commas). This can be hooked into the parser: file: '(' { lex_eat_commas(1); } list ')' { lex_eat_commas(0); } Our LALR(1) parser gets to "look ahead" past the comma simply because the lexer ate it. Simple. No backtracking, no GLR to blow up the state space. Cut out the tumor, heal the patient, and prescribe a diet free of syntactic sugar.