Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: RE: Why no shift-shift conflicts. Date: Sun, 30 Jan 2022 03:14:05 +0200 Organization: Compilers Central Lines: 130 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-01-120@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="17138"; mail-complaints-to="abuse@iecc.com" Keywords: parse, LR(1) Posted-Date: 29 Jan 2022 20:41:01 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2877 Our esteemed moderator was right in saying that this is part of how the LR family of algorithms work. In state machines, you can build a state that represents two (or more) rules (productions). So, shift-shift conflicts don't occur in the LR machine. You only care about conflicts when you need to reduce. More on that a bit later in this reply. In LL(k) algorithms, you don't get to do that (past the k-th token of lookahead at the beginning of the rule. So, when your LL(k) parser generator tells you that you need to left-factor your two rules. And, it usually does so by declaring your grammar as "ambiguous" which it might not be, just not LL(k). It has actually declared a shift-shift conflict, just not labelling it as that, as the LL people don't think of advancing the token index through the rule as shifting (although in an LR parser, that's what it is called because of how the stack is handled in LR parsing). Next, I want to talk about this example: X: '1' { a = a + 1; } '2' 'x'; Y: '1' { a = a - 1; } '2' 'y'; Z: '1' '2' 'z' { a = 0 }; [The mid-rule action is a cheat, a shorthand for this: X: '1' x1 '2' 'x' ; x1: { a = a + 1; } ; Y: '1' y1 '2' 'y' ; y1: { a = a - 1; } ; That's why those actions create conflicts where there were none before. -John] The cheat as our moderator calls it, is *somewhat* an artifact of the traditional yacc/bison implementation of it. You can build a state machine model of the rules (we do in "Yacc++", the one by compiler resources as opposed to the C++ port of yacc, which is often refered to by the same name although not usually in initial cap spelling. I only mention the distinction lest someone use the latter and say, wait it doesn't do that) where you haven't introduced those artificial non-terminals (x1 and y1) and thus don't have to reduce them. If so, you can make the conflict "go away" at least in terms of what the parser needs to do. However, that in itself is somewhat misleading. In that, while you can make that work at the grammar level, you will have changed when the action code runs. You will postpone running it until after the lexer has processed and returned to the parser both the "2" token and the "x" or "y" token following it. If one is depending upon that action to influence how the lexer works (e.g. changing the lexer state or something similar), the action will occur too late. Thus, although Yacc++ can move the action code later in the processing of the parser's state machine, it still wants to report a warning/error to tell the user that the action had to be delayed as the two actions were in conflict when it shifted the "1" token. Now, we don't call this a shift-shift conflict, but rather an "action code conflict" because it is the two actions that are in conflict, not the shifting. And, worth noting, if the two actions are the same (textually), we don't even move it, because both transitions want to do the same thing and there is no "action code conflict". Now, getting back to LL(k) parsers, they mask this problem. Because before they start running a rule, they get to look-ahead (i.e. read from the lexer) up to k tokens, so they will get the "1" "2" and "x" or "y" tokens (k=3) before running the rule at all. Thus, they have the same issue in interacting with the lexer, but haven't told the grammar writer about it. Now, about conflicts at reduce time. At reduce time, an LR(k) parser gets to read up to k tokens of lookahead before deciding whether to reduce or not. If the generator is LALR(k), there are some additional restrictions on that, but not that important to what I am going to say. So, in this case, a LR parser generator can look k tokens past the end of the rule (or past the action code, with the above mentioned caveat) before declaring a conflict. And, particularly worth noting are the GLR variants of the LR algorithm. In that case, the parser generator can simply generate a parse forest, effectively postponing the decision (potentially until the entire file has been processed) and only then noting that one still has a forest and not a tree that the grammar is ambiguous. Now, this has the positive effect of eliminating spurious conflict messages, because the parser can handle the grammar by generating a parse forest if needed. But, it also has the disadvantage of removing the check that at parser generation time the grammar has been detected as being potentially ambiguous. Now, for some grammars, this can be resolved and the grammar be shown to be ambiguous or not. However, in general, such detection is equivalent to solving the halting problem. (The proof of that being true is beyond what I can explain.) Finally, I want to mention something about Syntactic (and Semantic) predicates, as they were a part of this discussion earlier. They are somewhat an alternate solution to this problem. In fact, with the right model, they can be used to turn (even an LL) parser generator into a "universal turing machine", because they allow one to write backtracking parsers (with unlimited backtracking) and the allow one to effectively give oneself unlimited look-ahead before deciding which rule to apply and at the same time fine grained control over the order the rules are examined in. This is perhaps Terence Parr's greatest invention. In fact, it is so important that a whole new branch of parsing technologies "Parsing Expression Grammars" (PEGs) have evolved from it. Now, without the proper checks, it still has the problem of allowing ambiguous grammars, but with the twist that it guarantees an unambiguous parse tree (not a parse forest). Now, at some point, all of these advances in parsing technology will be merged (it is on my to do list, to do so, but I keep having other things to finish first). If done right, one potentially has a parser generator that will warn you if your grammar is ambiguous and yet allow you to process it in an unambiguous manner or get out a parse forest of all the possible interpretations (with fine grained control to get exactly the level of ambiguity you want handled in each manner). And, if really done right, it will give you something that looks like hand- written recursive descent code in most cases, and will allow you to edit that code and get the grammar tweaks back with the appropriate sanity checks. We have the theory to do so, but it isn't a one month project, so it remains just a dream. Of course, I won't be surprised if there are still people hacking on hand-written recursive descent parsers even if the dream comes to fruition. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------