Re: Why are ambiguous grammars usually a bad idea? Why are languages usually defined and implemented with ambiguous grammars?

From	Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups	comp.compilers
Subject	Re: Why are ambiguous grammars usually a bad idea? Why are languages usually defined and implemented with ambiguous grammars?
Date	2021-12-30 18:00 +0000
Organization	A noiseless patient Spider
Message-ID	<21-12-025@comp.compilers> (permalink)
References	<21-12-003@comp.compilers> <21-12-017@comp.compilers> <21-12-020@comp.compilers>
Show all headers | View raw
On 2021-12-30, Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote:
> On Wednesday, December 29, 2021 at 11:28:34 PM UTC+1, Kaz Kylheku wrote:
>> On 2021-12-16, Roger L Costello wrote:
>> > Question: Opine about why languages are usually defined and implemented with
>> > ambiguous grammars.
>> But they aren't.
>>
>> Languages are processed as a stream of characters or tokens, with hidden
>> rules about how those relate together and the meaning that emerges.
>> All of the rules are hidden, including the entire grammar.
>>
>> If you're only aware of some of the hidden rules, but not others, then
>> you see ambiguity.
>>
>> But if you're only aware of some of the hidden rules, but not others,
>> then you are not working with the correct language.
>>
>> For instance, I don't know of any mainstream language in which if/else
>> is actually ambiguous. They have a hidden rule like that the else goes
>> with the closest preceding if statement.
>
> When designing a grammar and implementing a parser: the grammar can
> either be unambiguous by design or unambiguous by accident. The
> viewpoint that "there isn't any mainstream language in which if/else
> is actually ambiguous" is actually the latter option: unambiguous by
> accident.

Languages can be ambiguous at the specification level, even if a
given implementation behaves unambiguouisly:

E.g.:

1. The implementation is completely unambiguous and says that else goes
   with closest preceding if.
2. The documentation says something else. (So the users figure this out
   and get their code working based on the implementation.)
3. (But) a new implementation appears, based on the documentation.

Or:

1. The language spec doesn't say anything about which way something
   is parsed.
2. Mutiple implementations do it willy-nilly.

Clearly, for instance in C, we have semantic ambiguities like
a[i] = i++. (A parser depending on the behavior of something like that
could have a grammar ambiguity: something is parsed in two or more
different ways based on which way the undefined construct behaves. That
would be a defect, of course.)

> A primary reason why grammars in many mainstream languages (that don't
> have a parser generated straight from a verified grammar) are
> unambiguous isn't intentional design, but rather it is a consequence
> of the fact that those parsers are directly implemented in a language
> that is executing statements/expressions/instructions without
> verifying consequences of the executions. Some examples of languages
> with such execution properties: assembly language (such as: i386,
> ARM), C, Haskell.

I don't quite follow this; you seem to be saying that Turing machines
are deterministic, and so if we implement a parser as a Turing process,
it will be "accidentally" unambiguous because of determinism.

However, it may so happen that the lack of ambiguity depends on a whole
lot of context. For instance, a Turing process parsing a language,
proceeding left to right, could decide the rule for "if/then/else" on
a case-by-case basis, influenced by everything it has parsed before.

Suppose you have a language which parses a sequence of top-level
expressions from a terminal or file, and executes each one before moving
to the next. Those expressions could be used to invoke an API in the
language implementation to change the treatment of subsequent syntax.

Sure, everything is still unambiguous, if we take into account the
entire stream of expressions from the beginning and understand its
effect on the parsing machine.

If you don't do anything of this sort, and just write, say, a recursive
descent parser which isn't influenced by any weird state flags (whether
purely internal or external too) that change the parsing, and in a safe,
very high level language in which there are few risks of nonportable or
undefined behaviors, then you end up with a "simply unambiguous"
grammar. You should be able to investigate the behavior of if/else
with several test cases and be confident that the observations hold
in all relevant contexts where that construct can appear.
You might not exactly know what the grammar is for the entire language,
but if you figure it out from the code and accurately document it,
then you're good. (Unless the thing is required to conform to some
external specification and fails.)

> Contrary to the accidental approach, a parser
> generator by design cares about consequences and it is verifying that
> the specification actually is unambiguous despite the fact that in the
> end the parser gets compiled down into machine instructions.
> Verification means to search the whole search space (or at least a
> reasonably large subspace of it) - but asm/C/Haskell will run a search
> only if it is explicitly (step by step) forced by the programmer to
> perform a search.

Yes, e.g. a LALR(1)  shift-reduce parser generator generates the entire
space of LR(0) items that drive the stack machine, and then when it
populates tables, it discovers conflicts there.

With someone's hand-written parser, we cannot be sure without
searching with test cases, which are language instances, which is
intractable to do exhaustively.

Just because if/else is behaving in certain ways in certain test cases
doesn't mean that a new, more complex test case with more/different
context cannot be found in which the if/else behaves differently.

The procedural/recursive parser code doesn't declare to us what the
grammar is. It could be hiding multiple rules for if/else active in
different contexts which differ from each other.

That's not an actual ambiguity, but for the purposes of working with the
language, it's an informal ambiguity (related to my earlier observation
of the user not knowing what all hidden rules are).

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[If you write a hand written recursive descent parser, which was typical
before yacc made it easy to use a LALR parser, it was common
for the language the parser accepted to be somewhat different from the
one the programmer intended to accept due to less than complete checking
for unexpected inputs.
On the other hand, that C example isn't ambiguous, it's deliberately
indeterminate.  Early Fortran allowed the compiler to compile any
mathematically equivalent, not just numerically equivalent, version
of an expression so A*B+A*C could turn into A*(B+C) which was great
for optimizing, not so much for predictable results. -John]
Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar
Thread

csiph-web