From: Chris F Clark
Newsgroups: comp.compilers
Subject: Re: coupling LALR with a scanner?
Date: Thu, 29 Sep 2011 00:00:50 -0400
Organization: The World Public Access UNIX, Brookline, MA
Keywords: lex, parse

I have come to this discussion a little late and may have missed some of the conversation, so excuse me if I say some things that are obvious and have been discussed before.

There are classes of languages where the "parser" context is sufficient to disambiguate which characters should be grouped into tokens. For those languages, scannerless LR parsers make sense, and there are numerous papers discussing them. Scannerless parsing generally makes the front-end writer's life simpler, in that the programmer doesn't have to manually separate the lexical grammar from the parsing grammar. Moreover, conventions like automatically dealing with whitespace and comments can make this class of languages much larger and much more useful. One might even argue that scannerless technology can produce better error messages, because a lexical error is promoted to a parsing error and more context is available for displaying the correct possibilities (assuming the parser can generate intelligible error messages to begin with, which is often a dubious claim for automated parsing systems).

However, and here I think we come to the crux of the discussion: can a scannerless system disambiguate the potentially ambiguous use of / as both a division operator and the start of a regular expression? I would argue that if the system is LR(k) for some finite k, and the use of regular expressions is not restricted to some narrow (left, preceding) context (e.g. as in Perl, where regular expressions appear only in specific operations such as m/regex/ and =~ /regex/; forgive me if my Perl is a bit rusty), then the answer is no.

The reason is simple: at some point one has to decide whether to tokenize a / as a division operator or as the start of a regular expression. With an LR(k) parser, one must make that decision after at most k tokens, but a regular expression can be arbitrarily long, so one can "easily" formulate a regular expression that remains ambiguous after k tokens of lookahead. Note that I'm positing that tokenization is equivalent to a reduction, but I think you will find that is generally true. A scannerless system based upon a more general language class (e.g. Earley parsing) could use an arbitrary amount of right (succeeding) context to disambiguate the tokenization and avoid this problem; however, the runtime cost of doing so could easily become prohibitive.
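To make the ambiguity concrete, consider this JavaScript-flavored fragment (JavaScript being the usual poster child for the / problem; the example and comments are my illustration):

    a = b
    /hi/g.exec(c)

    // division reading:  a = b / hi / g.exec(c)
    // regex reading:     a = b ; /hi/g.exec(c)

The material between the slashes can be made arbitrarily long, so no fixed k tokens of lookahead suffices to choose between the two readings. (Real JavaScript settles the question by fiat: no semicolon is inserted after b, so the division reading wins, a rule that regularly surprises users.)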
Alternately, one could design the language to restrict regular expressions to certain contexts, as I believe Perl does, and thus make the cases where one is looking for them unambiguous, or at least much simpler. (To me this would be the hallmark of a good language design, as ambiguity often confuses the programmer as well as the compiler, but that is a digression.) If the language design has the characteristic that regular expressions are restricted to certain well-defined contexts, the ones where lexer start states in a grammar would work, then there is no specific reason why a scannerless system could not find and insert those start states automatically (a sketch of the idea appears below). In theory, a non-compiler-writing guru could then use such a tool to work on the grammar without being aware of what is going on under the covers.

However, my experience suggests the result will be far less sanguine. A tool that further abstracts the grammar writer from the details of the process will not really enable non-gurus to successfully write grammars for such subtle cases. Worse, it will lull them into falsely believing that they can write grammars without learning the underlying theory, and then, when the tool produces an error message, they will be even more confused about what works and what doesn't. The result will become even more of a black art, with magic incantations and spells recited without understanding, on the belief that such superstitions enable one to write correct grammars.

Despite having written compilers for ~40 years now, I still find myself falling back on overly simplified explanations of why certain things work and others don't; to wit, the hand-waving explanation of context sensitivity I offered above. Although my intuition about context sensitivity is somewhat more subtle than what I expressed, it is hard to explain rigorously or precisely. Of course, I don't expect my anecdotal evidence to be convincing, so I offer C++ template errors as supporting evidence: how many people do you know who can write and really understand non-trivial C++ templates? I would argue that grammar writing is of roughly the same complexity as template writing.

Note, I think this also explains the popularity of LL parsing. The theory of what works in LL is much simpler and more manageable, because LL is so much more restrictive than LR, while still being adequate (once one accepts the if-then-else trick) to handle most if not all programming languages. It takes more effort to write such grammars, but novices can quickly write simple grammars with it. In contrast, many users simply give up when trying to resolve reduce-reduce conflicts (or even shift-reduce ones).

That is not to say that scannerless parser generators are a bad idea. It is simply to disabuse the notion that they will instantly enable less skilled writers to more successfully develop subtle and complicated grammars, such as those needed to handle ambiguous tokenizations. Over time, as we gain experience using them, possibly yes; but out of the box, no.

Moreover, I also think that understanding the value of a separate lexer and parser is a good thing. The implementation gap from divorcing those two aspects actually has utility. In fact, as I have been studying "high speed regular expression processing", I have determined that even within a regular expression processor (e.g. a lexer) there is an argument for introducing such a gap, to "segment" tokens into smaller chunks for both performance and expressibility reasons.
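As promised above, here is a minimal sketch of the start-state idea, written in TypeScript. The token kinds and the regexAllowed flag are my own simplification of the sort of left-context tracking real JavaScript lexers perform; it is not the output of any particular generator:

    // One bit of left-context "start state": after an identifier or `)`,
    // a `/` must be division; elsewhere it may open a regex literal.
    // This only works because the language design restricts where
    // regex literals can appear.
    type Token = { kind: "ident" | "regex" | "op"; text: string };

    function lex(src: string): Token[] {
      const tokens: Token[] = [];
      let regexAllowed = true;
      let i = 0;
      while (i < src.length) {
        const ch = src[i];
        if (/\s/.test(ch)) { i++; continue; }
        if (/[A-Za-z_]/.test(ch)) {
          let j = i + 1;
          while (j < src.length && /[A-Za-z0-9_]/.test(src[j])) j++;
          tokens.push({ kind: "ident", text: src.slice(i, j) });
          regexAllowed = false;     // a `/` after an identifier is division
          i = j;
        } else if (ch === "/" && regexAllowed) {
          let j = i + 1;
          while (j < src.length && src[j] !== "/") j++;  // scan to closing `/`
          tokens.push({ kind: "regex", text: src.slice(i, j + 1) });
          regexAllowed = false;
          i = j + 1;
        } else {
          tokens.push({ kind: "op", text: ch });
          regexAllowed = ch !== ")";  // a `/` after `)` is division
          i++;
        }
      }
      return tokens;
    }

    lex("a = b / c");   // the `/` comes out as an "op" (division)
    lex("a = /b c/");   // the `/` opens a "regex" token

The flag is exactly what a lexer start state encodes. The open question is not whether the technique works, but whether a tool can place the states automatically, and whether its users will understand the error messages when the placement goes wrong.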
At the same time, I think that the segmentation process can be automated, at least in many cases.

Hope this helps,
-Chris

******************************************************************************
Chris Clark                 email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.    Web Site: http://world.std.com/~compres
23 Bailey Rd                voice: (508) 435-5016
Berlin, MA 01503 USA        twitter: @intel_chris