From: Chris F Clark
Newsgroups: comp.compilers
Subject: Re: coupling LALR with a scanner?
Date: Thu, 29 Sep 2011 00:00:50 -0400
Organization: The World Public Access UNIX, Brookline, MA
Keywords: lex, parse

I have come to this discussion a little late and may have missed some of the conversation, so excuse me if I say some things that are obvious and have been discussed before.

There are classes of languages where the "parser" context is sufficient to disambiguate which characters should be grouped into tokens. For those languages, scannerless LR parsers make sense, and there are numerous papers discussing them. Scannerless parsing generally makes the front-end writer's life simpler, in that the programmer doesn't have to manually separate the lexical grammar from the parsing grammar. Moreover, conventions like automatically dealing with whitespace and comments can make this class of languages much larger and much more useful. One might even argue that scannerless technology can produce better error messages, because a lexical error is promoted to a parsing error and more context is available for displaying the correct possibilities (assuming the parser can generate intelligible error messages to begin with, which is often a dubious claim for automated parsing systems).

However, and here I think we come to the crux of the discussion: can a scannerless system disambiguate the potentially ambiguous use of / as both a division operator and the start of a regular expression? I would argue that if the system is LR(k) for some finite k, and the use of regular expressions is not restricted to some narrow (left, preceding) context (e.g. as in Perl, where regular expressions appear only in specific operations such as m/regex/ and =~ /regex/; forgive me if my Perl is a bit rusty), then the answer is no.

The reason is simple: at some point one has to decide whether to tokenize a / as a division operator or as the start of a regular expression. With an LR(k) parser, one must make that decision after at most k tokens, but a regular expression can be arbitrarily long, so one can "easily" formulate a regular expression that remains ambiguous after k tokens of lookahead. Note that I'm positing that tokenization is equivalent to a reduction, but I think you will find that is generally true. A scannerless system based upon a more general language class (e.g. Earley parsing) could use an arbitrary amount of right (succeeding) context to disambiguate the tokenization and avoid this problem; however, the runtime cost of doing so could easily become prohibitive.
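To make the ambiguity concrete, consider this JavaScript-flavored fragment (JavaScript being the usual poster child for the / problem; the example and comments are my illustration):

    a = b
    /hi/g.exec(c)

    // division reading:  a = b / hi / g.exec(c)
    // regex reading:     a = b ; /hi/g.exec(c)

The material between the slashes can be made arbitrarily long, so no fixed k tokens of lookahead suffices to choose between the two readings. (Real JavaScript settles the question by fiat: no semicolon is inserted after b, so the division reading wins, a rule that regularly surprises users.)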
Alternately, one could design the language to restrict regular expressions to certain contexts, as I believe Perl does, and thus make the cases where one is looking for them unambiguous, or at least much simpler. (To me this would be the hallmark of a good language design, as ambiguity often confuses the programmer as well as the compiler, but that is a digression.) If the language design has the characteristic that regular expressions are restricted to certain well-defined contexts, the ones where lexer start states in a grammar would work, then there is no specific reason why a scannerless system could not find and insert those start states automatically (a sketch of the idea appears below). In theory, a non-compiler-writing guru could then use such a tool to work on the grammar without being aware of what is going on under the covers.

However, my experience suggests the result will be far less sanguine. A tool that further abstracts the grammar writer from the details of the process will not really enable non-gurus to successfully write grammars for such subtle cases. Worse, it will lull them into falsely believing that they can write grammars without learning the underlying theory, and then, when the tool produces an error message, they will be even more confused about what works and what doesn't. The result will become even more of a black art, with magic incantations and spells recited without understanding, on the belief that such superstitions enable one to write correct grammars.

Despite having written compilers for ~40 years now, I still find myself falling back on overly simplified explanations of why certain things work and others don't; to wit, the hand-waving explanation of context sensitivity I offered above. Although my intuition about context sensitivity is somewhat more subtle than what I expressed, it is hard to explain rigorously or precisely. Of course, I don't expect my anecdotal evidence to be convincing, so I offer C++ template errors as supporting evidence: how many people do you know who can write and really understand non-trivial C++ templates? I would argue that grammar writing is of roughly the same complexity as template writing.

Note, I think this also explains the popularity of LL parsing. The theory of what works in LL is much simpler and more manageable, because LL is so much more restrictive than LR, while still being adequate (once one accepts the if-then-else trick) to handle most if not all programming languages. It takes more effort to write such grammars, but novices can quickly write simple grammars with it. In contrast, many users simply give up when trying to resolve reduce-reduce conflicts (or even shift-reduce ones).

That is not to say that scannerless parser generators are a bad idea. It is simply to disabuse the notion that they will instantly enable less skilled writers to more successfully develop subtle and complicated grammars, such as those needed to handle ambiguous tokenizations. Over time, as we gain experience using them, possibly yes; but out of the box, no.

Moreover, I also think that understanding the value of a separate lexer and parser is a good thing. The implementation gap from divorcing those two aspects actually has utility. In fact, as I have been studying "high speed regular expression processing", I have determined that even within a regular expression processor (e.g. a lexer) there is an argument for introducing such a gap, to "segment" tokens into smaller chunks for both performance and expressibility reasons.
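As promised above, here is a minimal sketch of the start-state idea, written in TypeScript. The token kinds and the regexAllowed flag are my own simplification of the sort of left-context tracking real JavaScript lexers perform; it is not the output of any particular generator:

    // One bit of left-context "start state": after an identifier or `)`,
    // a `/` must be division; elsewhere it may open a regex literal.
    // This only works because the language design restricts where
    // regex literals can appear.
    type Token = { kind: "ident" | "regex" | "op"; text: string };

    function lex(src: string): Token[] {
      const tokens: Token[] = [];
      let regexAllowed = true;
      let i = 0;
      while (i < src.length) {
        const ch = src[i];
        if (/\s/.test(ch)) { i++; continue; }
        if (/[A-Za-z_]/.test(ch)) {
          let j = i + 1;
          while (j < src.length && /[A-Za-z0-9_]/.test(src[j])) j++;
          tokens.push({ kind: "ident", text: src.slice(i, j) });
          regexAllowed = false;     // a `/` after an identifier is division
          i = j;
        } else if (ch === "/" && regexAllowed) {
          let j = i + 1;
          while (j < src.length && src[j] !== "/") j++;  // scan to closing `/`
          tokens.push({ kind: "regex", text: src.slice(i, j + 1) });
          regexAllowed = false;
          i = j + 1;
        } else {
          tokens.push({ kind: "op", text: ch });
          regexAllowed = ch !== ")";  // a `/` after `)` is division
          i++;
        }
      }
      return tokens;
    }

    lex("a = b / c");   // the `/` comes out as an "op" (division)
    lex("a = /b c/");   // the `/` opens a "regex" token

The flag is exactly what a lexer start state encodes. The open question is not whether the technique works, but whether a tool can place the states automatically, and whether its users will understand the error messages when the placement goes wrong.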
At the same time, I think that the segmentation process can be automated, at least in many cases.

Hope this helps,
-Chris

******************************************************************************
Chris Clark                 email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.    Web Site: http://world.std.com/~compres
23 Bailey Rd                voice: (508) 435-5016
Berlin, MA 01503 USA        twitter: @intel_chris