Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.linkpendium.com!news.linkpendium.com!news.iecc.com!nerds-end
From: "BartC" <bc@freeuk.com>
Newsgroups: comp.compilers
Subject: Re: Recursive descent parsing and optimization, was Good practical language and OS agnostic text?
Date: Sun, 22 Apr 2012 12:51:44 +0100
Organization: A noiseless patient Spider
Lines: 83
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <12-04-066@comp.compilers>
References: <12-04-019@comp.compilers> <12-04-056@comp.compilers> <12-04-060@comp.compilers>
NNTP-Posting-Host: news.iecc.com
X-Trace: leila.iecc.com 1335104973 17164 64.57.183.58 (22 Apr 2012 14:29:33 GMT)
X-Complaints-To: abuse@iecc.com
NNTP-Posting-Date: Sun, 22 Apr 2012 14:29:33 +0000 (UTC)
Keywords: parse, design
Posted-Date: 22 Apr 2012 10:29:33 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:609

"BGB" <cr88192@hotmail.com> wrote in message
> On 4/21/2012 2:22 AM, Uli Kusterer wrote:

>> - Recursive descent parsers: It's the obvious way to write a parser.

> although I use recursive descent, the above sounds different from what I
> usually do.

> ReadStatement:
>   checks for and handles vaious statement types
>   calls ReadExpression
>
> ReadExpression:
>   (actually, this is a tower of functions, one for each precedence
> level, working from lowest to highest precedence)

Sounds like it's influenced by the C grammar, which defines expressions
using something like 13 or 17 layers of precedence.

Beyond about 3-4 levels, I found that unmanageable. For expression syntax, I
don't use any precedence in the grammar at all; I have precedence as an
attribute of an operator, and an expression can be parsed with a single
function.

Or rather two: readfactor(priority), and readterm(). Readfactor() deals with
the  binary operators linking successive terms, while readterm() does all
the real work (since my syntax doesn't distinguish between expressions and
statements, that's quite a big workload).

>> - Tokenizing: Essentially grab all the words in your source text and
>> build an array with an entry for each so you can more quickly walk
>> forward and backward without having to scan individual characters.

> partly agreed, except that it isn't really strictly necessary to build
> an array up-front.

> most of my parsers don't bother using an array, but instead just call
> the tokenizer function directly to peek and parse tokens based on the
> current "stream position" (generally a raw char pointer in my language
> parsers).

I've tried reading all the tokens in a separate pass, but didn't really like
it. And it takes a lot more storage as well, especially with macro
expansions.

Instead I read them as I go along, but with provision for a one-symbol
look-ahead.

>> - Syntax tree: Build a tree structure that represents the parsed program.
>> You

> typically, things like simplifying expressions, propagating constants,
> ... is done in a stage I have typically called "reduction", which
> happens between parsing and prior to producing (bytecode) output.

I use the following passes (which seem to be fairly typical):

Syntax analysis (lexing and parsing)
Name resolution (a recent introduction for me)
Type analysis (static type checks and coercions, constant folding)
Code generation (to intermediate code or to byte-code)
Final pass (from intermediate code to the target code)

Usually invoked one after the other for the entire module, where a
compile-time expressions is needed, then the first three passes have to be
done immediately (and the result had better be a constant value..)

I use the same structure now when generating byte-code (originally such a
compiler was just single-pass). Because such code is usually dynamically
typed, the type analysis pass only needs a nominal amount of work, but still
takes care of a few things (l-values for example).

> some past experiments (special purpose tests), have implied that
> potentially threaded code can pull off "reasonably solid" interpreter
> performance (potentially within 3-5x of native).

Assuming you're talking about dynamic typing, I found it difficult to get
within 3-5x, unless some sort of type hinting is used, or perhaps you're
comparing with not too highly optimised native code. Or it's code that is
memory-intensive, then memory access will dominate.

--
Bartc