Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!nntp.giganews.com!news.iecc.com!nerds-end
From: Uli Kusterer <ulimakesacompiler@googlemail.com>
Newsgroups: comp.compilers
Subject: Re: Good practical language and OS agnostic text?
Date: Sat, 21 Apr 2012 11:22:01 +0200
Organization: Compilers Central
Lines: 78
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <12-04-056@comp.compilers>
References: <12-04-019@comp.compilers>
NNTP-Posting-Host: news.iecc.com
X-Trace: leila.iecc.com 1335043010 21913 64.57.183.58 (21 Apr 2012 21:16:50 GMT)
X-Complaints-To: abuse@iecc.com
NNTP-Posting-Date: Sat, 21 Apr 2012 21:16:50 +0000 (UTC)
Keywords: books, comment
Posted-Date: 21 Apr 2012 17:16:49 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:599

On 17.04.2012, at 23:28, compilers@is-not-my.name wrote:
> Guys, I'm having a bear of a time finding a good practical language
> and OS agnostic text on writing a compiler. I'm weak in math and not
> interested in the theoretical details. I want to understand the hows
> and whys of compiler writing.

Hi,

 I usually do C-style programming languages, and I'm by no means a
professional compiler writer, but I enjoy parsers, so I've been doing
stuff like this for ages and sponging up any bit of knowledge that
sounded useful.

 IMO if you know assembler or BASIC and general algorithms (i.e. you
could implement a binary tree and walk its nodes), and you can somehow
figure out what bytes your code gets compiled to (at worst by writing
a dummy assembler program and looking at what bytes show up between a
bunch of nop instructions whose bytes you know), you should be able to
at least get a basic working knowledge of how a compiler works. Just
write the naove approach of how you would design any program.

 The things that I benefited most from learning about compilers were:

- Recursive descent parsers: It's the obvious way to write a parser.
You write a function ReadProgram(), which calls ReadLine() in a loop,
which looks at the first word and then calls ReadIfStatement() if it's
"if" etc. If you find you go wrong, you add a way to either get back
to the last known good state (backtracking) or you just give an error
message. On the way you keep track of the variables in each function
and can later calculate their stack offsets once you know how many you
need etc.

- Tokenizing: Essentially grab all the words in your source text and
build an array with an entry for each so you can more quickly walk
forward and backward without having to scan individual characters. You
can also attach additional information to a token, i.e. an ID number
so you can quickly distinguish user-defined identifiers from known
identifiers like "if" or "repeat". Keep around a token's start offset
so you can report errors.

- Syntax tree: Build a tree structure that represents the parsed program. You
can then do things like evaluate constant nodes ahead of time (e.g. turn 5+4
into 9, but leave 5 + n as an addition node with a 5 and a reference to
variable 'n') by replacing groups of nodes with simpler nodes recursively.
Keep around the line number/first token offset in each node so you can report
errors.

- Code generation: A good starting point I've found is to generate a single
function. Write a little application that takes a file containing the raw
bytes of your function, load the pointer (mark it as executable if your
platform requires that), and then just call it. You write the prolog, the
epilog, and the raw instructions in between, and maybe do some labels. You may
have to remember the offset of a particular field and fill in the address
later on (e.g. for a return statement the offset to the epilog, so you don't
duplicate your clean-up code). Then advance to two functions that call each
other in the same block etc.

- Complete code generation: you generate several functions and put them in a
proper executable file, the way your OS expects. You may have to generate link
table entries ("trampolines") to call dynamically-linked system functions etc.
If you want to start simpler, write your own primitive linker just for the fun
of seeing how two functions that don't reside at fixed (relative) addresses
could call each other.

I can't really say anything about optimization on the compiled code level. I
suppose you'd build a data structure with additional information about each
instruction and then loop over it, looking for patterns that you know can be
simplified. Essentially the same as any other search-and-replace.

Is that un-computer-sciencey enough? This blog post may help with the basics
of my code generation approach:
http://orangejuiceliberationfront.com/generating-machine-code-at-runtime/ (but
it's C, and Intel, and badly wrapped by the new theme of my blog).

Cheers,
-- Uli Kusterer
http://stacksmith.com
[This is a really good summary. Thanks! -John]