Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!nntp.giganews.com!news.iecc.com!nerds-end From: Uli Kusterer Newsgroups: comp.compilers Subject: Re: Good practical language and OS agnostic text? Date: Sat, 21 Apr 2012 11:22:01 +0200 Organization: Compilers Central Lines: 78 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <12-04-056@comp.compilers> References: <12-04-019@comp.compilers> NNTP-Posting-Host: news.iecc.com X-Trace: leila.iecc.com 1335043010 21913 64.57.183.58 (21 Apr 2012 21:16:50 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Sat, 21 Apr 2012 21:16:50 +0000 (UTC) Keywords: books, comment Posted-Date: 21 Apr 2012 17:16:49 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:599 On 17.04.2012, at 23:28, compilers@is-not-my.name wrote: > Guys, I'm having a bear of a time finding a good practical language > and OS agnostic text on writing a compiler. I'm weak in math and not > interested in the theoretical details. I want to understand the hows > and whys of compiler writing. Hi, I usually do C-style programming languages, and I'm by no means a professional compiler writer, but I enjoy parsers, so I've been doing stuff like this for ages and sponging up any bit of knowledge that sounded useful. IMO if you know assembler or BASIC and general algorithms (i.e. you could implement a binary tree and walk its nodes), and you can somehow figure out what bytes your code gets compiled to (at worst by writing a dummy assembler program and looking at what bytes show up between a bunch of nop instructions whose bytes you know), you should be able to at least get a basic working knowledge of how a compiler works. Just write the naove approach of how you would design any program. The things that I benefited most from learning about compilers were: - Recursive descent parsers: It's the obvious way to write a parser. You write a function ReadProgram(), which calls ReadLine() in a loop, which looks at the first word and then calls ReadIfStatement() if it's "if" etc. If you find you go wrong, you add a way to either get back to the last known good state (backtracking) or you just give an error message. On the way you keep track of the variables in each function and can later calculate their stack offsets once you know how many you need etc. - Tokenizing: Essentially grab all the words in your source text and build an array with an entry for each so you can more quickly walk forward and backward without having to scan individual characters. You can also attach additional information to a token, i.e. an ID number so you can quickly distinguish user-defined identifiers from known identifiers like "if" or "repeat". Keep around a token's start offset so you can report errors. - Syntax tree: Build a tree structure that represents the parsed program. You can then do things like evaluate constant nodes ahead of time (e.g. turn 5+4 into 9, but leave 5 + n as an addition node with a 5 and a reference to variable 'n') by replacing groups of nodes with simpler nodes recursively. Keep around the line number/first token offset in each node so you can report errors. - Code generation: A good starting point I've found is to generate a single function. Write a little application that takes a file containing the raw bytes of your function, load the pointer (mark it as executable if your platform requires that), and then just call it. You write the prolog, the epilog, and the raw instructions in between, and maybe do some labels. You may have to remember the offset of a particular field and fill in the address later on (e.g. for a return statement the offset to the epilog, so you don't duplicate your clean-up code). Then advance to two functions that call each other in the same block etc. - Complete code generation: you generate several functions and put them in a proper executable file, the way your OS expects. You may have to generate link table entries ("trampolines") to call dynamically-linked system functions etc. If you want to start simpler, write your own primitive linker just for the fun of seeing how two functions that don't reside at fixed (relative) addresses could call each other. I can't really say anything about optimization on the compiled code level. I suppose you'd build a data structure with additional information about each instruction and then loop over it, looking for patterns that you know can be simplified. Essentially the same as any other search-and-replace. Is that un-computer-sciencey enough? This blog post may help with the basics of my code generation approach: http://orangejuiceliberationfront.com/generating-machine-code-at-runtime/ (but it's C, and Intel, and badly wrapped by the new theme of my blog). Cheers, -- Uli Kusterer http://stacksmith.com [This is a really good summary. Thanks! -John]