Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #722 > unrolled thread
| Started by | hsad005@gmail.com |
|---|---|
| First post | 2012-08-17 11:22 -0700 |
| Last post | 2012-08-18 02:09 -0700 |
| Articles | 15 — 6 participants |
Back to article view | Back to comp.compilers
Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support hsad005@gmail.com - 2012-08-17 11:22 -0700
Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support Hans-Peter Diettrich <DrDiettrich1@aol.com> - 2012-08-18 10:13 +0100
Re: lexer speed, was Bison Hans-Peter Diettrich <DrDiettrich1@aol.com> - 2012-08-20 01:01 +0100
Re: lexer speed, was Bison Hans-Peter Diettrich <DrDiettrich1@aol.com> - 2012-08-20 16:14 +0100
Re: lexer speed, was Bison BGB <cr88192@hotmail.com> - 2012-08-20 14:14 -0500
Re: lexer speed, was Bison Hans-Peter Diettrich <DrDiettrich1@aol.com> - 2012-08-21 07:40 +0100
Re: lexer speed, was Bison "BartC" <bc@freeuk.com> - 2012-08-21 17:39 +0100
Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-08-20 13:35 +0000
Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support BGB <cr88192@hotmail.com> - 2012-08-21 14:45 -0500
Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support "BartC" <bc@freeuk.com> - 2012-08-22 14:04 +0100
Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support BGB <cr88192@hotmail.com> - 2012-08-26 19:37 -0500
Bison deterministic LALR parser for Java/C++ "BartC" <bc@freeuk.com> - 2012-08-29 22:03 +0100
speeding up C recompilation, was Re: Bison deterministic LALR BGB <cr88192@hotmail.com> - 2012-09-04 13:45 -0500
Re: C include handling, was Bison deterministic LALR Marco van de Voort <marcov@toad.stack.nl> - 2012-09-05 08:40 +0000
Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support hsad005@gmail.com - 2012-08-18 02:09 -0700
| From | hsad005@gmail.com |
|---|---|
| Date | 2012-08-17 11:22 -0700 |
| Subject | Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support |
| Message-ID | <12-08-005@comp.compilers> |
I need to write a parser for a programming langauge which is as complex as C++/Java, and to even complicate the matter, there are constructs in this langauge that doesn't allow me to use type/identifier dis-ambiguating lexer hack. In other words, I will have to return just one lexical token (say IDENTIFIER) from the lexer for both type references as well as non-type variable references. Given these restrictions, I was wondering if it would be a good idea to pick yacc/bison for my parser...? Or, should I consider a hand written recursive descent parser. Regards. [Get it working in bison, then in the unlikely event that's not fast enough, profile your compiler to see where it's spending its time and fix what needs to be fixed. Although in theory GLR can be very slow, in practice the ambiguities are generally resolved within a few tokens and the performance is fine. compilers always spend way more time in the lexer than the parser anyway. Writing RD parsers by hand can be fun, but you never know what language it actually parses. -John]
[toc] | [next] | [standalone]
| From | Hans-Peter Diettrich <DrDiettrich1@aol.com> |
|---|---|
| Date | 2012-08-18 10:13 +0100 |
| Message-ID | <12-08-006@comp.compilers> |
| In reply to | #722 |
hsad005@gmail.com schrieb: > I need to write a parser for a programming langauge which is as > complex as C++/Java, and to even complicate the matter, there are > constructs in this langauge that doesn't allow me to use > type/identifier dis-ambiguating lexer hack. Why don't you fix your language, and remove such ambiguities? Look at Pascal or other Wirthian languages... > In other words, I will > have to return just one lexical token (say IDENTIFIER) from the lexer > for both type references as well as non-type variable references. This shouldn't be a big problem, as long as the parser does not rely on such a distinction. Once a symbol has been defined, it can contain some indication about its nature. > Given these restrictions, I was wondering if it would be a good idea > to pick yacc/bison for my parser...? Or, should I consider a hand > written recursive descent parser. I don't see how this decision is related to above problem. > Regards. > [Get it working in bison, then in the unlikely event that's not fast > enough, profile your compiler to see where it's spending its time and > fix what needs to be fixed. Although in theory GLR can be very slow, > in practice the ambiguities are generally resolved within a few tokens > and the performance is fine. compilers always spend way more time in > the lexer than the parser anyway. Writing RD parsers by hand can be > fun, but you never know what language it actually parses. -John] There exist parser generators for several models. I also doubt that - except in misdesigned C-ish languages - a compiler spends significant time in the lexer. This may be true for dummy parsers, which do nothing but syntax checks, but not for compilers with code generation, optimization and more. DoDi [Compilers spend a lot of time in the lexer, because that's the only phase that has to look at the input one character at a time. -John]
[toc] | [prev] | [next] | [standalone]
| From | Hans-Peter Diettrich <DrDiettrich1@aol.com> |
|---|---|
| Date | 2012-08-20 01:01 +0100 |
| Subject | Re: lexer speed, was Bison |
| Message-ID | <12-08-008@comp.compilers> |
| In reply to | #723 |
> [Compilers spend a lot of time in the lexer, because that's the only > phase that has to look at the input one character at a time. -John] When the source code resides in a memory buffer, the time for reading e.g. the characters of an identifier (in the lexer) is neglectable vs. the time spent in lookup and entering the identifier into a symbol table (in the parser). Even if a lexer reads single characters from a file, most OSs maintain their own file buffer, so that little overhead is added over the program-buffered solution. I really would like to see some current benchmarks about the behaviour of current compilers and systems. DoDi [The benchmarks I did were a while ago, but they showed a large fraction of time in the lexer. I wouldn't disagree that building the symbol table is slow, but figure out some estimate of the ratio of the number of characters in a source file to the number of tokens, and that is a rough estimate of how much slower the lexer will be than the parser. I agree that some current analyses would be useful. -John]
[toc] | [prev] | [next] | [standalone]
| From | Hans-Peter Diettrich <DrDiettrich1@aol.com> |
|---|---|
| Date | 2012-08-20 16:14 +0100 |
| Subject | Re: lexer speed, was Bison |
| Message-ID | <12-08-010@comp.compilers> |
| In reply to | #725 |
Hans-Peter Diettrich schrieb: >> [Compilers spend a lot of time in the lexer, because that's the only >> phase that has to look at the input one character at a time. -John] > > When the source code resides in a memory buffer, the time for reading > e.g. the characters of an identifier (in the lexer) is neglectable vs. > the time spent in lookup and entering the identifier into a symbol table > (in the parser). > > Even if a lexer reads single characters from a file, most OSs maintain > their own file buffer, so that little overhead is added over the > program-buffered solution. > > I really would like to see some current benchmarks about the behaviour > of current compilers and systems. > > DoDi > [The benchmarks I did were a while ago, but they showed a large > fraction of time in the lexer. I wouldn't disagree that building the > symbol table is slow, but figure out some estimate of the ratio of > the number of characters in a source file to the number of tokens, > and that is a rough estimate of how much slower the lexer will be > than the parser. I agree that some current analyses would be useful. > -John] Your estimation suggests that the benchmarked parser does not much more than syntax checking the tokens. In this case I agree that a parser automaton does less work (state transitions) than a lexer automaton (or equivalents), and that detailed benchmarking is not required to proof this fact. I still disagree that "Compilers spend a lot of time in the lexer". My point is that a real *compiler* does not normally spend significant time in neither the lexer nor the parser[1]. In so far it's okay to "profile your compiler to see where it's spending its time and fix what needs to be fixed", as you said :-) [1] The timing depends heavily on what tasks are assigned to the parser itself. Is code generation part of the parser, or part of a subsequent (optimization...) stage? What about symbol table (scopes) management and lookup, and an eventual preprocessor which IMO in an C compiler consumes more time than the lexer and parser altogether? But further discussion of these academic issues doesn't help in spotting the real bottlenecks of any concrete compiler ;-) DoDi [Perhaps you could do some experiments and tell us whether it confirms your intuition. Like I said, when I profiled some compilers, I was surprised to see how much time they spent grinding through the characters in the input. -John]
[toc] | [prev] | [next] | [standalone]
| From | BGB <cr88192@hotmail.com> |
|---|---|
| Date | 2012-08-20 14:14 -0500 |
| Subject | Re: lexer speed, was Bison |
| Message-ID | <12-08-011@comp.compilers> |
| In reply to | #725 |
On 8/19/2012 7:01 PM, Hans-Peter Diettrich wrote: >> [Compilers spend a lot of time in the lexer, because that's the only >> phase that has to look at the input one character at a time. -John] > > When the source code resides in a memory buffer, the time for reading > e.g. the characters of an identifier (in the lexer) is neglectable vs. > the time spent in lookup and entering the identifier into a symbol table > (in the parser). > > Even if a lexer reads single characters from a file, most OSs maintain > their own file buffer, so that little overhead is added over the > program-buffered solution. > > I really would like to see some current benchmarks about the behaviour > of current compilers and systems. My experiences are similar to those John is describing. Basically, given the lexer/tokenizer processes things a character at a time, it itself can end up being the bulk of the parsing time. Usually this is only really noticable though in cases where the tokenizer has to deal with a fairly large amount of text, such as what happens when writing a C parser (and suddenly one may find itself with 10+ MB of preprocessor output to churn through). My assembler had a similar issue, although it is generally fast enough (and ASM code is typically small enough) that this isn't really a major issue. granted, for a C style parser, how this compares with having to check whether or not an identifier is a type-name, is a secondary issue. things like lookups can be potentially fairly expensive, but this can be largely resolved via the use of hash tables, say, using a hash table to identify keywords (mapping the name to a keyword ID-number or similar) and lookup type-names (1). my scripting language partly avoids this issue in the parser by making most statements start with a keyword (pretty much anything that doesn't start with a keyword is an expression), and identifying type-names by the use of the language syntax (making them functionally no different than normal identifiers). (OTOH... the script-language parser is a bit less efficient, and works primarily via identifying statement types via if/else logic and using "!strcmp()" to match keywords). 1: actually, my C parser uses a pretty big hash for type-name lookup, namely 16k entry IIRC, whereas for most other things 1024 or 4096 entry is plenty sufficient (for a chain-hash). the main reason for this is the large number of typedefs in headers like "windows.h", which can easily saturate a 1024 or 4096 entry hash. I have used bigger hash tables though, namely for things like interning strings (64k) and dynamic+interface method-dispatch (256k, but this one is an open-address hash). > DoDi > [The benchmarks I did were a while ago, but they showed a large > fraction of time in the lexer. I wouldn't disagree that building the > symbol table is slow, but figure out some estimate of the ratio of > the number of characters in a source file to the number of tokens, > and that is a rough estimate of how much slower the lexer will be > than the parser. I agree that some current analyses would be useful. > -John] > yep. can't compare exactly, as my parsers tend to be recursive-descent and build ASTs directly.
[toc] | [prev] | [next] | [standalone]
| From | Hans-Peter Diettrich <DrDiettrich1@aol.com> |
|---|---|
| Date | 2012-08-21 07:40 +0100 |
| Subject | Re: lexer speed, was Bison |
| Message-ID | <12-08-012@comp.compilers> |
| In reply to | #728 |
BGB schrieb: > 1: actually, my C parser uses a pretty big hash for type-name lookup, > namely 16k entry IIRC, whereas for most other things 1024 or 4096 entry > is plenty sufficient (for a chain-hash). The global (top level) symbol table can become very big, in detail in C compilers. A separate table for type names is not required. > the main reason for this is the > large number of typedefs in headers like "windows.h", which can easily > saturate a 1024 or 4096 entry hash. Right. AFAIR I added capabilities to skip already processed header files, at least when a guard is detected. Such (not fully conforming) behaviour can influence the time spent with loading the same files moreoften. > I have used bigger hash tables though, namely for things like interning > strings (64k) and dynamic+interface method-dispatch (256k, but this one > is an open-address hash). > > >> DoDi >> [The benchmarks I did were a while ago, but they showed a large >> fraction of time in the lexer. I wouldn't disagree that building the >> symbol table is slow, but figure out some estimate of the ratio of >> the number of characters in a source file to the number of tokens, >> and that is a rough estimate of how much slower the lexer will be >> than the parser. I agree that some current analyses would be useful. >> -John] >> > > yep. > > can't compare exactly, as my parsers tend to be recursive-descent and > build ASTs directly. The same for my compilers. My C to Pascal converter uses an LL(1) parser, with only one exception where LL(2) lookahead is required. Expressions are parsed differently, using an table-driven parser for operator precedence and associativity. The preprocessor is implemented in a couple of filter stages between the lexer and parser. P.S.: When the time for loading source files can be separated from other lexer tasks, I'd expect that this operation takes most of the time. DoDi
[toc] | [prev] | [next] | [standalone]
| From | "BartC" <bc@freeuk.com> |
|---|---|
| Date | 2012-08-21 17:39 +0100 |
| Subject | Re: lexer speed, was Bison |
| Message-ID | <12-08-013@comp.compilers> |
| In reply to | #725 |
"Hans-Peter Diettrich" <DrDiettrich1@aol.com> wrote in message >> [Compilers spend a lot of time in the lexer, because that's the only >> phase that has to look at the input one character at a time. -John] > > When the source code resides in a memory buffer, the time for reading > e.g. the characters of an identifier (in the lexer) is neglectable vs. > the time spent in lookup and entering the identifier into a symbol table > (in the parser). > > Even if a lexer reads single characters from a file, most OSs maintain > their own file buffer, so that little overhead is added over the > program-buffered solution. > > I really would like to see some current benchmarks about the behaviour > of current compilers and systems. I was recently testing a new language with a bare lexer for a near-identical syntax. This test, which excluded file input and any identifer lookups (just tokenising), ran at 17M chars/second, 3M tokens/second, and some 800K lines/second, on a low-end desktop PC. At that rate, tokenising the entire compiler might take 25msec. If that represented the bulk of the compilation time, then I'd be quite happy! But I suspect it will be just a fraction. While lexers do have to deal with a character at a time, the processing might be quite simple (eg. merely six instructions for each character of an identifier in my version). Oh, I should mention that this lexer is written in an interpreted, dynamically typed language. That means a native code version might process some 20M tokens per second (3 or 4 msec to scan the entire compiler). That doesn't really suggest it's going to be a bottleneck! > [The benchmarks I did were a while ago, but they showed a large > fraction of time in the lexer. ... -John] Testing an older compiler (which doesn't lend itself to isolating just the bare tokeniser), showed it spent about 50% of compilation time in tokenising, lookups and parsing. This generates simple byte-code, so with a more sophisticated one with more passes, optimisation and so on, it would likely be even less. (But it's also possible my compilers are highly inefficient apart from the tokenising parts..) -- Bartc
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2012-08-20 13:35 +0000 |
| Message-ID | <12-08-009@comp.compilers> |
| In reply to | #723 |
Hans-Peter Diettrich <DrDiettrich1@aol.com> writes: >> [Get it working in bison, then in the unlikely event that's not fast >> enough, profile your compiler to see where it's spending its time and >> fix what needs to be fixed. Although in theory GLR can be very slow, >> in practice the ambiguities are generally resolved within a few tokens >> and the performance is fine. compilers always spend way more time in >> the lexer than the parser anyway. Writing RD parsers by hand can be >> fun, but you never know what language it actually parses. -John] > >There exist parser generators for several models. I also doubt that - >except in misdesigned C-ish languages - a compiler spends significant >time in the lexer. This may be true for dummy parsers, which do >nothing but syntax checks, but not for compilers with code generation, >optimization and more. Code generation and optimization do not change the relation between the time spent in scanning and in parsing. Moreover, if the compiler spends most of the time in code generation, optimization and/or "more", there is even less reason to worry about parsing speed. - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/
[toc] | [prev] | [next] | [standalone]
| From | BGB <cr88192@hotmail.com> |
|---|---|
| Date | 2012-08-21 14:45 -0500 |
| Message-ID | <12-08-014@comp.compilers> |
| In reply to | #726 |
On 8/20/2012 8:35 AM, Anton Ertl wrote:
> Hans-Peter Diettrich <DrDiettrich1@aol.com> writes:
>>> [Get it working in bison, then in the unlikely event that's not fast
>>> enough, profile your compiler to see where it's spending its time and
>>> fix what needs to be fixed. Although in theory GLR can be very slow,
>>> in practice the ambiguities are generally resolved within a few tokens
>>> and the performance is fine. compilers always spend way more time in
>>> the lexer than the parser anyway. Writing RD parsers by hand can be
>>> fun, but you never know what language it actually parses. -John]
>>
>> There exist parser generators for several models. I also doubt that -
>> except in misdesigned C-ish languages - a compiler spends significant
>> time in the lexer. This may be true for dummy parsers, which do
>> nothing but syntax checks, but not for compilers with code generation,
>> optimization and more.
>
> Code generation and optimization do not change the relation between
> the time spent in scanning and in parsing. Moreover, if the compiler
> spends most of the time in code generation, optimization and/or
> "more", there is even less reason to worry about parsing speed.
(sorry in advance for sort of wandering off on a tangent...).
a major thing that I think skews things so much in the case of C is just
how much stuff can get pulled in by the preprocessor, the bulk of which
ends up quickly being discarded by subsequent stages (and much of the
rest is only minimally processed by later stages).
actually, a fairly large portion of the time may actually be spent in
the preprocessor, which may basically sit around, for each source line:
splitting it into tokens;
checking if any of the identifiers in the source-line are names for
macros (and expanding as necessary);
repeating the above until no more macro expansions are seen;
recomposing the line and spitting it into the output.
then, for example, there are typically large numbers of typedefs and
structure definitions, very few of which may end up actually being used
by the code in question (most just get put into a table and then forgotten).
then, after all this, the compiler is typically left looking at a fairly
small chunk of (actual) code.
if we spend 1000ms preprocessing and parsing the code, and 10ms or 100ms
compiling it, where has most of the time gone?...
OTOH, in my scripting language (which doesn't use headers), I have
generally been looking at millisecond-range compile times.
although admittedly most of my compilers have tended to be pretty dumb,
typically working like (for the upper end):
split into tokens and parse the code directly into an AST;
perform basic simplifications on the AST (evaluating constant expressions);
perform basic type-analysis / inference (generally forward propagation
driven by declarations and assignment operations);
flatten this out into a bytecode-style format (binary serialization is
optional).
this is then currently followed by a backend which translates the
bytecode into threaded-code and runs this (generally also pretty fast,
and generally functions/methods are translated on call).
my code-generator backend was a little more involved, basically doing
its own type-analysis, mapping the stack to registers and temporaries,
and using peephole optimization. this backend fell into disuse mostly
because I was having a hard time really debugging it, and it was eating
up a lot of my time trying to do so (threaded code generally being a
simpler option).
unlike many other compilers though, my stuff is generally based on a
stack-machine model, so pretty much everything (from intermediate
values, to handling local variables and environment scopes) is handled
using a stack-machine model.
for example, in my bytecode format, local variables (and function
arguments, and parent scopes) are identified by their position on the
variable stack (with declarations pushing new variables onto the stack,
and other operations popping them back off).
note that my bytecode is lexically scoped, but tends to see the entire
lexical environment as a big stack (which may actually be split across
multiple call frames, or have parts be allocated on the heap as well).
there are also the dynamic, object ("this" scope), and toplevel
environments, but lookups into these are given by name or by
name+signature, rather than by an explicit stack index as in the case of
the lexical environment. note that packages/import build on delegation,
which is built on the object scope (packages are actually implemented as
objects, and imports are actually lexical variables used to delegate to
these objects).
but, in general, I just seem to find the stack model generally a lot
easier to understand and work with.
my code generator also tied register allocation and lifetime to these
stacks as well, although later on it had added temporaries which were
identified by handle and had an independent lifetime, rather than being
tied to the stack-machine model (and were, ironically, much closer to
how the CPU stack was actually being utilized by the code generator, as
the evaluation stack and local/environment stack were not mapped 1:1
with the CPU stack, whereas the temp-vars were keyed 1:1 with their
physical storage).
however, it was introducing temporaries and registers which existed
independently of the stack-machine model which is where the pain began.
if I ever get back to working on or writing a new native code generator,
I will probably just put the temporaries on their own stack, thus any
temporary entering its lifetime is pushed onto the temp-var stack, and
popped off when its lifetime ends, mostly in the name of being less of a
PITA to work with.
note that, in any case, the actual stack-frame (at the machine-level) is
basically a fixed-size region which is divided up and allocated (as
appropriate) for things like holding variables and evaluation-stack
entries (and treated as an array of machine-words allocated using a bitmap).
note that some minor ABI hacks were made to allow for things like
lexical scoping, but otherwise it was using the same ABI as C (C style
block functions being callable directly as C function pointers).
currently, the threaded-code uses a more narrowly defined subset of the
C ABI, basically where all calls have exactly 2 arguments: the VM
context and the current threaded-opcode context. this, combined with
most of the logic being interpreter-style and written in C, does a bit
of harm to the performance though (but, it is still "fast enough" for
general application scripting tasks).
but, hopefully with hope and luck, I will be able to justify the effort
to be able to return to using a full-featured JIT.
granted, I could always just be like, "hell with it, I am just going to
do it..." (whether or not it is worthwhile), which sort of ends up being
invoked occasionally (for things where I can't really justify doing, but
want to do them anyways...).
[toc] | [prev] | [next] | [standalone]
| From | "BartC" <bc@freeuk.com> |
|---|---|
| Date | 2012-08-22 14:04 +0100 |
| Message-ID | <12-08-015@comp.compilers> |
| In reply to | #731 |
"BGB" <cr88192@hotmail.com> wrote in message > On 8/20/2012 8:35 AM, Anton Ertl wrote: >> Code generation and optimization do not change the relation between >> the time spent in scanning and in parsing. Moreover, if the compiler >> spends most of the time in code generation, optimization and/or >> "more", there is even less reason to worry about parsing speed. > > > (sorry in advance for sort of wandering off on a tangent...). > > > a major thing that I think skews things so much in the case of C is just > how much stuff can get pulled in by the preprocessor, the bulk of which > ends up quickly being discarded by subsequent stages (and much of the > rest is only minimally processed by later stages). I think much of that is the fault of the language and over-reliance on the pre-processor. Instead of relying on neat language features to minimise the sizes of headers, everything has to be spelled out in a long-winded way. Lack of proper namespaces (in lists of enumerations for example) means identifiers are long and unwieldy, which doesn't help. > if we spend 1000ms preprocessing and parsing the code, and 10ms or 100ms > compiling it, where has most of the time gone?... But even given all that, there are ways of dealing with huge header files so that it is not necessary to repeatedly tokenise and parse the same headers over and over again (for recompiling the same module, or compiling many modules all sharing the same headers). I've no idea whether many C compilers actually bother though; perhaps it's easier to just recommend a faster computer.. > OTOH, in my scripting language (which doesn't use headers), I have > generally been looking at millisecond-range compile times. Well, that's more like it. Compilation time simply shouldn't be an issue, not for compiling a single module anyway. And has never been a problem for me, ever, no matter what hardware I was using, partly thanks to using my own tools. I only get a delay when there are interdependencies and I just recompile everything for simplicity. Then I might have to twiddle my thumbs for 5-10 seconds. But then, I now have to rely on some external tools.. > although admittedly most of my compilers have tended to be pretty dumb, > typically working like (for the upper end): > split into tokens and parse the code directly into an AST; > perform basic simplifications on the AST (evaluating constant > expressions); > perform basic type-analysis / inference (generally forward propagation > driven by declarations and assignment operations); > flatten this out into a bytecode-style format (binary serialization is > optional). > > this is then currently followed by a backend which translates the > bytecode into threaded-code and runs this (generally also pretty fast, > and generally functions/methods are translated on call). I have two current compilers, one native code, and this one for bytecode for a dynamically typed, non-type-inferred language: source -> lexer/parser -> types -> code generator -> optim -> binary bytecode file The type pass does very little here, mainly checking l-values and reducing constant expressions. The optim pass does very little too, just reducing some common bytecode combinations into one. Nevertheless, the resulting bytecode, even with a straightforward, non JIT-ing interpreter, can make a basic lexer run at some 3M tokens/second as I mentioned in another post. The native code compiler is more involved (I hate these ones! But I need one to implement the interpreter): source -> lexer/parser -> names -> types -> intermediate code generator -> target code generator -> optim -> asm source file The optim stage does some peephole stuff, but haven't gone as far as having some variables allocated to registers. Last time I checked, it was perhaps 40-50% slower than gcc-O1, averaged over ~20 integer/floating-point-intensive benchmarks. That might do me. -- Bartc [I've seen C compilers that keep preparsed versions of headers. Dunno what they do with #if. Also see Microsoft's C# and other .NET languages, that put all of the type info in the objects, so you can use the object as a compiled include file. -John]
[toc] | [prev] | [next] | [standalone]
| From | BGB <cr88192@hotmail.com> |
|---|---|
| Date | 2012-08-26 19:37 -0500 |
| Message-ID | <12-08-018@comp.compilers> |
| In reply to | #732 |
(this post had basically been sitting around several days unfinished and
unsent, mostly as I was more busy with other stuff...).
On 8/22/2012 8:04 AM, BartC wrote:
> "BGB" <cr88192@hotmail.com> wrote in message
>> On 8/20/2012 8:35 AM, Anton Ertl wrote:
>
>>> Code generation and optimization do not change the relation between
>>> the time spent in scanning and in parsing. Moreover, if the compiler
>>> spends most of the time in code generation, optimization and/or
>>> "more", there is even less reason to worry about parsing speed.
>>
>>
>> (sorry in advance for sort of wandering off on a tangent...).
>>
>>
>> a major thing that I think skews things so much in the case of C is just
>> how much stuff can get pulled in by the preprocessor, the bulk of which
>> ends up quickly being discarded by subsequent stages (and much of the
>> rest is only minimally processed by later stages).
>
> I think much of that is the fault of the language and over-reliance on the
> pre-processor. Instead of relying on neat language features to minimise the
> sizes of headers, everything has to be spelled out in a long-winded way.
> Lack of proper namespaces (in lists of enumerations for example) means
> identifiers are long and unwieldy, which doesn't help.
>
well, C is C.
other languages also have things like import, but C doesn't, so alas.
>> if we spend 1000ms preprocessing and parsing the code, and 10ms or 100ms
>> compiling it, where has most of the time gone?...
>
> But even given all that, there are ways of dealing with huge header
> files so
> that it is not necessary to repeatedly tokenise and parse the same headers
> over and over again (for recompiling the same module, or compiling many
> modules all sharing the same headers).
>
> I've no idea whether many C compilers actually bother though; perhaps it's
> easier to just recommend a faster computer..
the problem here is that, although it isn't too hard to figure out
possible optimizations, it is much harder to make them work in ways
which don't violate the C standard.
another issue is that things like precompiled headers are non-standard,
and there is no real agreed-on convention for "hey, compiler, feel free
to use precompiled headers here".
like, say, "#pragma precompiled_header" or similar (say, put before the
include guard to indicate to the compiler that this header may be used
as a precompiled header).
even if supported, it wouldn't generally be found in OS headers, meaning
it would be necessary to wrap the headers to use it.
a partial solution is to allow an option of just lumping together a
bunch of code and compiling it all at once, which can often be faster,
but isn't really applicable in many cases.
>> OTOH, in my scripting language (which doesn't use headers), I have
>> generally been looking at millisecond-range compile times.
>
> Well, that's more like it. Compilation time simply shouldn't be an issue,
> not for compiling a single module anyway.
>
> And has never been a problem for me, ever, no matter what hardware I was
> using, partly thanks to using my own tools.
>
> I only get a delay when there are interdependencies and I just recompile
> everything for simplicity. Then I might have to twiddle my thumbs for 5-10
> seconds. But then, I now have to rely on some external tools..
>
this is part of the issue.
my language instead just uses importing.
so, it is possible to "import" a module, and see whatever is declared in it.
currently, the language uses a mix of a Java and C# style model, where
importing gives the qualified name for the module, which is interpreted
as giving its path (relative to the VM's search path).
within each module, a "package" may be declared, which basically puts
the contents of the module into a given package. functionally, this is
more similar to the use of "namespace" in a language like C#. otherwise,
whatever is declared in the module is placed in the toplevel (it is also
possible to put contents in other packages as well).
there are edge cases here, so modifiers end up being needed both for
package and import directives, mostly to "fine-tune" the behavior.
the current default behavior though is "load the module if-needed, using
its qname, and then try to import the package named by the qname".
currently, there is no direct analogue of Java's "import everything in a
directory" semantics, but I have considered adding something like this,
only I have not yet gotten around to it.
but, in any case, since loading modules doesn't require processing a
small mountain of text in the process, it tends to go a lot faster (and
puts a lot less strain on the compiler/VM as well).
>> although admittedly most of my compilers have tended to be pretty dumb,
>> typically working like (for the upper end):
>> split into tokens and parse the code directly into an AST;
>> perform basic simplifications on the AST (evaluating constant
>> expressions);
>> perform basic type-analysis / inference (generally forward propagation
>> driven by declarations and assignment operations);
>> flatten this out into a bytecode-style format (binary serialization is
>> optional).
>>
>> this is then currently followed by a backend which translates the
>> bytecode into threaded-code and runs this (generally also pretty fast,
>> and generally functions/methods are translated on call).
>
> I have two current compilers, one native code, and this one for bytecode
> for
> a dynamically typed, non-type-inferred language:
>
> source -> lexer/parser -> types -> code generator -> optim -> binary
> bytecode file
>
> The type pass does very little here, mainly checking l-values and reducing
> constant expressions. The optim pass does very little too, just reducing
> some common bytecode combinations into one. Nevertheless, the resulting
> bytecode, even with a straightforward, non JIT-ing interpreter, can make a
> basic lexer run at some 3M tokens/second as I mentioned in another post.
>
> The native code compiler is more involved (I hate these ones! But I need
> one
> to implement the interpreter):
>
> source -> lexer/parser -> names -> types -> intermediate code generator ->
> target code generator -> optim -> asm source file
>
> The optim stage does some peephole stuff, but haven't gone as far as having
> some variables allocated to registers. Last time I checked, it was
> perhaps 40-50% slower than gcc-O1, averaged over ~20
> integer/floating-point-intensive benchmarks. That might do me.
>
for my C compiler, I had a bytecode intermediate stage, but no bytecode
interpreter (apart from a minor oddity that the optimizer logic could be
used itself as a limited interpreter).
this bytecode was then translated into native code.
however, this was not well maintained, and the bytecode format in
question is no longer in use (though the bytecode used by my scripting
language is a "close sibling", it has gone in a different direction in
several ways, and currently only ever goes into being threaded-code).
then I am basically stuck with things as-is, as I am sadly too busy and
raw performance isn't a big enough priority to justify either compiling
my current live bytecode fully into native code, or for that matter to
revive my C compiler.
so, for the C compiler it was basically like:
( preprocessor -> tokenizer+parser -> AST reducer -> bytecode-compiler
-> (RPNIL) )
->
( (RPNIL) -> codegen passes -> (ASM) -> assembler -> (COFF) ->
runtime-linker )
the codegen basically was a multi-pass stack machine.
any values created were pushed to the stack, and operations would pop
values and push results (internally mapped to register or memory
operations). in the compiler, the compiler would "contain" any registers
or variables, rather than representing a physical memory region. entries
on the main value-stack were regarded as immutable. type-analysis was
also performed on the stack.
if values turned out to be constant here, then they would be evaluated
directly (and the literal value being placed on the stack instead).
for the scripting VM, the model is a little different, and type analysis
has been moved largely before emitting bytecode, so in this case it is a
little more like with the JVM or similar.
so, for my scripting VM:
( tokenizer+parser -> AST reducer -> type-inference -> bytecode-compiler
-> (BSIL) )
->
( (BSIL) -> threaded-converter -> (linked-list indirect threaded-code) )
what differences do exist make interfacing my C frontend to my scripting
backend a bit problematic though. a big issue at the moment is that
naive translation of C ASTs to the bytecode would produce giant modules
(with all the header contents as well) and there is no "good" way to
eliminate this after-the-fact.
as-is, in the script VM bytecode, any structures/typedefs/prototypes/...
end up directly in the bytecode, and are "constructed" by executing the
toplevel, which is a bad scenario for C.
otherwise, my VM currently handles C metadata by storing it in a
database-format, which would likely need to be used along with the
bytecode in this case (and omitting all of these declarations from the
toplevel). normally, these databases are used by the VM to annotate
natively-compiled code, and are currently inserted into the DLL
post-compile via a tool (technically, my C compiler front-end is used to
perform analysis and build these databases, from headers, but a native
compiler produces the actual machine-code).
this would likely mean producing object modules containing both bytecode
and a database, and likely having a "link" stage which merges the
database contents.
not that any of this can't be done though, but it is harder to justify
doing so...
> --
> Bartc
> [I've seen C compilers that keep preparsed versions of headers. Dunno
> what they do with #if. Also see Microsoft's C# and other .NET languages,
> that put all of the type info in the objects, so you can use the object
> as a compiled include file. -John]
>
AFAIK, the preparsed/precompiled headers for C generally handle #if and
#ifdef and similar during the preprocessor as usual. this seems to be a
large part of why there are many restrictions on the use of precompiled
headers in those compilers which support them.
AFAICT, languages like C# delay commands like #if or #ifdef until later
(and impose restrictions on how they may be used). IIRC, they are
generally handled at linking or at JIT.
although it "could" be possible to move #if and friends into a later
compilation stage, it is not really possible to do this without
violating the C standard in the process.
one possible compromise could be to introduce another compiler
extension, such as:
#define_target /name/
which would basically say to the compiler "this particular define will
be handled on a per-target basis" and thus delay certain defines to a
later compilation stage.
say:
#define_target FOO
...
#ifdef FOO
...
#endif
could have the preprocessor spit out something like:
__target_ifdef(FOO) {
...
};
but would retain the normal behavior for use of conventional #define's
(and thus generally avoid breaking C standard conformance, or breaking
code which uses macros in "clever" ways to alter the syntax, 1).
1: consider examples like this:
#ifdef SOMEFEATURE
#define SOMEFEATURE_BEGIN
#define SOMEFEATURE_END
#else
#define SOMEFEATURE_BEGIN /##*
#define SOMEFEATURE_END *##/
#endif
or something as mundane as:
#ifndef __cplusplus
#define EXTERN_C extern "C"
#define BEGIN_EXTERN_C EXTERN_C {
#define END_EXTERN_C }
#else
#define EXTERN_C
#define BEGIN_EXTERN_C
#define END_EXTERN_C
#endif
my script-VM actually does something vaguely along these lines, and
folds the contents of any ifdef or ifndef statements into their own
bytecode-blocks ("ifdef" and "ifndef" also have their own special
bytecode instructions). these are handled when code is converted into
threaded code (the instruction either becomes a special-call, or a no-op).
[toc] | [prev] | [next] | [standalone]
| From | "BartC" <bc@freeuk.com> |
|---|---|
| Date | 2012-08-29 22:03 +0100 |
| Subject | Bison deterministic LALR parser for Java/C++ |
| Message-ID | <12-08-020@comp.compilers> |
| In reply to | #735 |
"BGB" <cr88192@hotmail.com> wrote in message news:12-08-018@comp.compilers... > On 8/22/2012 8:04 AM, BartC wrote: >> But even given all that, there are ways of dealing with huge header files so >> that it is not necessary to repeatedly tokenise and parse the same headers >> over and over again (for recompiling the same module, or compiling many >> modules all sharing the same headers). >> >> I've no idea whether many C compilers actually bother though; perhaps >> it's easier to just recommend a faster computer.. > > the problem here is that, although it isn't too hard to figure out > possible optimizations, it is much harder to make them work in ways > which don't violate the C standard. > > another issue is that things like precompiled headers are non-standard, > and there is no real agreed-on convention for "hey, compiler, feel free > to use precompiled headers here". Why should how a compiler optimises its work violate the standard? Provided the end results are the same, the compiler can do what it likes. However the C language and the way the C compilers are typically invoked (for example just one-time to compile one module, so it's forgotten it's compiled the same sets of headers a moment before) doesn't make things easy. And it's possible that things such as __TIME__ have been used in such a way that you are obliged to recompile a header each time. So it's easy to see why compilers may not bother! Nevertheless, I think there is plenty that can be done, although I'm not sure that creating intermediate files such as precompiled headers is the way to go. It's better when a compiler is properly integrated into an IDE, then the symbol tables built by a set of headers can be made persistent much more easily. Alternatively, it might be possible to just have a very faster parser! And perhaps integrate the preprocessor so that it is not a separate pass (I haven't attempted a C compiler so not sure if that's feasible; my own source-level directives are handled by the lexer itself, or sometimes by the parser). >> [I've seen C compilers that keep preparsed versions of headers. Dunno >> what they do with #if. Also see Microsoft's C# and other .NET languages, >> that put all of the type info in the objects, so you can use the object >> as a compiled include file. -John] > > AFAIK, the preparsed/precompiled headers for C generally handle #if and > #ifdef and similar during the preprocessor as usual. this seems to be a > large part of why there are many restrictions on the use of precompiled > headers in those compilers which support them. > > AFAICT, languages like C# delay commands like #if or #ifdef until later > (and impose restrictions on how they may be used). IIRC, they are > generally handled at linking or at JIT. With a new language then it's much easier to arrange matters so that it's faster and simpler to compile. It might not even have conditional directives, or any preprocessor at all; (C needs them because it is a cruder, older language; I used to have conditional code, but no longer and in any case it seemed an unsatisfactory approach). -- Bartc
[toc] | [prev] | [next] | [standalone]
| From | BGB <cr88192@hotmail.com> |
|---|---|
| Date | 2012-09-04 13:45 -0500 |
| Subject | speeding up C recompilation, was Re: Bison deterministic LALR |
| Message-ID | <12-09-005@comp.compilers> |
| In reply to | #739 |
On 8/29/2012 4:03 PM, BartC wrote:
> "BGB" <cr88192@hotmail.com> wrote in message
>> On 8/22/2012 8:04 AM, BartC wrote:
>
>>> But even given all that, there are ways of dealing with huge header
>>> files so
>>> that it is not necessary to repeatedly tokenise and parse the same
>>> headers
>>> over and over again (for recompiling the same module, or compiling many
>>> modules all sharing the same headers).
>>>
>>> I've no idea whether many C compilers actually bother though; perhaps
>>> it's easier to just recommend a faster computer..
>>
>> the problem here is that, although it isn't too hard to figure out
>> possible optimizations, it is much harder to make them work in ways
>> which don't violate the C standard.
>>
>> another issue is that things like precompiled headers are non-standard,
>> and there is no real agreed-on convention for "hey, compiler, feel free
>> to use precompiled headers here".
>
> Why should how a compiler optimises its work violate the standard?
> Provided the end results are the same, the compiler can do what it
> likes.
basically, because the C compilation process is difficult to optimize
while not changing behavior in various ("minor") ways.
a major one is the preprocessor, where:
defines in one header may subsequently cause different expansions in
another;
pretty much anything that can be done via textual substitution is
allowed by the preprocessor;
there are macros and built-in defines which may have time-dependent or
invocation-dependent behaviors;
...
and, the standard requires that all this sort of stuff works as-expected.
> However the C language and the way the C compilers are typically
> invoked (for example just one-time to compile one module, so it's
> forgotten it's compiled the same sets of headers a moment before)
> doesn't make things easy. And it's possible that things such as
> __TIME__ have been used in such a way that you are obliged to
> recompile a header each time.
yeah.
or things as simple as:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
where basically the presence of this define does itself alter the
behavior and contents of the header.
a partial optimization though is to optimize for the case where all the
modules in a library are compiled in batches, and where all of them use
the same header-lines. then it is possible to preprocess/parse the
headers once, and then reuse them for subsequent modules.
I think I remember at least one compiler doing this (apparently Borland?).
MSVC basically uses the whole "stdafx.h" thing, where they precompile
this and have (sometimes "force") everything to include it.
GCC apparently only allows it for the case of including a single header.
my considered option was basically having a way (in the headers) to tell
the compiler that it is allowed to violate the C standard in certain
ways for sake of optimizing the compilation process.
> So it's easy to see why compilers may not bother!
yeah.
the lack of a good way to approach the problem, and the relative
simplicity of the old/slow route, leaves the old/slow strategy as the
main way most code is compiled.
> Nevertheless, I think there is plenty that can be done, although I'm
> not sure that creating intermediate files such as precompiled headers
> is the way to go. It's better when a compiler is properly integrated
> into an IDE, then the symbol tables built by a set of headers can be
> made persistent much more easily.
>
actually, caching all of the definitions/defines/... resulting from
processing a set of headers can be done (I have done so in the past,
partly as an extension intended to allow faster "eval" of C fragments,
and this is also partly how the FFI for my BGBScript language works).
however, compiling C code using these would itself likely violate the
standard in subtle ways, so likely couldn't be used as a general-purpose
optimization.
the most obvious way to use this in a "moderately standard" way would,
ironically, likely assume a form resembling that of the "stdafx.h" system.
> Alternatively, it might be possible to just have a very faster parser!
> And perhaps integrate the preprocessor so that it is not a separate
> pass (I haven't attempted a C compiler so not sure if that's feasible;
> my own source-level directives are handled by the lexer itself, or
> sometimes by the parser).
I am not entirely sure that this would buy much, but would likely be
more complicated (and using a single pass for preprocess+parse+lexing
could actually make the overall process slower).
a partial issue is that the preprocessor works line-by-line, but a C
parser does not.
combining them would likely lead to a case where there were mismatches
between what the preprocessor has currently expanded, and what the
parser needs to parse a complete statement.
so, it is generally simpler and more effective to expand everything out
prior to passing it off to the parser, but the preprocessor need not
necessarily serialize back to text:
it is possible that the preprocessor spits out basically a large array
of tokens, and the parser works off this (as a separate pass).
note that my parsers have not usually used a separate lexing and parsing
pass, but have usually broken the source-text into tokens "mid-flight",
often using a cache (covering a range of 256 bytes) to avoid reading the
same tokens multiple times.
>>> [I've seen C compilers that keep preparsed versions of headers. Dunno
>>> what they do with #if. Also see Microsoft's C# and other .NET
>>> languages,
>>> that put all of the type info in the objects, so you can use the object
>>> as a compiled include file. -John]
>>
>> AFAIK, the preparsed/precompiled headers for C generally handle #if and
>> #ifdef and similar during the preprocessor as usual. this seems to be a
>> large part of why there are many restrictions on the use of precompiled
>> headers in those compilers which support them.
>>
>> AFAICT, languages like C# delay commands like #if or #ifdef until later
>> (and impose restrictions on how they may be used). IIRC, they are
>> generally handled at linking or at JIT.
>
> With a new language then it's much easier to arrange matters so that
> it's faster and simpler to compile. It might not even have conditional
> directives, or any preprocessor at all; (C needs them because it is a
> cruder, older language; I used to have conditional code, but no longer
> and in any case it seemed an unsatisfactory approach).
well, there is probably a reason why most newer languages have not been
designed this way, but this does not eliminate the problem of most
existing code being written in C or C++ (or the general ineffectiveness
of newer languages to really do well the things that C and C++ do well).
for example, my BGBScript language does not currently use a preprocessor
(I would not entirely oppose the idea, as there are possible uses, but
currently I don't use one, and although the language has an
"ifdef"/"ifndef" construction, it is handled in the threaded-code
translation stage).
however, the language is not itself likely a good C substitute (it
wasn't really designed to fill the same roles as C, and has much more in
common with ECMAScript, although it does borrow some more aspects of C
syntax and semantics).
but, it does have a merit:
even for "relatively large" modules (10-20 kB), it seems to have compile
times in the low-millisecond range, and with a little optimization could
(probably) be pushed into the microsecond range.
(the compiler has generally been fast enough that thus far I have mostly
always been loading modules directly from source).
I have considered ideas for C-like languages could compile much faster
(and potentially also allow running in a VM), but even if they looked
like C and were 99% code-compatible, if they don't follow the C
standards, they aren't really C.
a major idea here being for a language which would look-like and be
"mostly" code-compatible with C99, but would largely change those parts
of the language which impede fast compilation and generating
portable/verifiable bytecode (actually, a lot more modest/subtle than it
may seem, it would be more like a "fake" version of C).
basic ideas: traditional includes are replaced ("#include" would
actually function more like an "import" directive), the preprocessor is
merged into the syntax and no longer works via textual substitution,
declaration parsing would work more like in C#, ...
semantics would involve "reinterpreting" the meaning of a lot of things
(for example, pointer syntax and operations would be retained, but what
they "mean" would actually be replaced with something else, ...). in
some cases, it would likely incorporate many semantics and compilation
strategies used in ECMAScript-like languages as well.
(something like this would likely be an ugly beast from a design POV
though, and one can debate the value of a language which would look like
C but be functionally and semantically somewhat different).
sadly, I don't have a justification for the time/effort such a project
would require, as my main effort is trying to get a marketable 3D game
made, such that I can hopefully have *some* form of personal income...
[toc] | [prev] | [next] | [standalone]
| From | Marco van de Voort <marcov@toad.stack.nl> |
|---|---|
| Date | 2012-09-05 08:40 +0000 |
| Subject | Re: C include handling, was Bison deterministic LALR |
| Message-ID | <12-09-006@comp.compilers> |
| In reply to | #739 |
On 2012-08-29, BartC <bc@freeuk.com> wrote: >> another issue is that things like precompiled headers are non-standard, >> and there is no real agreed-on convention for "hey, compiler, feel free >> to use precompiled headers here". > > Why should how a compiler optimises its work violate the standard? > Provided the end results are the same, the compiler can do what it > likes. True, but that is a complex problem, since processing a header starts with a preprocessor state (at the point of including), then processing it changes compiler and preprocessor state. I'm not a C expert, but it probably depends on the state of the compiler on inclusion too (e.g. when you can check if a type is already defined or not) Both the output states are dependent on (both) the input state which in turn depend on the point of inclusion. So precompiled headers should probably reduce the input state to the bare essentials, and reduce the changes to the output change accordingly. Then one should be able to check if the reduced input state matches the one in the precompiled header before it is loaded. >> AFAIK, the preparsed/precompiled headers for C generally handle #if and >> #ifdef and similar during the preprocessor as usual. this seems to be a >> large part of why there are many restrictions on the use of precompiled >> headers in those compilers which support them. >> >> AFAICT, languages like C# delay commands like #if or #ifdef until later >> (and impose restrictions on how they may be used). IIRC, they are >> generally handled at linking or at JIT. Or they use a different way of importing that zaps preprocessor/compiler state to some known global state, and doesn't allow code before #include, removing the dependency on compiler state. [The preprocessor conceptually happens before the parse, so preprocessor state doesn't depend on compiler state. But #if means that the sequence of includes affects both the preprocessor state and the compiler state. -John]
[toc] | [prev] | [next] | [standalone]
| From | hsad005@gmail.com |
|---|---|
| Date | 2012-08-18 02:09 -0700 |
| Message-ID | <12-08-007@comp.compilers> |
| In reply to | #722 |
One of my concern is if I will get anywhere with Bison/yacc based approach, given that I have the restriction that I will have to use just one lexer token for both type as well as non-type identifier refenreces. [This is a frequent issue, not hard to deal with particularly if you use GLR for the trickier cases. -John]
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web