Re: lexer speed, was Bison

From	BGB <cr88192@hotmail.com>
Newsgroups	comp.compilers
Subject	Re: lexer speed, was Bison
Date	2012-08-20 14:14 -0500
Organization	albasani.net
Message-ID	<12-08-011@comp.compilers> (permalink)
References	<12-08-005@comp.compilers> <12-08-006@comp.compilers> <12-08-008@comp.compilers>

Show all headers | View raw

On 8/19/2012 7:01 PM, Hans-Peter Diettrich wrote:
>> [Compilers spend a lot of time in the lexer, because that's the only
>> phase that has to look at the input one character at a time. -John]
>
> When the source code resides in a memory buffer, the time for reading
> e.g. the characters of an identifier (in the lexer) is neglectable vs.
> the time spent in lookup and entering the identifier into a symbol table
> (in the parser).
>
> Even if a lexer reads single characters from a file, most OSs maintain
> their own file buffer, so that little overhead is added over the
> program-buffered solution.
>
> I really would like to see some current benchmarks about the behaviour
> of current compilers and systems.

My experiences are similar to those John is describing.

Basically, given the lexer/tokenizer processes things a character at a
time, it itself can end up being the bulk of the parsing time.

Usually this is only really noticable though in cases where the
tokenizer has to deal with a fairly large amount of text, such as what
happens when writing a C parser (and suddenly one may find itself with
10+ MB of preprocessor output to churn through).

My assembler had a similar issue, although it is generally fast enough
(and ASM code is typically small enough) that this isn't really a major
issue.

granted, for a C style parser, how this compares with having to check
whether or not an identifier is a type-name, is a secondary issue.

things like lookups can be potentially fairly expensive, but this can be
largely resolved via the use of hash tables, say, using a hash table to
identify keywords (mapping the name to a keyword ID-number or similar)
and lookup type-names (1).

my scripting language partly avoids this issue in the parser by making
most statements start with a keyword (pretty much anything that doesn't
start with a keyword is an expression), and identifying type-names by
the use of the language syntax (making them functionally no different
than normal identifiers). (OTOH... the script-language parser is a bit
less efficient, and works primarily via identifying statement types via
if/else logic and using "!strcmp()" to match keywords).

1: actually, my C parser uses a pretty big hash for type-name lookup,
namely 16k entry IIRC, whereas for most other things 1024 or 4096 entry
is plenty sufficient (for a chain-hash). the main reason for this is the
large number of typedefs in headers like "windows.h", which can easily
saturate a 1024 or 4096 entry hash.

I have used bigger hash tables though, namely for things like interning
strings (64k) and dynamic+interface method-dispatch (256k, but this one
is an open-address hash).

> DoDi
> [The benchmarks I did were a while ago, but they showed a large
> fraction of time in the lexer.  I wouldn't disagree that building the
> symbol table is slow, but figure out some estimate of the ratio of
> the number of characters in a source file to the number of tokens,
> and that is a rough estimate of how much slower the lexer will be
> than the parser. I agree that some current analyses would be useful.
> -John]
>

yep.

can't compare exactly, as my parsers tend to be recursive-descent and
build ASTs directly.

Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support hsad005@gmail.com - 2012-08-17 11:22 -0700
  Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support Hans-Peter Diettrich <DrDiettrich1@aol.com> - 2012-08-18 10:13 +0100
    Re: lexer speed, was Bison Hans-Peter Diettrich <DrDiettrich1@aol.com> - 2012-08-20 01:01 +0100
      Re: lexer speed, was Bison Hans-Peter Diettrich <DrDiettrich1@aol.com> - 2012-08-20 16:14 +0100
      Re: lexer speed, was Bison BGB <cr88192@hotmail.com> - 2012-08-20 14:14 -0500
        Re: lexer speed, was Bison Hans-Peter Diettrich <DrDiettrich1@aol.com> - 2012-08-21 07:40 +0100
      Re: lexer speed, was Bison "BartC" <bc@freeuk.com> - 2012-08-21 17:39 +0100
    Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-08-20 13:35 +0000
      Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support BGB <cr88192@hotmail.com> - 2012-08-21 14:45 -0500
        Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support "BartC" <bc@freeuk.com> - 2012-08-22 14:04 +0100
          Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support BGB <cr88192@hotmail.com> - 2012-08-26 19:37 -0500
            Bison deterministic LALR parser for Java/C++ "BartC" <bc@freeuk.com> - 2012-08-29 22:03 +0100
              speeding up C recompilation, was Re: Bison deterministic LALR BGB <cr88192@hotmail.com> - 2012-09-04 13:45 -0500
              Re: C include handling, was Bison deterministic LALR Marco van de Voort <marcov@toad.stack.nl> - 2012-09-05 08:40 +0000
  Re: Bison deterministic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support hsad005@gmail.com - 2012-08-18 02:09 -0700

csiph-web