Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!border4.nntp.dca.giganews.com!border2.nntp.dca.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED!nerds-end From: BGB Newsgroups: comp.compilers Subject: Re: lexer speed, was Bison Date: Mon, 20 Aug 2012 14:14:31 -0500 Organization: albasani.net Lines: 74 Sender: johnl@iecc.com Approved: comp.compilers@iecc.com Message-ID: <12-08-011@comp.compilers> References: <12-08-005@comp.compilers> <12-08-006@comp.compilers> <12-08-008@comp.compilers> NNTP-Posting-Host: news.iecc.com Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Trace: leila.iecc.com 1345514906 93016 64.57.183.58 (21 Aug 2012 02:08:26 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Tue, 21 Aug 2012 02:08:26 +0000 (UTC) Keywords: parse, lex,performance Posted-Date: 20 Aug 2012 22:08:25 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:728 On 8/19/2012 7:01 PM, Hans-Peter Diettrich wrote: >> [Compilers spend a lot of time in the lexer, because that's the only >> phase that has to look at the input one character at a time. -John] > > When the source code resides in a memory buffer, the time for reading > e.g. the characters of an identifier (in the lexer) is neglectable vs. > the time spent in lookup and entering the identifier into a symbol table > (in the parser). > > Even if a lexer reads single characters from a file, most OSs maintain > their own file buffer, so that little overhead is added over the > program-buffered solution. > > I really would like to see some current benchmarks about the behaviour > of current compilers and systems. My experiences are similar to those John is describing. Basically, given the lexer/tokenizer processes things a character at a time, it itself can end up being the bulk of the parsing time. Usually this is only really noticable though in cases where the tokenizer has to deal with a fairly large amount of text, such as what happens when writing a C parser (and suddenly one may find itself with 10+ MB of preprocessor output to churn through). My assembler had a similar issue, although it is generally fast enough (and ASM code is typically small enough) that this isn't really a major issue. granted, for a C style parser, how this compares with having to check whether or not an identifier is a type-name, is a secondary issue. things like lookups can be potentially fairly expensive, but this can be largely resolved via the use of hash tables, say, using a hash table to identify keywords (mapping the name to a keyword ID-number or similar) and lookup type-names (1). my scripting language partly avoids this issue in the parser by making most statements start with a keyword (pretty much anything that doesn't start with a keyword is an expression), and identifying type-names by the use of the language syntax (making them functionally no different than normal identifiers). (OTOH... the script-language parser is a bit less efficient, and works primarily via identifying statement types via if/else logic and using "!strcmp()" to match keywords). 1: actually, my C parser uses a pretty big hash for type-name lookup, namely 16k entry IIRC, whereas for most other things 1024 or 4096 entry is plenty sufficient (for a chain-hash). the main reason for this is the large number of typedefs in headers like "windows.h", which can easily saturate a 1024 or 4096 entry hash. I have used bigger hash tables though, namely for things like interning strings (64k) and dynamic+interface method-dispatch (256k, but this one is an open-address hash). > DoDi > [The benchmarks I did were a while ago, but they showed a large > fraction of time in the lexer. I wouldn't disagree that building the > symbol table is slow, but figure out some estimate of the ratio of > the number of characters in a source file to the number of tokens, > and that is a rough estimate of how much slower the lexer will be > than the parser. I agree that some current analyses would be useful. > -John] > yep. can't compare exactly, as my parsers tend to be recursive-descent and build ASTs directly.