Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED!nerds-end From: "BartC" Newsgroups: comp.compilers Subject: Re: lexer speed, was Bison Date: Tue, 21 Aug 2012 17:39:38 +0100 Organization: A noiseless patient Spider Lines: 52 Sender: johnl@iecc.com Approved: comp.compilers@iecc.com Message-ID: <12-08-013@comp.compilers> References: <12-08-005@comp.compilers> <12-08-006@comp.compilers> <12-08-008@comp.compilers> NNTP-Posting-Host: news.iecc.com Mime-Version: 1.0 Content-Type: text/plain; format=flowed; charset="UTF-8"; reply-type=original Content-Transfer-Encoding: 7bit X-Trace: leila.iecc.com 1345597109 14839 64.57.183.58 (22 Aug 2012 00:58:29 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Wed, 22 Aug 2012 00:58:29 +0000 (UTC) Keywords: parse, lex, performance Posted-Date: 21 Aug 2012 20:58:29 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:730 "Hans-Peter Diettrich" wrote in message >> [Compilers spend a lot of time in the lexer, because that's the only >> phase that has to look at the input one character at a time. -John] > > When the source code resides in a memory buffer, the time for reading > e.g. the characters of an identifier (in the lexer) is neglectable vs. > the time spent in lookup and entering the identifier into a symbol table > (in the parser). > > Even if a lexer reads single characters from a file, most OSs maintain > their own file buffer, so that little overhead is added over the > program-buffered solution. > > I really would like to see some current benchmarks about the behaviour > of current compilers and systems. I was recently testing a new language with a bare lexer for a near-identical syntax. This test, which excluded file input and any identifer lookups (just tokenising), ran at 17M chars/second, 3M tokens/second, and some 800K lines/second, on a low-end desktop PC. At that rate, tokenising the entire compiler might take 25msec. If that represented the bulk of the compilation time, then I'd be quite happy! But I suspect it will be just a fraction. While lexers do have to deal with a character at a time, the processing might be quite simple (eg. merely six instructions for each character of an identifier in my version). Oh, I should mention that this lexer is written in an interpreted, dynamically typed language. That means a native code version might process some 20M tokens per second (3 or 4 msec to scan the entire compiler). That doesn't really suggest it's going to be a bottleneck! > [The benchmarks I did were a while ago, but they showed a large > fraction of time in the lexer. ... -John] Testing an older compiler (which doesn't lend itself to isolating just the bare tokeniser), showed it spent about 50% of compilation time in tokenising, lookups and parsing. This generates simple byte-code, so with a more sophisticated one with more passes, optimisation and so on, it would likely be even less. (But it's also possible my compilers are highly inefficient apart from the tokenising parts..) -- Bartc