Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.glorb.com!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!border4.nntp.dca.giganews.com!border2.nntp.dca.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED!nerds-end From: Hans-Peter Diettrich Newsgroups: comp.compilers Subject: Re: lexer speed, was Bison Date: Mon, 20 Aug 2012 16:14:38 +0100 Organization: Compilers Central Lines: 51 Sender: johnl@iecc.com Approved: comp.compilers@iecc.com Message-ID: <12-08-010@comp.compilers> References: <12-08-005@comp.compilers> <12-08-006@comp.compilers> <12-08-008@comp.compilers> NNTP-Posting-Host: news.iecc.com Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Trace: leila.iecc.com 1345514847 92794 64.57.183.58 (21 Aug 2012 02:07:27 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Tue, 21 Aug 2012 02:07:27 +0000 (UTC) Keywords: parse, performance, comment Posted-Date: 20 Aug 2012 22:07:27 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:727 Hans-Peter Diettrich schrieb: >> [Compilers spend a lot of time in the lexer, because that's the only >> phase that has to look at the input one character at a time. -John] > > When the source code resides in a memory buffer, the time for reading > e.g. the characters of an identifier (in the lexer) is neglectable vs. > the time spent in lookup and entering the identifier into a symbol table > (in the parser). > > Even if a lexer reads single characters from a file, most OSs maintain > their own file buffer, so that little overhead is added over the > program-buffered solution. > > I really would like to see some current benchmarks about the behaviour > of current compilers and systems. > > DoDi > [The benchmarks I did were a while ago, but they showed a large > fraction of time in the lexer. I wouldn't disagree that building the > symbol table is slow, but figure out some estimate of the ratio of > the number of characters in a source file to the number of tokens, > and that is a rough estimate of how much slower the lexer will be > than the parser. I agree that some current analyses would be useful. > -John] Your estimation suggests that the benchmarked parser does not much more than syntax checking the tokens. In this case I agree that a parser automaton does less work (state transitions) than a lexer automaton (or equivalents), and that detailed benchmarking is not required to proof this fact. I still disagree that "Compilers spend a lot of time in the lexer". My point is that a real *compiler* does not normally spend significant time in neither the lexer nor the parser[1]. In so far it's okay to "profile your compiler to see where it's spending its time and fix what needs to be fixed", as you said :-) [1] The timing depends heavily on what tasks are assigned to the parser itself. Is code generation part of the parser, or part of a subsequent (optimization...) stage? What about symbol table (scopes) management and lookup, and an eventual preprocessor which IMO in an C compiler consumes more time than the lexer and parser altogether? But further discussion of these academic issues doesn't help in spotting the real bottlenecks of any concrete compiler ;-) DoDi [Perhaps you could do some experiments and tell us whether it confirms your intuition. Like I said, when I profiled some compilers, I was surprised to see how much time they spent grinding through the characters in the input. -John]