Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Subject: Re: Is This a Dumb Idea? paralellizing byte codes
Date: Sun, 23 Oct 2022 13:16:54 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-10-056@comp.compilers>
References: <22-10-046@comp.compilers> <22-10-048@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="86284"; mail-complaints-to="abuse@iecc.com"
Keywords: parallel, interpreter
Posted-Date: 23 Oct 2022 12:33:53 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3226

Alain Ketterlin writes:
>I've heard/read several times that byte-code micro-optimizations are not
>worth the trouble.

Apart from the paper below, which is discussed below, what else?

>Here is a paper from 2015 on a related subject
>("Branch prediction and the performance of interpreters -- Don't trust
>folklore"):
>
>https://ieeexplore.ieee.org/document/7054191
>
>(you may find the corresponding research report if you can't access the
>full text from that site). It shows how far processors have gone in what
>was once left to the program designer.

On that I can only say: Not all research papers are trustworthy.
Catchy titles may be a warning signal.

I did my own measurements on a Haswell (the same CPU they used in the
paper) and published them in
<2015Sep7.142507@mips.complang.tuwien.ac.at> ( for those of you who
don't know what to do with Message-IDs).

If you don't want to read that posting, the executive summary is that
the sentence in the abstract "we show that the accuracy of indirect
branch prediction is no longer critical for interpreters." is wrong.
Looking at the run-time in seconds for the large benchmarks:

| shared    non-shared
| --no-     --no-       --no-
|dynamic    dynamic     super    default
|  3.332      2.440     2.276      1.468  benchgc
|  1.652      1.524     1.064      0.544  brainless
|  4.016      3.416     2.288      1.520  brew
|  3.420      3.236     2.232      1.232  cd16sim
|  2.956      2.684     1.484      0.864  fcp
| 13.128      9.848     9.280      7.840  lexex

We see a speedup factor of 1.08-1.37 (but no consistent reduction in
the measured mispredictions) from (non-shared) dispatching the code
for the next VM instruction at the end of the code of every VM
instruction, rather than jumping to a shared piece of dispatch code
(from "shared --no-dynamic" to "non-shared --no-dynamic").

We see a speedup factor of 1.06-1.81 and a reduction in mispredictions
by a factor 1.35-8.76 from replicating the code for each occurrence of
a VM instruction (from "non-shared --no-dynamic" to "--no-super").

We see a speedup factor of 1.18-1.96 and a reduction in branch
mispredictions by up to a factor of 3.2 by then eliminating the
dispatches at the end of non-control-flow VM instructions (from
"--no-super" to default).

The overall speedup factor for all these steps is 1.67-3.42.

The somewhat longer summary from the posting above:

|Haswell's indirect branch prediction is really much
|better than before, but for larger programs running on an interpreter
|like Gforth, replication still provides substantial branch prediction
|improvements that result in significant speedups. And while there is
|no longer a branch prediction advantage to keeping separate indirect
|branches for dispatch, there is still a significant speed advantage;
|and dynamic superinstructions also provide a good speedup, resulting
|in overall speedups by factors 1.67-3.42 for the application
|benchmarks.
|
|Why are the results here different from those in the paper?
|1) Different Interpreter 2) different benchmarks.
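
The speedup factors quoted for each step are ratios of the run times
in the table. As a minimal sketch of that arithmetic (the names
run_time, speedup and speedup_range are made up for this illustration,
and C is used only for convenience), the range for any pair of columns
can be recomputed like this:

```c
enum { NBENCH = 6, NCONF = 4 };

/* Column indices, in the order of the table. */
enum { SHARED_NO_DYNAMIC, NONSHARED_NO_DYNAMIC, NO_SUPER, DEFAULT_CONF };

/* Run times in seconds, copied from the table. */
static const double run_time[NBENCH][NCONF] = {
    {  3.332, 2.440, 2.276, 1.468 },  /* benchgc   */
    {  1.652, 1.524, 1.064, 0.544 },  /* brainless */
    {  4.016, 3.416, 2.288, 1.520 },  /* brew      */
    {  3.420, 3.236, 2.232, 1.232 },  /* cd16sim   */
    {  2.956, 2.684, 1.484, 0.864 },  /* fcp       */
    { 13.128, 9.848, 9.280, 7.840 },  /* lexex     */
};

/* Speedup factor of one benchmark when going from column `from` to `to`. */
double speedup(int bench, int from, int to)
{
    return run_time[bench][from] / run_time[bench][to];
}

/* Smallest and largest speedup factor over all benchmarks for one step. */
void speedup_range(int from, int to, double *lo, double *hi)
{
    *lo = *hi = speedup(0, from, to);
    for (int b = 1; b < NBENCH; b++) {
        double s = speedup(b, from, to);
        if (s < *lo) *lo = s;
        if (s > *hi) *hi = s;
    }
}
```

Calling speedup_range(SHARED_NO_DYNAMIC, DEFAULT_CONF, &lo, &hi)
covers all three steps at once; adjacent column pairs give the
individual steps.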
|If you write an interpreter, and look for performance, should you go
|for interpreter optimizations like threaded-code, replication, and
|dynamic superinstructions like I suggest, or just use a switch-based
|interpreter like the paper suggests? Threaded code is a good idea
|with little cost in any case. If that provides a significant speedup
|and your VM instruction implementations are short (you run into cache
|trouble with long VM instructions [vitale&abdelrahman04]), then
|replication with superinstructions will probably give a good speedup.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
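
For readers who have not seen the two dispatch styles side by side,
here is a minimal sketch. It is not code from Gforth or from the
paper: the three-instruction stack VM and the names run_switch,
run_threaded, OP_PUSH, OP_ADD and OP_HALT are invented for this
illustration. run_switch sends every VM instruction back through one
shared dispatch point (typically compiled to a single indirect
branch). run_threaded ends each VM instruction with its own indirect
jump, using the GNU C labels-as-values extension (gcc and clang, not
ISO C). For brevity it still looks the target up in a table indexed
by opcode; a direct-threaded interpreter would store the label
addresses in the VM code itself. Replication and superinstructions
are not shown.

```c
/* Hypothetical opcodes of a toy stack VM, for illustration only. */
enum { OP_PUSH, OP_ADD, OP_HALT };

/* Switch-based interpreter: all VM instructions share one dispatch
   point, the switch at the top of the loop. */
long run_switch(const long *code)
{
    long stack[16];
    long *sp = stack;
    const long *ip = code;

    for (;;) {
        switch (*ip++) {
        case OP_PUSH: *sp++ = *ip++;        break;  /* push literal   */
        case OP_ADD:  sp--; sp[-1] += sp[0]; break; /* add top two    */
        case OP_HALT: return sp[-1];                /* result on top  */
        default:      return -1;                    /* unknown opcode */
        }
    }
}

/* Non-shared dispatch: every VM instruction ends with its own
   indirect jump (NEXT) to the code of the next VM instruction. */
long run_threaded(const long *code)
{
    /* GNU C extension: &&label yields the address of a label. */
    static void *const target[] = { &&do_push, &&do_add, &&do_halt };
    long stack[16];
    long *sp = stack;
    const long *ip = code;

#define NEXT goto *target[*ip++]
    NEXT;
do_push: *sp++ = *ip++;         NEXT;
do_add:  sp--; sp[-1] += sp[0]; NEXT;
do_halt: return sp[-1];
#undef NEXT
}
```

Both functions take the same VM code, for example the array
{ OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT }, so the two dispatch
styles can be timed against each other on identical programs.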