Re: Is This a Dumb Idea? paralellizing byte codes

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups	comp.compilers
Subject	Re: Is This a Dumb Idea? paralellizing byte codes
Date	2022-10-23 13:16 +0000
Organization	Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID	<22-10-056@comp.compilers>
References	<22-10-046@comp.compilers> <22-10-048@comp.compilers>

Alain Ketterlin <alain@universite-de-strasbourg.fr> writes:
>I've heard/read several times that byte-code micro-optimizations are not
>worth the trouble.

Apart from the paper you cite, which I discuss below, where else?

>Here is a paper from 2015 on a related subject
>("Branch prediction and the performance of interpreters -- Don't trust
>folklore"):
>
>https://ieeexplore.ieee.org/document/7054191
>
>(you may find the corresponding research report if you can't access the
>full text from that site). It shows how far processors have gone in what
>was once left to the program designer.

On that I can only say: Not all research papers are trustworthy.
Catchy titles may be a warning signal.

I did my own measurements on a Haswell (the same CPU they used in the
paper) and published them in
<2015Sep7.142507@mips.complang.tuwien.ac.at>
(<http://al.howardknight.net/?ID=158702747000> for those of you who
don't know what to do with Message-IDs).

If you don't want to read that posting, the executive summary is that
the sentence in the abstract "we show that the accuracy of indirect
branch prediction is no longer critical for interpreters." is wrong.

Looking at the run-time in seconds for the large benchmarks:

| shared  non-shared
|  --no-       --no-      --no-
|dynamic     dynamic      super  default
|  3.332       2.440      2.276    1.468 benchgc
|  1.652       1.524      1.064    0.544 brainless
|  4.016       3.416      2.288    1.520 brew
|  3.420       3.236      2.232    1.232 cd16sim
|  2.956       2.684      1.484    0.864 fcp
| 13.128       9.848      9.280    7.840 lexex

We see a speedup factor of 1.06-1.37 (but, when measuring
mispredictions, no consistent reduction in them) from non-shared
dispatch, i.e., dispatching to the code for the next VM instruction at
the end of the code of every VM instruction, rather than jumping to a
shared piece of dispatch code (from "shared --no-dynamic" to
"non-shared --no-dynamic").

We see a speedup factor of 1.06-1.81 and a reduction in mispredictions
by a factor of 1.35-8.76 from replicating the code for each occurrence
of a VM instruction (from "non-shared --no-dynamic" to "--no-super").

We see a speedup factor of 1.18-1.96 and a reduction in branch
mispredictions by up to a factor of 3.2 by then eliminating the
dispatches at the end of non-control-flow VM instructions (from
"--no-super" to default).
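
To make the three steps concrete, here is a toy model in Python of how
many indirect-branch sites each scheme leaves for the branch predictor,
and how many dispatches get executed. This is only a sketch under my
own assumptions, not Gforth's implementation: the opcode names, the
CONTROL_FLOW set, and both functions are made up for illustration.
The scheme names map to the columns above as "shared" = "shared
--no-dynamic", "non-shared" = "non-shared --no-dynamic", "replicated"
= "--no-super", and "superinstructions" = default.

```python
# Toy model of dispatch schemes; an illustration, not Gforth's code.
# The opcode names below are hypothetical.
CONTROL_FLOW = {"branch", "?branch", "call", "exit"}

def dispatch_sites(program, scheme):
    """Count the distinct indirect-branch sites used to dispatch `program`."""
    if scheme == "shared":
        # Every VM instruction jumps back to one common dispatch routine.
        return 1
    if scheme == "non-shared":
        # Each VM instruction implementation ends in its own dispatch.
        return len(set(program))
    if scheme == "replicated":
        # Each occurrence gets its own copy of the code, hence its own dispatch.
        return len(program)
    if scheme == "superinstructions":
        # Copies are concatenated; only control-flow instructions still dispatch.
        return sum(1 for op in program if op in CONTROL_FLOW)
    raise ValueError(f"unknown scheme: {scheme}")

def dispatches_executed(trace, scheme):
    """Count the dispatches performed while executing the instructions in `trace`."""
    if scheme in ("shared", "non-shared", "replicated"):
        # One dispatch per executed VM instruction.
        return len(trace)
    if scheme == "superinstructions":
        # Non-control-flow instructions fall through to the next copy.
        return sum(1 for op in trace if op in CONTROL_FLOW)
    raise ValueError(f"unknown scheme: {scheme}")

program = ["lit", "+", "dup", "lit", "+", "?branch"]
for scheme in ("shared", "non-shared", "replicated", "superinstructions"):
    print(scheme, dispatch_sites(program, scheme),
          dispatches_executed(program, scheme))
```

The model says nothing about run time or about how a real predictor
behaves; it only shows that replication gives each occurrence its own
branch site, and that superinstructions remove the dispatches of
non-control-flow instructions altogether.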

The overall speedup factor for all these steps is 1.67-3.42.
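
For those who want to recompute the factors, here is a small Python
sketch that derives the per-step and overall speedup ranges from the
run times in the table above. The names TIMES and speedup_range are
mine; the numbers are copied from the table.

```python
# Run times in seconds, copied from the table above.
# Columns: shared --no-dynamic, non-shared --no-dynamic, --no-super, default.
TIMES = {
    "benchgc":   (3.332, 2.440, 2.276, 1.468),
    "brainless": (1.652, 1.524, 1.064, 0.544),
    "brew":      (4.016, 3.416, 2.288, 1.520),
    "cd16sim":   (3.420, 3.236, 2.232, 1.232),
    "fcp":       (2.956, 2.684, 1.484, 0.864),
    "lexex":     (13.128, 9.848, 9.280, 7.840),
}

def speedup_range(before, after):
    """Return (min, max) of time[before] / time[after] over all benchmarks."""
    ratios = [t[before] / t[after] for t in TIMES.values()]
    return min(ratios), max(ratios)

STEPS = [
    ("shared -> non-shared", 0, 1),
    ("non-shared -> --no-super", 1, 2),
    ("--no-super -> default", 2, 3),
    ("overall", 0, 3),
]

for label, before, after in STEPS:
    low, high = speedup_range(before, after)
    print(f"{label}: {low:.2f}-{high:.2f}")
```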

The somewhat longer summary from the posting above:

|Haswell's indirect branch prediction is really much
|better than before, but for larger programs running on an interpreter
|like Gforth, replication still provides substantial branch prediction
|improvements that result in significant speedups.  And while there is
|no longer a branch prediction advantage to keeping separate indirect
|branches for dispatch, there is still a significant speed advantage;
|and dynamic superinstructions also provide a good speedup, resulting
|in overall speedups by factors 1.67-3.42 for the application
|benchmarks.
|
|Why are the results here different from those in the paper?
|1) Different Interpreter 2) different benchmarks.  If you write an
|interpreter, and look for performance, should you go for interpreter
|optimizations like threaded-code, replication, and dynamic
|superinstructions like I suggest, or just use a switch-based
|interpreter like the paper suggests?  Threaded code is a good idea
|with little cost in any case.  If that provides a significant speedup
|and your VM instruction implementations are short (you run into cache
|trouble with long VM instructions [vitale&abdelrahman04]), then
|replication with superinstructions will probably give a good speedup.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
