Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Subject: Re: Is This a Dumb Idea? paralellizing byte codes
Date: Sun, 23 Oct 2022 13:16:54 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-10-056@comp.compilers>
References: <22-10-046@comp.compilers> <22-10-048@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="86284"; mail-complaints-to="abuse@iecc.com"
Keywords: parallel, interpreter
Posted-Date: 23 Oct 2022 12:33:53 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3226

Alain Ketterlin <alain@universite-de-strasbourg.fr> writes:
>I've heard/read several times that byte-code micro-optimizations are not
>worth the trouble.

Apart from the paper below, which is discussed below, what else?

>Here is a paper from 2015 on a related subject
>("Branch prediction and the performance of interpreters -- Don't trust
>folklore"):
>
>https://ieeexplore.ieee.org/document/7054191
>
>(you may find the corresponding research report if you can't access the
>full text from that site). It shows how far processors have gone in what
>was once left to the program designer.

On that I can only say: Not all research papers are trustworthy.
Catchy titles may be a warning signal.

I did my own measurements on a Haswell (the same CPU they used in the
paper) and published them in
<2015Sep7.142507@mips.complang.tuwien.ac.at>
(<http://al.howardknight.net/?ID=158702747000> for those of you who
don't know what to do with Message-IDs).

If you don't want to read that posting, the executive summary is that
the sentence in the abstract "we show that the accuracy of indirect
branch prediction is no longer critical for interpreters." is wrong.

Looking at the run-time in seconds for the large benchmarks:

| shared  non-shared
|  --no-       --no-      --no-
|dynamic     dynamic      super  default
|  3.332       2.440      2.276    1.468 benchgc
|  1.652       1.524      1.064    0.544 brainless
|  4.016       3.416      2.288    1.520 brew
|  3.420       3.236      2.232    1.232 cd16sim
|  2.956       2.684      1.484    0.864 fcp
| 13.128       9.848      9.280    7.840 lexex

We see a speedup factor of 1.08-1.37 (but, measuring mispredictions,
no consistent reduction in mispredictions) from (non-shared)
dispatching the code for the next VM instruction at the end of the
code of every VM instruction, rather than jumping to a shared piece of
dispatch code (from "shared --no-dynamic" to "non-shared
--no-dynamic").

We see a speedup factor of 1.06-1.81 and a reduction in mispredictions
by a factor 1.35-8.76 from replicating the code for each occurence of
a VM instruction (from "non-shared --no-dynamic" to "--no-super").

We see a speedup factor of 1.18-1.96 and a reduction in branch
mispredictions by up to a factor of 3.2 by then eliminating the
dispatches at the end of non-control-flow VM instructions (from
"--no-super" to default).

The overall speedup factor for all these steps is 1.67-3.42.

The somewhat longer summary from the posting above:

|Haswell's indirect branch prediction is really much
|better than before, but for larger programs running on an interpreter
|like Gforth, replication still provides substantial branch prediction
|improvements that result in significant speedups.  And while there is
|no longer a branch prediction advantage to keeping separate indirect
|branches for dispatch, there is still a significant speed advantage;
|and dynamic superinstructions also provide a good speedup, resulting
|in overall speedups by factors 1.67-3.42 for the application
|benchmarks.
|
|Why are the results here different from those in the paper?
|1) Different Interpreter 2) different benchmarks.  If you write an
|interpreter, and look for performance, should you go for interpreter
|optimizations like threaded-code, replication, and dynamic
|superinstructions like I suggest, or just use a switch-based
|interpreter like the paper suggests?  Threaded code is a good idea
|with little cost in any case.  If that provides a significant speedup
|and your VM instruction implementations are short (you run into cache
|trouble with long VM instructions [vitale&abdelrahman04]), then
|replication with superinstructions will probably give a good speedup.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/