Path: csiph.com!1.us.feeder.erje.net!feeder.erje.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: anton@mips.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.compilers Subject: Re: Is This a Dumb Idea? paralellizing byte codes Date: Sun, 23 Oct 2022 12:33:40 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-10-055@comp.compilers> References: <22-10-046@comp.compilers> Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="85659"; mail-complaints-to="abuse@iecc.com" Keywords: optimize, interpreter Posted-Date: 23 Oct 2022 12:32:50 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3225 Jon Forrest writes: >Modern CPUs employ all kinds of clever techniques to improve >instruction level parallelism (ILP). I was wondering if it >makes sense to try to employ similar techniques in the >virtual machines used to execute byte code produced by language >compilers. I'll first assume that you mean virtual-machine interpreters. First of all, if you use a modern CPU with OoO execution, it does much of that by itself. If you use an in-order execution CPU, you can try to software-pipeline the execution of the virtual-machine instructions, in particular perform virtual-machine instruction fetching early [hoogerbrugge+99]. It is helpful to design the virtual machine for that: Have a mostly fixed encoding, so that, e.g., the dispatch can be prepared in advance, rather than, e.g., having to wait until you know about the present instruction before you can start the fetch of the next instruction. >By that I mean what if virtual machines were to examine byte code >streams to detect when it would be safe to execute multiple >byte codes concurrently? Then, based on its findings, the virtual >machine would execute as many byte codes concurrently as is safe. There has been quite a bit work on combining sequences of virtual-machine instructions into superinstructions [hughes82,proebsting95,naessen+01,ertl&gregg03euroforth,eller05], but the typical approach was to combine a sequence of dependent instructions, which has several advantages: * One VM instruction often communicates the data to subsequent ones through memory (i.e., D-cache), which often incurs a latency of several cycles. The superinstruction can communicate the data through a register. * If the data is addressed explicitly by the VM instruction (e.g., in a register-oriented VM), that requires code for dealing with this addressing, which can be eliminated for the data passed implicitly in a superinstruction. E.g. "add r3=r1*r2; mul r5=r3+r4" requires dealing with six VM "register" accesses, while "madd r5=r1*r2+r4" requires only four. If you are actually thinking about JIT compilation of the virtual machine, you can use the usual techniques for extracting instruction-level parallelism: instruction scheduling (within basic blocks, or within traces or superblocks), and software pipelining. Some of these techniques incur a significant overhead, however, so you may not want to employ them in a JIT compiler. >I'm also worried that internal virtual machine locking requirements >might make this idea infeasible. For example, in a virtual machine with >a global interpreter lock, would it be possible for there to be any >concurrent execution? The memory-ordering requirements may restrict the reorderings of memory accesses, but otherwise I don't see a problem. >This idea, if it works, would be a great way to take advantage of >multiple cores without having to rewrite any user code. The big >question is whether it would work. The stuff I have outlined does not introduce additional execution threads, and I don't think you can find thread-level parallelism at the virtual-machine level unless you have your virtual machine (and the programming language that it models) designed for exactly that. - anton @string{spe="Software---Practice and Experience"} @Article{hoogerbrugge+99, author = "Jan Hoogerbrugge and Lex Augusteijn and Jeroen Trum and Rik van de Wiel", title = "A code compression system based on pipelined interpreters", journal = spe, volume = "29", number = "11", pages = "1005--1023", month = sep, year = "1999", OPTannote= "" } @InProceedings{hughes82, author = "R. J. M. Hughes", title = "Super-Combinators", booktitle = "Conference Record of the 1980 LISP Conference, Stanford, CA", pages = "1--11", publisher = "ACM", address = "New York", year = "1982", OPTannote = {} } @InProceedings{proebsting95, author = "Todd A. Proebsting", title = "Optimizing an {ANSI~C} Interpreter with Superoperators", crossref = "popl95", pages = "322--332", annote = "Interpreter performance is optimized by combining operators during code generation, when they are still organized as trees. So a different, optimized interpreter is used for each program. Speedups of 1.8--3.1 are achieved, but this is probably strongly dependent on the original set of operators. The paper uses lccs intermediate code operators \cite{fraser&hanson91a}." } @Proceedings{popl95, booktitle = "Principles of Programming Languages (POPL '95)", title = "Principles of Programming Languages (POPL '95)", year = "1995", key = "POPL '95" } @InProceedings{naessen+01, author = {Henrik N\"ass\'en and Mats Carlsson and Konstantinos Sagonas}, title = {Instruction Merging and Specialization in the {SICStus Prolog} Virtual Machine}, booktitle = {Principles and Practice of Declarative Programming (PPDP01)}, OPTpages = {}, year = {2001}, url = {http://www.csd.uu.se/%7Ekostis/Papers/sicstus.ps.gz}, annote = {Gives an overview of various WAM optimization techniques and then evaluates combining (merging) pairs of instructions into (about 60) superinstructions, specializing WAM instructions for specific immediate arguments (in particular, specific registers, for about 200 new instructions), and a combination of both (for about 100 new instructions). Instruction merging produces small speedups (about 8\% on average), specialization produces a small slowdown on average, and both combined are about as fast as instruction merging alone. VM code size is reduced by around 10\% with these techniques, and the VM emulator size grows by up to 15KB.} } @InProceedings{ertl&gregg03euroforth, author = {M. Anton Ertl and David Gregg}, title = {Implementation Issues for Superinstructions in {Gforth}}, booktitle = {EuroForth 2003 Conference Proceedings}, OPTpages = {}, year = {2003}, URL = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03euroforth.ps.gz}, abstract = {Combining Forth primitives into superinstructions provides nice speedups. Several approaches to superinstructions were explored in the Gforth project. This paper discusses the effects of these approaches on performance, compilation time, implementation effort, and on programming tools such as the decompiler and backtracing.} } @MastersThesis{eller05, author = {Helmut Eller}, title = {Optimizing Interpreters with Superinstructions}, school = {TU Wien}, year = {2005}, type = {Diplomarbeit}, url = {http://www.complang.tuwien.ac.at/Diplomarbeiten/eller05.ps.gz}, abstract = {Superinstructions can be used to make virtual machine (VM) interpreters faster. A superinstruction is a combination of simpler VM instructions which can be executed faster than the corresponding sequence of simpler VM instructions, because the interpretative overhead, like instruction dispatch and argument fetching, is reduced. This work discusses the following three topics related to superinstructions. First, I present some heuristics to choose superinstructions. I evaluated the heuristics for Forth and Java programs. If the number of allowed superinstructions was very large, $> 1000$, then the heuristic which chooses all possible subsequences up to length 4 achieved the best results. If the number of allowed superinstructions was more limited, then a heuristic which favors short sequences and sequences which occur in many different programs and many different basic blocks performed better than the others. Second, I compare a simple greedy algorithm and an optimal algorithm to cover a program with superinstructions. I found that the greedy algorithm achieves almost optimal results. Finally, I compare superinstructions with non-sequential patterns. In my experiments, superinstructions performed slightly better than non-sequential patterns.} } -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/