Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!news.dougwise.org!nntpfeed.proxad.net!proxad.net!feeder1-2.proxad.net!74.125.64.80.MISMATCH!postnews.google.com!news1.google.com!npeer02.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!post01.iad.highwinds-media.com!newsfe08.iad.POSTED!00000000!not-for-mail From: Owen Jacobson Newsgroups: comp.lang.java.programmer Message-ID: <201104280037257944-angrybaldguy@gmailcom> References: <295e16b3-2ed8-4529-bfb0-1cc26ed93ad6@d26g2000prn.googlegroups.com> <188cz6ta97kkm$.dlg@kimmeringer.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: 8bit Subject: Re: Java left shift and right shift operators. User-Agent: Unison/2.1.4 Lines: 101 X-Complaints-To: abuse@UsenetServer.com NNTP-Posting-Date: Thu, 28 Apr 2011 04:37:27 UTC Date: Thu, 28 Apr 2011 00:37:25 -0400 Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:3326 On 2011-04-27 08:34:32 -0400, Patricia Shanahan said: > On 4/26/2011 10:40 PM, Travers Naran wrote: > ... >> ;; if (b > 0) return a << b else return a >> -b >> compare b to 0 >> jump-less-than to L1 >> a := a << b >> jump L2 >> L1: >> b := -b >> a := a >> b >> L2: >> return ;; result in a > ... >> It's the jump that's adding the time. But on an x86, that kind of >> instruction can be entirely cached and the Intel/AMD processors can >> perform predictive branching to speed up this bit of code. >> >> The only way you should be getting 20-50 times longer is if you >> aren't using any sort of optimizing run-time. Example: running always >> as byte code without compiling. >> >> But usually this kind of if-checking and bit-twiddling is so fast >> that it's not worth optimizing beyond a single statement. What are >> you coding that makes this the most expensive calculation in your >> program? > ... > > The obvious possibility is a random mix of left and right shifts, with > the two directions equally probable. In that case, the hardware branch > prediction would mis-predict, on average, half the time. There are worse distributions. Consider a program that alternates between the two possible branch outcomes, assuming a four-bit branch prediction counter like those found in several generations of Intel chips: it's possible that the sequence will start by acting opposite to the prediction and flipping the predicted direction for the next branch (from b01 to b10 or vice versa). A strictly alternating sequence of branches would then be mis-predicted *every time*. One of the things that could've made Itanic so cool was the ability to pre-emptively evaluate both sides of a branch, many instructions in advance, and throw away all the alternatives but the one that "actually happened". Unfortunately, programming around this was really hard even for experienced compiler writers, and the chips were notoriously power-hungry and expensive, so it never really took off. > When the hardware branch prediction gets it wrong, the result is a > complete pipeline flush. With deep pipelines and multiple instructions > in each pipeline stage, a pipeline flush costs the equivalent of dozens > of simple arithmetic operations. > > My question is whether the measurement was from the actual program, or > from a benchmark that may differ from the real program in ways that > affect branch prediction. My question is "who cares?" There are lots of ways to squeeze more performance out of an otherwise-optimal algorithm: make better use of vector operations, migrate some or all of the work to the GPU (a massively-parallel floating point unit), or find ways to avoid running it at all. Measuring the cost of an if statement is pretty far down the list of tuning choices. (Having seen examples of Sanny's code in other newsgroups, I am extremely doubtful that he's reached the "optimal algorithm" stage. However, I haven't seen his code, so it's possible.) >> If branching is "20-50 times longer" than any of the operations >> (negation or shifting), how could you re-write this assembly to use a >> single branch statement? > > Why bother? Your sample code only has one conditional branch, > "jump-less-than to L1". Unconditional jumps to a fixed target, such as > "jump L2" are always correctly predicted, and take about the same time > as a simple arithmetic operation. Unless they force a cache stall. For a routine as trivial as the one Travers posted, it's unlikely that the final 'return' would be on a separate cache line from the start of the routine, but in a longer routine, replacing the jump-to-return with a return might actually keep the end of the routine out of cache. Of course, if the code that called this routine has fallen out of cache, the return instruction will force a cache stall, instead. > The method is short enough that the > JVM will in-line it anywhere it is frequently called so we don't have to > worry about the return. Agree. Even though it's not actually magic and doesn't always produce totally optimal (either in time or in space) code, the JIT compiler produces surprisingly good code most of the time. Second-guessing it down at the level of individual machine instructions is the sort of thing that impresses bystanders but rarely accomplishes anything useful. -o