Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!news.dougwise.org!nntpfeed.proxad.net!proxad.net!feeder1-2.proxad.net!74.125.64.80.MISMATCH!postnews.google.com!news1.google.com!npeer02.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!post01.iad.highwinds-media.com!newsfe08.iad.POSTED!00000000!not-for-mail
From: Owen Jacobson <angrybaldguy@gmail.com>
Newsgroups: comp.lang.java.programmer
Message-ID: <201104280037257944-angrybaldguy@gmailcom>
References: <295e16b3-2ed8-4529-bfb0-1cc26ed93ad6@d26g2000prn.googlegroups.com> <188cz6ta97kkm$.dlg@kimmeringer.de> <ip6igu$j0g$2@news.albasani.net> <aecb4896-0502-4d79-bc7d-c6ef2219ed94@r35g2000prj.googlegroups.com> <ip8a8m$iul$1@dont-email.me> <wKqdnZYsRd7CkCXQnZ2dnUVZ_gGdnZ2d@earthlink.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: Java left shift and right shift operators.
User-Agent: Unison/2.1.4
Lines: 101
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 28 Apr 2011 04:37:27 UTC
Date: Thu, 28 Apr 2011 00:37:25 -0400
Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:3326

On 2011-04-27 08:34:32 -0400, Patricia Shanahan said:

> On 4/26/2011 10:40 PM, Travers Naran wrote:
> ...
>> ;; if (b > 0) return a << b else return a >> -b
>> compare b to 0
>> jump-less-than to L1
>> a := a << b
>> jump L2
>> L1:
>> b := -b
>> a := a >> b
>> L2:
>> return  ;; result in a
> ...
>> It's the jump that's adding the time. But on an x86, that kind of
>> instruction can be entirely cached and the Intel/AMD processors can
>> perform predictive branching to speed up this bit of code.
>> 
>> The only way you should be getting 20-50 times longer is if you
>> aren't using any sort of optimizing run-time. Example: running always
>> as byte code without compiling.
>> 
>> But usually this kind of if-checking and bit-twiddling is so fast
>> that it's not worth optimizing beyond a single statement. What are
>> you coding that makes this the most expensive calculation in your
>> program?
> ...
> 
> The obvious possibility is a random mix of left and right shifts, with
> the two directions equally probable. In that case, the hardware branch
> prediction would mis-predict, on average, half the time.

There are worse distributions. Consider a program that alternates 
between the two possible branch outcomes, assuming a four-bit branch 
prediction counter like those found in several generations of Intel 
chips: it's possible that the sequence will start by acting opposite to 
the prediction and flipping the predicted direction for the next branch 
(from b01 to b10 or vice versa). A strictly alternating sequence of 
branches would then be mis-predicted *every time*.

One of the things that could've made Itanic so cool was the ability to 
pre-emptively evaluate both sides of a branch, many instructions in 
advance, and throw away all the alternatives but the one that "actually 
happened". Unfortunately, programming around this was really hard even 
for experienced compiler writers, and the chips were notoriously 
power-hungry and expensive, so it never really took off.

> When the hardware branch prediction gets it wrong, the result is a
> complete pipeline flush. With deep pipelines and multiple instructions
> in each pipeline stage, a pipeline flush costs the equivalent of dozens
> of simple arithmetic operations.
> 
> My question is whether the measurement was from the actual program, or
> from a benchmark that may differ from the real program in ways that
> affect branch prediction.

My question is "who cares?"

There are lots of ways to squeeze more performance out of an 
otherwise-optimal algorithm: make better use of vector operations, 
migrate some or all of the work to the GPU (a massively-parallel 
floating point unit), or find ways to avoid running it at all. 
Measuring the cost of an if statement is pretty far down the list of 
tuning choices.

(Having seen examples of Sanny's code in other newsgroups, I am 
extremely doubtful that he's reached the "optimal algorithm" stage. 
However, I haven't seen his code, so it's possible.)

>> If branching is "20-50 times longer" than any of the operations
>> (negation or shifting), how could you re-write this assembly to use a
>> single branch statement?
> 
> Why bother? Your sample code only has one conditional branch,
> "jump-less-than to L1". Unconditional jumps to a fixed target, such as
> "jump L2" are always correctly predicted, and take about the same time
> as a simple arithmetic operation.

Unless they force a cache stall. For a routine as trivial as the one 
Travers posted, it's unlikely that the final 'return' would be on a 
separate cache line from the start of the routine, but in a longer 
routine, replacing the jump-to-return with a return might actually keep 
the end of the routine out of cache.

Of course, if the code that called this routine has fallen out of 
cache, the return instruction will force a cache stall, instead.

> The method is short enough that the
> JVM will in-line it anywhere it is frequently called so we don't have to
> worry about the return.

Agree. Even though it's not actually magic and doesn't always produce 
totally optimal (either in time or in space) code, the JIT compiler 
produces surprisingly good code most of the time. Second-guessing it 
down at the level of individual machine instructions is the sort of 
thing that impresses bystanders but rarely accomplishes anything useful.

-o