Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!border3.nntp.dca.giganews.com!Xl.tags.giganews.com!border1.nntp.dca.giganews.com!nntp.giganews.com!local2.nntp.dca.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Sat, 11 Feb 2012 17:35:30 -0600
Message-ID: <4F36FB3A.3070304@SPAM.comp-arch.net>
Date: Sat, 11 Feb 2012 15:35:22 -0800
From: "Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Reply-To: andy@SPAM.comp-arch.net
Organization: comp-arch.net
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20111222 Thunderbird/9.0.1
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: M68k add to memory is not a mistake any more
References: <ggtgp-8D1AEA.03180231012012@netnews.mchsi.com> <jgeu27$9rg$1@dont-email.me> <a7csv8-emn1.ln1@ntp6.tmsw.no> <jgiibd$hkq$1@dont-email.me> <4F2CD4FA.4050004@SPAM.comp-arch.net> <jgjn1i$nqe$1@dont-email.me> <ave009-2ft1.ln1@ntp6.tmsw.no> <jgk6tk$kcb$1@dont-email.me> <j17209-p972.ln1@ntp6.tmsw.no> <jguq3a$rb6$1@dont-email.me> <4F341C7B.9070703@SPAM.comp-arch.net> <g32f09-nl8.ln1@ntp6.tmsw.no> <jh6c36$90d$1@USTR-NEWS.TR.UNISYS.COM>
In-Reply-To: <jh6c36$90d$1@USTR-NEWS.TR.UNISYS.COM>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 53
Xref: x330-a1.tempe.blueboxinc.net comp.arch:5885

On 2/11/2012 10:29 AM, Tim McCaffrey wrote:
> In article<g32f09-nl8.ln1@ntp6.tmsw.no>, "terje.mathisenattmsw.no" says...
>>
>
>> The part of disliking anything totally fixed is borne out by the x86
>> usage of CL/CX/ECX/RCX for all variable shift counts: I have never seen
>> a compiler that understood that said shift(s) could be so
>> (latency-)critical that all the rest of the register allocations had to
>> be switched around, just so that those particular shift counts would end
>> up naturally in ECX.
>>
>
> Not to mention all the other instructions with dedicated registers (string
> instructions (ECX, EDI, ESI, EAX), I/O, loop & jcxz, Multiply, divide, etc).
> (BTW, why is jcxz expensive and CMP CX,0/JZ is not?)

http://download.intel.com/design/pentiumii/manuals/24281603.pdf gives 
JCXZ a uop count of 2, versus 1 uop for the "normal" Jccs.

I don't think that JCXZ is intrinsically expensive.  It could be decoded 
into a uop TargetIP := JumpIfReg0(CX,Offset), if Intel's datapath 
supported that.

It's just that there is only so much room in the fast path decoders. The 
"normal" Jcc pretty much have to be fast. They look something like 
TargetIP := Jcc(Flag,Offset).
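
To make the two uop forms above concrete, here is a minimal Python sketch. The function names mirror the pseudo-uops in the text. The explicit next_ip argument and the 16-bit masking of CX are assumptions of the sketch, not a description of Intel's actual datapath.

```python
def jcc(flag: bool, next_ip: int, offset: int) -> int:
    """TargetIP := Jcc(Flag, Offset): branch on an already-computed flag."""
    return next_ip + offset if flag else next_ip


def jump_if_reg0(cx: int, next_ip: int, offset: int) -> int:
    """TargetIP := JumpIfReg0(CX, Offset): branch if the register is zero."""
    # Assumption: CX is 16 bits wide, so only the low 16 bits are tested.
    return next_ip + offset if (cx & 0xFFFF) == 0 else next_ip
```

The only difference between the two is the source operand: a flag bit in one case, a register value compared against zero in the other.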

JCXZ would require something verging on a new uop type. (I say "verging
on" because there is not much difference between JumpIfReg0 and Jcc on
an out-of-order machine if the flags are attached to the high bits of a
physical register - e.g. bits >31 on a 32 bit machine, >63 on a 64 bit
machine, etc.  But if the flags are a completely different type of
dataflow operand, then JCXZ needs a genuinely new uop type.)
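
The "flags in the high bits" case can be sketched as follows. This is a toy model only: the 32-bit width, the choice of bit 32 for ZF, and the helper names alu_sub, jcc_zero and jump_if_reg0 are all hypothetical, chosen to show why the two branch uops end up looking so similar when flags and data travel in the same physical register.

```python
WIDTH = 32                      # assumed data width of the machine
ZF_BIT = WIDTH                  # hypothetical: ZF stored just above the data
DATA_MASK = (1 << WIDTH) - 1


def alu_sub(a: int, b: int) -> int:
    """Return a physical-register value: data in bits 0..31, ZF in bit 32."""
    data = (a - b) & DATA_MASK
    zf = 1 if data == 0 else 0
    return data | (zf << ZF_BIT)


def jcc_zero(preg: int, next_ip: int, offset: int) -> int:
    """Jcc-style uop: tests the flag bit carried in the high bits of preg."""
    return next_ip + offset if (preg >> ZF_BIT) & 1 else next_ip


def jump_if_reg0(preg: int, next_ip: int, offset: int) -> int:
    """JumpIfReg0-style uop: tests the data bits of preg directly."""
    return next_ip + offset if (preg & DATA_MASK) == 0 else next_ip
```

Both uops read one physical register and an offset; they differ only in which bits of that register they inspect.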

JCXZ isn't close to Jcc in the opcode map, so making it fast would have
required an extra pattern - extra minterms - in the fast decoder. But
JCXZ is close to a whole slew of other slow, deprecated instructions -
the LOOP instructions, IN and OUT - so it probably naturally wants to
be covered by the minterms for slowness.
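
The opcode-map point can be illustrated with the one-byte encodings: the short Jccs occupy 0x70..0x7F, while JCXZ (0xE3) sits between LOOPNE/LOOPE/LOOP (0xE0..0xE2) and the immediate-port IN/OUT forms (0xE4..0xE7). The sketch below shows that a single five-bit pattern covers that whole group. The predicate names and the idea that a decoder would key on exactly these bit patterns are assumptions for illustration; the real decoder logic is not described in the post.

```python
# One-byte opcodes from the x86 opcode map.
OPCODES = {
    0xE0: "LOOPNE", 0xE1: "LOOPE", 0xE2: "LOOP", 0xE3: "JCXZ",
    0xE4: "IN AL,imm8", 0xE5: "IN eAX,imm8",
    0xE6: "OUT imm8,AL", 0xE7: "OUT imm8,eAX",
}


def is_short_jcc(opcode: int) -> bool:
    """Short Jcc rel8 occupies 0x70..0x7F: high nibble 0111."""
    return (opcode & 0xF0) == 0x70


def in_slow_group(opcode: int) -> bool:
    """Hypothetical 'slow' minterm: high five bits 11100 cover 0xE0..0xE7."""
    return (opcode & 0xF8) == 0xE0
```

Under this toy model, pulling 0xE3 out of the slow group and into the fast path would need a separate full-byte match, which is the "extra pattern" cost.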

Back on earlier in-order machines, JCXZ was deprecated.  E.g. on Pentium 
it was NP, non-paired.

JCXZ only comes with an 8 bit displacement, as do the LOOP instructions,
whereas the other Jccs can have 16 bit displacements.  This is probably
the kiss of death: JCXZ and LOOP* might well have been made fast if
they had 16 bit displacements, displacements large enough to be useful.
But, with only an 8 bit displacement, nobody uses them; and since
nobody uses them, nobody optimizes for them.
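
For a sense of how little an 8 bit displacement reaches, here is a small sketch of the relative-branch arithmetic. The helper names are hypothetical; the arithmetic is just two's-complement sign extension added to the address of the next instruction.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def branch_target(next_ip: int, disp: int, bits: int) -> int:
    """Target = address of the next instruction + sign-extended displacement."""
    return next_ip + sign_extend(disp, bits)


def reach(bits: int) -> tuple:
    """(min, max) displacement representable in `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
```

An 8 bit displacement reaches only -128..+127 bytes from the next instruction, versus -32768..+32767 for 16 bits, so any loop body or skipped region longer than about 127 bytes is out of range for JCXZ and LOOP*.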

Compare-Jcc fusion, although equivalent in "complexity" to JCXZ, is much 
more useful/much more widely used.