Re: Ne(ish) IBM z196 synchronization instructions

Message-ID	<4F5777FD.4090505@SPAM.comp-arch.net>
Date	2012-03-07 07:00 -0800
From	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Organization	comp-arch.net
Newsgroups	comp.arch
Subject	Re: Ne(ish) IBM z196 synchronization instructions
References	<4F56E21F.2010903@SPAM.comp-arch.net> <9a6b0bf4-9cbc-4dc9-a2dd-9aa34f078f65@d17g2000vba.googlegroups.com>


On 3/7/2012 5:43 AM, Paul A. Clayton wrote:
> On Mar 6, 11:20 pm, "Andy (Super) Glew"<a...@SPAM.comp-arch.net>
> wrote:
> [snip]
>> It is a bit unusual to be able to extend an existing instruction in this
>> way.  If these existing instructions were in wide use, one would expect
>> existing code to be somewhat slowed down.
>> However (1) I suspect the trend towards simpler OOO instructions has
>> already made these "add to memory" instructions slower,
>> while on the other hand (2) advanced implementations make such atomic
> instructions negligibly slower than their non-atomic versions.
>
> Why would guaranteeing the atomicity of aligned add immediate
> to memory make such noticeably slower in the uncontended case
> which would be safe without atomicity?

It doesn't need to, as I think I noted.  Ideally, an uncontended atomic 
RMW should have exactly the same performance as a non-atomic RMW.

However, Intel has only recently begun approaching that ideal, even 
though I started down that path circa 1991.
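To make the comparison concrete, here is a minimal C++ sketch of the two kinds of add-to-memory being discussed. The function names are made up for illustration and are not from any vendor's code. On x86, compilers typically turn the `fetch_add` into a LOCK-prefixed add, while the plain version is an ordinary add to memory. The sketch only shows that both give the same answer when uncontended, and that only the atomic one is safe to share; it says nothing about how fast either is on a given machine.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Non-atomic add-to-memory: fine when only one thread touches x.
inline void add_plain(long& x, long v) { x += v; }

// Atomic add-to-memory (hypothetical helper name).
inline void add_atomic(std::atomic<long>& x, long v) {
    x.fetch_add(v, std::memory_order_seq_cst);
}

// Uncontended case: one thread, both variants compute the same sum.
long sum_plain(long iters) {
    long x = 0;
    for (long i = 0; i < iters; ++i) add_plain(x, 1);
    return x;
}

long sum_atomic(long iters) {
    std::atomic<long> x{0};
    for (long i = 0; i < iters; ++i) add_atomic(x, 1);
    return x.load();
}

// Contended case: only the atomic variant may be shared between threads.
long sum_atomic_threads(int nthreads, long iters) {
    std::atomic<long> x{0};
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&x, iters] {
            for (long i = 0; i < iters; ++i) add_atomic(x, 1);
        });
    for (auto& th : ts) th.join();
    return x.load();
}
```

Timing `sum_plain` against `sum_atomic` on a single thread is one rough way to see how close a given implementation gets to the ideal of equal uncontended cost.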

Hmm... as I write this I may have realized that IBM z-Series's strong 
memory ordering model, which some people say is stronger[*] than Intel's 
ostensible TSO, may have made it easier for IBM to achieve this.  Or, 
rather, perhaps just as hard, but necessary to be IBM mainframe compatible.

(Note [*]: Many people have said that IBM's memory ordering model is 
really strong ordering, but when I read the Principles of Operation I 
don't see this - I think weaker models like TSO are permitted according 
to my reading of the POP.)

Anyway...

Prior to P6, locked operations (a) bypassed the cache, (b) asserted a 
bus lock signal, (c) were done non-speculatively (although most things 
were at that time), and (d) drained the store buffers. All of these 
things made a locked RMW considerably more expensive.

P6 started doing locked RMWs in the cache, using the cache protocol. 
Instead of locking the bus, it would delay snoops while the locked RMW 
was going on.  P6 also made the locked RMW fencing - no later loads 
could pass it, the store could not be queued up. This was required 
because of Intel's memory ordering model - and if IBM has a higher 
performance implementation to cope with its stronger memory ordering 
model, they may already have solved that problem.

(Actually, it would have been correct to allow later loads to pass the 
load-locked, and rely on the MOB snooping (aka Frey rule consistency - 
Brad Frey is from IBM) to detect problems.  P6 chose not to do this 
largely because of FUD - justified, as it turned out, by some serious 
bugs that making the RMW fencing worked around. Furthermore, P6 did not 
do the load part of the atomic RMW speculatively.  These amounted to 
making locked RMWs considerably more expensive than unlocked, even if no 
contention.)
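The software-visible effect of a fencing locked RMW can be sketched with the classic store-buffering litmus test. This is an illustration in portable C++, not a description of P6 internals. Each thread does a sequentially consistent exchange (on x86 this is typically an XCHG, which is implicitly locked) and then loads the other variable. If the later load could pass the RMW, both threads could read 0; the C++ memory model forbids that outcome for seq_cst operations. The function name is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Returns how many of `iters` runs observed r1 == 0 && r2 == 0.
int count_both_zero(int iters) {
    int both_zero = 0;
    for (int i = 0; i < iters; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.exchange(1, std::memory_order_seq_cst);  // atomic RMW on x
            r1 = y.load(std::memory_order_seq_cst);    // later load must not pass it
        });
        std::thread t2([&] {
            y.exchange(1, std::memory_order_seq_cst);  // atomic RMW on y
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With plain stores in place of the exchanges, a TSO machine with store buffers is allowed to produce the both-zero result, which is why the RMW's fencing behaviour matters here.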

Over the years these constraints have slowly been removed. I can't give 
a full timeline, but they had not completely gone away when I left 
Intel.  I would be surprised, but happy, if they had been completely 
eliminated by SandyBridge and/or IvyBridge - or, for that matter, by 
Haswell.

Allowing later loads to pass the locked RMW is only the first step.

A more interesting question is whether to do the load-locked part early, 
speculatively, or not.

If no contention, then doing the load-locked speculatively is not a 
problem.  However, if the way you are handling things is to block snoops 
to the cache line(s) involved between the load-locked and store-unlocked 
parts of the RMW, then doing the load-locked early rather than at the 
last minute increases the window in which the line is locked - thereby 
increasing the window of time for contention.  That window can 
potentially be hundreds of cycles, rather than the 1-10 cycles the line 
might be locked if done non-speculatively. Equivalently, if you cancel 
or redo the load-locked via an expensive mechanism whenever there is 
contention, then increasing the window is a bad thing.  A nice approach 
is to prefetch, and do the 
load-locked at the last minute... but that loses some of the benefit of 
doing the load-locked speculatively.

Probably the best thing is a combination of prefetch, doing the 
load-locked speculatively, combined with an efficient redo/replay 
mechanism. (Not "nuke the machine and start over"). Note that it is not 
uncommon for there to be operations dependent on the load-locked value.
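A loose software analogy for "read early, redo cheaply on contention" is a compare-exchange retry loop. This is only an analogy under stated assumptions: it is not how any particular Intel or IBM pipeline implements a locked RMW, and the helper names are invented for the sketch. The early read is done without holding anything, and a conflict costs one more trip around the loop rather than a full restart.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Add v to x using a compare-exchange retry loop; returns the retry count.
long add_with_retry(std::atomic<long>& x, long v) {
    long retries = 0;
    long seen = x.load(std::memory_order_relaxed);  // early, "speculative" read
    // Only the compare-exchange itself needs exclusive access to the line.
    while (!x.compare_exchange_weak(seen, seen + v,
                                    std::memory_order_seq_cst,
                                    std::memory_order_relaxed)) {
        ++retries;  // cheap replay: `seen` was refreshed by the failed exchange
    }
    return retries;
}

// Several threads hammer one counter; the final value must still be exact.
long run_contended(int nthreads, long iters) {
    std::atomic<long> x{0};
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&x, iters] {
            for (long i = 0; i < iters; ++i) add_with_retry(x, 1);
        });
    for (auto& th : ts) th.join();
    return x.load();
}
```

Operations that depend on the loaded value correspond to whatever is computed from `seen` inside the loop; they get recomputed on each retry, which is the part that needs to be cheap.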

Apart from all of that stuff, for Intel the putative requirement to 
drain the store buffer before the store-unlock is an impediment even in 
the uncontended case. However, IBM's putative strong ordering model 
requires them to behave as if this is done on all stores, and before all 
loads.  So this is the place where locked RMWs may be relatively slower 
at Intel than IBM. I say putative, because there are several techniques 
to allow store buffering even on SC (e.g. you can buffer stores if 
you already own the lines; you can even buffer if you don't own the 
lines, so long as you can make certain guarantees about keeping other 
processors out (which is hard to do without deadlock), and/or starting 
from a checkpoint corresponding to a store in the buffer (which amounts 
to saying that you aren't really buffering after architectural commit)).
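Why draining the store buffer matters for ordering can be shown with a toy model. The sketch below is an assumption-laden simplification: two cores, each with a FIFO store buffer in front of a shared memory map, and one fixed interleaving. The `Core` type and `sb_litmus` function are hypothetical names, not taken from any real simulator.

```cpp
#include <deque>
#include <map>
#include <utility>

// Toy core with a FIFO store buffer in front of shared memory.
struct Core {
    std::deque<std::pair<int, int>> buf;  // pending (address, value) stores

    void store(int addr, int val) { buf.emplace_back(addr, val); }

    // Forward from the youngest matching buffered store, else read memory.
    int load(const std::map<int, int>& mem, int addr) const {
        for (auto it = buf.rbegin(); it != buf.rend(); ++it)
            if (it->first == addr) return it->second;
        auto m = mem.find(addr);
        return m == mem.end() ? 0 : m->second;  // unwritten locations read as 0
    }

    // Make all buffered stores globally visible, oldest first.
    void drain(std::map<int, int>& mem) {
        for (const auto& s : buf) mem[s.first] = s.second;
        buf.clear();
    }
};

// Store-buffering litmus: core0 does x=1; r0=y.  core1 does y=1; r1=x.
// drain_before_load models "drain the store buffer before the next load".
std::pair<int, int> sb_litmus(bool drain_before_load) {
    const int X = 0, Y = 1;
    std::map<int, int> mem;
    Core c0, c1;
    c0.store(X, 1);
    c1.store(Y, 1);
    if (drain_before_load) { c0.drain(mem); c1.drain(mem); }
    int r0 = c0.load(mem, Y);
    int r1 = c1.load(mem, X);
    c0.drain(mem);
    c1.drain(mem);
    return {r0, r1};
}
```

In this model `sb_litmus(false)` yields (0, 0), the outcome that TSO permits and SC forbids, while `sb_litmus(true)` yields (1, 1). The buffering techniques listed above are ways of keeping the buffer while still never letting the (0, 0) result become visible.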

Anyway, while it is possible to make a locked atomic RMW as efficient as 
an unlocked RMW, particularly in the no-contention case, it has taken 
Intel two decades to get there.  I would be a bit surprised if IBM got 
there right away, on what appears to be their first OOO mainframe in a 
long time.  But, I know, I know - IBM has lots and lots of history. 
Still, I am willing to place a small bet that my conjecture is correct.

