| Message-ID | <4F5777FD.4090505@SPAM.comp-arch.net> |
|---|---|
| Date | 2012-03-07 07:00 -0800 |
| From | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
| Organization | comp-arch.net |
| Newsgroups | comp.arch |
| Subject | Re: Ne(ish) IBM z196 synchronization instructions |
| References | <4F56E21F.2010903@SPAM.comp-arch.net> <9a6b0bf4-9cbc-4dc9-a2dd-9aa34f078f65@d17g2000vba.googlegroups.com> |
On 3/7/2012 5:43 AM, Paul A. Clayton wrote:
> On Mar 6, 11:20 pm, "Andy (Super) Glew"<a...@SPAM.comp-arch.net>
> wrote:
> [snip]
>> It is a bit unusual to be able to extend an existing instruction in this
>> way. If these existing instructions were in wide use, one would expect
>> existing code to be somewhat slowed down.
>> However (1) I suspect the trend towards simpler OOO instructions has
>> already made these "add to memory" instructions slower,
>> while on the other hand (2) advanced implementations make such atomic
>> instructions negligibly slower than their non-atomic versions.
>
> Why would guaranteeing the atomicity of aligned add immediate
> to memory make such noticeably slower in the uncontended case
> which would be safe without atomicity?

It doesn't need to, as I think I noted.

Ideally, an uncontended atomic RMW should have exactly the same performance as a non-atomic RMW. However, Intel has only recently begun approaching that ideal, even though I started down that path circa 1991.

Hmm... as I write this I may have realized that IBM z-Series's strong memory ordering model, which some people say is stronger[*] than Intel's ostensible TSO, may have made it easier for IBM to achieve this. Or, rather, perhaps just as hard, but necessary to be IBM mainframe compatible.

(Note [*]: Many people have said that IBM's memory ordering model is really strong ordering, but when I read the Principles of Operation I don't see this - I think weaker models like TSO are permitted according to my reading of the POP.)

Anyway...

Prior to P6, locked operations
(a) bypassed the cache,
(b) asserted a bus lock signal,
(c) for that matter, were done non-speculatively, although most things were at that time,
(d) drained store buffers.

All of these things made a locked RMW considerably more expensive.

P6 started doing locked RMWs in the cache, using the cache protocol. Instead of locking the bus, it would delay snoops while the locked RMW was going on.
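A minimal sketch of the two kinds of RMW being compared, assuming C++ on x86, where `fetch_add` on a `std::atomic<long>` is normally compiled to a `lock`-prefixed instruction and the plain increment to an ordinary load/add/store or add-to-memory. The helper names (`plain_rmw`, `locked_rmw`, `ns_per_op`, `report`) are hypothetical, and the numbers it prints depend entirely on the microarchitecture; it only measures the uncontended, single-thread case.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdio>

// Hypothetical helper: n non-atomic read-modify-writes to one location.
// 'volatile' only stops the compiler from collapsing the loop into a
// single add; it provides no atomicity and no ordering.
long plain_rmw(long n) {
    volatile long counter = 0;
    for (long i = 0; i < n; ++i) {
        counter = counter + 1;
    }
    return counter;
}

// Hypothetical helper: n atomic RMWs with no other thread touching the
// line, i.e. the uncontended case discussed above.
long locked_rmw(long n) {
    std::atomic<long> counter{0};
    for (long i = 0; i < n; ++i) {
        counter.fetch_add(1, std::memory_order_seq_cst);
    }
    return counter.load();
}

// Time one helper and return nanoseconds per operation.
template <typename F>
double ns_per_op(F f, long n) {
    auto t0 = std::chrono::steady_clock::now();
    long result = f(n);
    auto t1 = std::chrono::steady_clock::now();
    assert(result == n);  // both variants must count correctly
    (void)result;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}

// Print both measurements side by side.
void report(long n) {
    std::printf("plain : %.2f ns/op\n", ns_per_op(plain_rmw, n));
    std::printf("locked: %.2f ns/op\n", ns_per_op(locked_rmw, n));
}
```

Calling `report(100000000)` from `main` gives a rough per-operation cost for each variant; the gap between the two lines is the uncontended overhead of the locked form on that particular machine.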
P6 also made the locked RMW fencing - no later loads could pass it, and the store could not be queued up. This was required because of Intel's memory ordering model - and if IBM has a higher performance implementation to cope with its stronger memory ordering model, they may already have solved that problem.

(Actually, it would have been correct to allow later loads to pass the load-locked, and rely on the MOB snooping (aka Frey rule consistency - Brad Frey is from IBM) to detect problems. P6 chose not to do this largely because of FUD - justified, as it turned out, by some serious bugs that making the RMW fencing worked around. Furthermore, P6 did not do the load part of the atomic RMW speculatively. These amounted to making locked RMWs considerably more expensive than unlocked, even with no contention.)

Over the years these constraints have slowly been removed. I can't give a full timeline, but they had not completely gone away when I left Intel. I would be surprised, but happy, if they had been completely eliminated by SandyBridge and/or IvyBridge - or, for that matter, by Haswell.

Allowing later loads to pass the locked RMW is only the first step.

A more interesting question is whether to do the load-locked part early, speculatively, or not.

If there is no contention, then doing the load-locked speculatively is not a problem.

However, if the way you are handling things is to block snoops to the cache line(s) involved between the load-locked and store-unlocked parts of the RMW, then doing the load-locked early rather than at the last minute increases the window in which the line is locked - thereby increasing the window of time for contention. That window can potentially be hundreds of cycles, rather than the 1-10 cycles the line might be locked if the load-locked is done non-speculatively.

Equivalently, if you cancel or redo the load-locked via an expensive mechanism on contention, then increasing the window is a bad thing.

A nice approach is to prefetch, and do the load-locked at the last minute... but that loses some of the benefit of doing the load-locked speculatively.

Probably the best thing is a combination of prefetch, doing the load-locked speculatively, and an efficient redo/replay mechanism (not "nuke the machine and start over").

Note that it is not uncommon for there to be operations dependent on the load-locked value.

Apart from all of that stuff, for Intel the putative requirement to drain the store buffer before the store-unlock is an impediment even in the uncontended case. However, IBM's putative strong ordering model requires them to behave as if this is done on all stores, and before all loads. So this is the place where locked RMWs may be relatively slower at Intel than IBM.

I say putative, because there are several techniques to allow store buffering even on SC (e.g. you can buffer stores if you already own the lines; you can even buffer if you don't own the lines, so long as you can make certain guarantees about keeping other processors out (which is hard to do without deadlock), and/or starting from a checkpoint corresponding to a store in the buffer (which amounts to saying that you aren't really buffering after architectural commit)).

Anyway, while it is possible to make a locked atomic RMW as efficient as an unlocked RMW, particularly in the no-contention case, it has taken Intel two decades to get there. I would be a bit surprised if IBM got there right away, on what appears to be their first OOO mainframe in a long time. But, I know, I know - IBM has lots and lots of history. Still, I am willing to place a small bet that my conjecture is correct.
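The "do the load early, validate late, replay cheaply on contention" idea has a loose software analogue in a compare-exchange retry loop. The sketch below is only that analogy, written in C++ with `std::atomic`; it is not a description of how P6, later Intel cores, or the z196 implement locked RMWs, and the helper name `add_with_replay` and its `replays` counter are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: an atomic add built from an early, unlocked load
// plus a short validating compare-exchange. Returns the value observed
// before the add; counts replays in *replays when a pointer is supplied.
long add_with_replay(std::atomic<long>& cell, long delta,
                     long* replays = nullptr) {
    // Early load: nothing is held here, so other threads are not blocked
    // while the dependent work below is computed.
    long seen = cell.load(std::memory_order_relaxed);
    for (;;) {
        long wanted = seen + delta;  // work that depends on the loaded value
        // Validate and store in one short atomic step. If another thread
        // changed the cell in the meantime, 'seen' is refreshed with the
        // current value and only the dependent work is redone.
        if (cell.compare_exchange_weak(seen, wanted,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
            return seen;  // unchanged on success, so this is the old value
        }
        if (replays) {
            ++*replays;  // contention (or a spurious failure) forced a replay
        }
    }
}
```

The window in which other threads are held off is limited to the compare-exchange itself, and a failed validation redoes only the work that depended on the loaded value, which is the software counterpart of a replay mechanism cheaper than starting over.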