Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.arch > #6282 > unrolled thread
| Started by | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
|---|---|
| First post | 2012-03-06 20:20 -0800 |
| Last post | 2012-03-09 13:10 -0800 |
| Articles | 15 — 7 participants |
Back to article view | Back to comp.arch
Ne(ish) IBM z196 synchronization instructions "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-03-06 20:20 -0800
Re: Ne(ish) IBM z196 synchronization instructions "Paul A. Clayton" <paaronclayton@gmail.com> - 2012-03-07 05:43 -0800
Re: Ne(ish) IBM z196 synchronization instructions "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-03-07 07:00 -0800
Re: Ne(ish) IBM z196 synchronization instructions nmm1@cam.ac.uk - 2012-03-07 15:25 +0000
Re: Ne(ish) IBM z196 synchronization instructions "Paul A. Clayton" <paaronclayton@gmail.com> - 2012-03-09 11:05 -0800
Re: Ne(ish) IBM z196 synchronization instructions Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-03-07 20:00 +0100
Re: Ne(ish) IBM z196 synchronization instructions MitchAlsup <MitchAlsup@aol.com> - 2012-03-07 13:43 -0800
Re: Ne(ish) IBM z196 synchronization instructions "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-03-08 09:08 -0800
Re: Ne(ish) IBM z196 synchronization instructions Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-03-08 18:39 +0100
Re: Ne(ish) IBM z196 synchronization instructions Nomen Nescio <nobody@dizum.com> - 2012-03-08 22:33 +0100
Re: Ne(ish) IBM z196 synchronization instructions "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-03-08 17:17 -0800
Re: Ne(ish) IBM z196 synchronization instructions MitchAlsup <MitchAlsup@aol.com> - 2012-03-09 08:19 -0800
Re: Ne(ish) IBM z196 synchronization instructions Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-03-09 17:45 +0100
Re: Ne(ish) IBM z196 synchronization instructions jgk@panix.com (Joe keane) - 2012-03-09 20:30 +0000
Re: Ne(ish) IBM z196 synchronization instructions MitchAlsup <MitchAlsup@aol.com> - 2012-03-09 13:10 -0800
| From | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
|---|---|
| Date | 2012-03-06 20:20 -0800 |
| Subject | Ne(ish) IBM z196 synchronization instructions |
| Message-ID | <4F56E21F.2010903@SPAM.comp-arch.net> |
http://semipublic.comp-arch.net/wiki/Interlocked-access_facility = New IBM z196 synchronization instructions = The IBM z196 adds new [[synchronization instructions]] to the [[IBM System z Mainframe ISA]]. Augmenting the legacy instructions * [[COMPARE AND SWAP]] * [[PERFORM LOCKED OPERATION]] * [[TEST AND SET]] The reference [share] comments "there is no need for a COMPARE AND SWAP loop to perform these operations!" - their exclamation mark! This suggests the motivation for these instructions - while [[compare and swap]] is one of the most powerful synchronization instructions, it is not necessarily efficient. Atomic operations such as [[atomic add to memory]] can perform in one instruction, without a loop, things that would require looping for [[compare and swap]], [[test-and-set]], and [[load-linked store-conditional]]. Looping that may require special mechanisms to guarantee forward progress. The z196 [[interlocked-access facility]] instructions include New atomic instructions: * [[LOAD AND ADD]] * [[LOAD AND ADD LOGICAL]] * [[LOAD AND AND]] * [[LOAD AND EXCLUSIVE OR]] * [[LOAD AND OR]] <UL> The "LOAD AND ..." part of these instructions' names is a bit misleading. These are really just [[fetch-and-op]] instructions to memory. Fetching the old value, performing the operation with a register input, storing the new value so produced, and returning the old value in a register. Flavours: add signed/unsigned, logical and/or/xor, 32/64 bits wide. </UL> An interesting instruction that I feel must be called [[pseudo-atomic]], in much the same way [[LL/SC]] is [[pseudo-atomic]]: * [[LOAD PAIR DISJOINT]] <UL> [[LOAD PAIR DISJOINT]] loads two separate, non-contiguous, memory locations (each of which must be naturally aligned), into an [[even/odd register pair]]. It sets condition codes to indicate whether the fetch was atomic, or whether some other CPU or channel managed to sneak in a store between them. (The language in the [share] presentation suggests that the condition codes are only set if a store actually occurred, not "may have occurred" - since if thye latter, a correct implementation might always returned "failed to be atomic".) GLEW COMMENT: although undoubtedly most of the time such actions are atomic, it is not clear to me that there is any guarantee of forward progress, for a loop around [[LOAD PAIR DISJOINT]] that spins until atomicity is observed. I almost suspect that there may be plans to eventually provide some such guarantees, but that the detail of such guarantees does not want to be cast in concrete architecture at the time of introduction. </UL> In addition, the existing instructions that perform [[add immediate to memory]], in signed/unsigned and 32b/64b forms, are declared to be atomic when "the storage operand is aligned on an [[integral boundary]]" - IBM speak for [[natural alignment]]. * [[ADD IMMEDIATE]] * [[ADD LOGICAL WITH SIGNED IMMEDIATE]] <UL> GLEW COMMENT: it is obviously desirable to be able to have a fetch-and-add immediate to memory, to avoid having to blow a register on the operand for the atomic modification. It is a bit unusual to be able to extend an existing instruction in this way. If these existing instructions were in wide use, one would expect existing code to be somewhat slowed down. However (1) I suspect the trend towards simpler OOO instructions has already made these "add to memory" instructions slower, while on the other hand (2) advanced implementations make such atomic instructions neglibly slower than their non-atomic versions. </UL> = [[Atomicity is always relative]] = IBM literature says: "as observed by other CPUs and the channel subsystem, ... appear to occur atomically". IBM also uses the phrase "block-concurrent interlocked update". "Block concurrent" is a special IBM term related to memory ordering, that says that all bytes are accessed atomically as observed by other CPUs. However, they may be observed a byte at a time by channel programs... but "interlocked" means that channel program requests cannot be inserted between the bytes. Bottom line: atomixc wrt CPUs and I/O channels. = Reference = Many references, scattered. ;[share] : New CPU Facilities in the IBM zEnterprise 196, share.confex.com, 4 August 2010, http://share.confex.com/share/115/webprogram/Handout/Session7034/New%20CPU%20Facilities%20in%20the%20z196.pdf
[toc] | [next] | [standalone]
| From | "Paul A. Clayton" <paaronclayton@gmail.com> |
|---|---|
| Date | 2012-03-07 05:43 -0800 |
| Message-ID | <9a6b0bf4-9cbc-4dc9-a2dd-9aa34f078f65@d17g2000vba.googlegroups.com> |
| In reply to | #6282 |
On Mar 6, 11:20 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote: [snip] > It is a bit unusual to be able to extend an existing instruction in this > way. If these existing instructions were in wide use, one would expect > existing code to be somewhat slowed down. > However (1) I suspect the trend towards simpler OOO instructions has > already made these "add to memory" instructions slower, > while on the other hand (2) advanced implementations make such atomic > instructions neglibly slower than their non-atomic versions. Why would guaranteeing the atomicity of aligned add immediate to memory make such noticeably slower in the uncontended case which would be safe without atomicity? In the contended case (where the exclusivity is lost between issue and commit or not acquired first [if the load was speculatively performed with a shared state] and another agent wrote to the cache block before exclusivity was acquired and the store commited), replay might be used, perhaps with a NAK response (if the commit delay is sufficiently short, possibly delaying replay until the load can issue in-order) to hold off contending accesses. The contended case (which should not be present in existing software that lacks the atomicity guarantee?) might be a little slower (having to reload the value and perform the addition but the store commit would already have been delayed by the read-for-ownership operation), but even this slow-down might not be especially bad. (Such would seem to add complexity, but it seems that the complexity would not be on a major critical path/loop.) A side-effect might be that such could facilitate porting some single-threaded applications to multithreading?
[toc] | [prev] | [next] | [standalone]
| From | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
|---|---|
| Date | 2012-03-07 07:00 -0800 |
| Message-ID | <4F5777FD.4090505@SPAM.comp-arch.net> |
| In reply to | #6287 |
On 3/7/2012 5:43 AM, Paul A. Clayton wrote: > On Mar 6, 11:20 pm, "Andy (Super) Glew"<a...@SPAM.comp-arch.net> > wrote: > [snip] >> It is a bit unusual to be able to extend an existing instruction in this >> way. If these existing instructions were in wide use, one would expect >> existing code to be somewhat slowed down. >> However (1) I suspect the trend towards simpler OOO instructions has >> already made these "add to memory" instructions slower, >> while on the other hand (2) advanced implementations make such atomic >> instructions neglibly slower than their non-atomic versions. > > Why would guaranteeing the atomicity of aligned add immediate > to memory make such noticeably slower in the uncontended case > which would be safe without atomicity? It doesn't need to, as I think I noted. Ideally, an uncontended atomic RMW should have exactly the same performance as a non-atomic RMW. However, Intel has only recently begun approaching that ideal, even though I started down that path circa 1991. Hmm... as I write this I may have realized that IBM z-Series's strong memory ordering model, which some people say is stronger[*] than Intel's ostensible TSO, may have made it easier for IBM to acheive this. Or, rather, perhaps just as hard, but necessary to be IBM mainframe compatible. (Note [*]: Many people have said that IBM's memory ordering model is really strong ordering, but when I read the Principles of Operation I don't see this - I think weaker models like TSO are permitted according to my reading of the POP.) Anyway... Prior to P6, locked operations (a) bypassed the cache, (b) asserted a bus lock signal. For that matter, (c) were done non-speculatively, although most things were at that time. (d) drained store buffers. All of these things made a locked RMW considerably more expensive. P6 started doing locked RMWs in the cache, using the cache protocol. Instead of locking the bus, it would delay snoops while the locked RMW was going on. P6 also made the locked RMW fencing - no later loads could pass it, the store could not be queued up. This was required because of Intel's memory ordering model - and if IBM has a higher performance implementation to cope with its stronger memory ordering model, they may already have solved that problem. (Actually, it would have been correct to allow later loads to pass the load-locked, and rely on the MOB snooping (aka Frey rule consistency - Brad Frey is from IBM) to detect problems. P6 chose not to do this largely because of FUD - justified, as it turned out, by some serious bugs that making the RMW fencing worked around. Furthermore, P6 did not do the load part of the atomic RMW speculatively. These amounted to making locked RMWs considerably more expensive than unlocked, even if no contention.) Over the years these constraints have slowly been removed. I can't give a full timeline, but they had not completely gone away when I left Intel. I would be surprised, but happy, if they had been completely eliminated by SandyBridge and/or IvyBridge - or, for that matter, by Haswell. Allowing later loads to pass the locked RMW is only the first step. A more interesting question is whether to do the load-locked part early, speculatively, or not. If no contention, then doing the load-locked speculatively is not a problem. However, if the way you are handling things is to block snoops to the cache line(s) involved between the load-locked and store-unlocked parts of the RMW, then doing the load-locked early rather than at the last minute increases the window in which the line is locked - thereby increasing the window of time for contention. Which can potentially be hundreds of cycles, rather than the 1-10 cycles it might be locked if done non-speculatively. Equivalently, if you cancel or redo the load-locked via an expensive mechanism if contention, then increasing the window is a bad thing. A nice approach is to prefetch, and do the load-locked at the last minute... but that loses some of the benefit of doing the load-locked speculatively. Probably the best thing is a combination of prefetch, doing the load-locked speculatively, combined with an efficient redo/replay mechanism. (Not "nuke the machine and start over"). Note that it is not uncommon for there to be operations dependent on the load-locked value. Apart from all of that stuff, for Intel the putative requirement to drain the store buffer before the store-unlock is an impediment even in the uncontended case. However, IBM's putative strong ordering model requires them to behave as if this is done on all stores, and before all loads. So this is the place where locked RMWs may be relatively slower at Intel than IBM. I say putative, because there are several techniques to to allow store buffering even on SC (e.g. you can buffer stores if you already own the lines; you can even buffer if you don't own the lines, so long as you can make certain guarantees about keeping other processors out (which is hard to do without deadlock), and/or starting from a checkpoint corresponding to a store in the buffer (which amounts to saying that you aren't really buffering after architectural commit). Anyway, while it is possible to make a locked atomic RMW as efficient as an unlocked RMW, particularly in the no-contention case, it has taken Intel two decades to get there. I would be a bit surprised if IBM got there right away, on what appears to be their first OOO mainframe in a long time. But, I know, I know - IBM has lots and lots of history. Still, I am willing to place a small bet that my conjecture is correct.
[toc] | [prev] | [next] | [standalone]
| From | nmm1@cam.ac.uk |
|---|---|
| Date | 2012-03-07 15:25 +0000 |
| Message-ID | <jj7ukn$pie$1@gosset.csi.cam.ac.uk> |
| In reply to | #6288 |
In article <4F5777FD.4090505@SPAM.comp-arch.net>, Andy (Super) Glew <andy@SPAM.comp-arch.net> wrote: >> >> Why would guaranteeing the atomicity of aligned add immediate >> to memory make such noticeably slower in the uncontended case >> which would be safe without atomicity? > >It doesn't need to, as I think I noted. Ideally, an uncontended atomic >RMW should have exactly the same performance as a non-atomic RMW. > >However, Intel has only recently begun approaching that ideal, even >though I started down that path circa 1991. > >Hmm... as I write this I may have realized that IBM z-Series's strong >memory ordering model, which some people say is stronger[*] than Intel's >ostensible TSO, may have made it easier for IBM to acheive this. Or, >rather, perhaps just as hard, but necessary to be IBM mainframe compatible. > >(Note [*]: Many people have said that IBM's memory ordering model is >really strong ordering, but when I read the Principles of Operation I >don't see this - I think weaker models like TSO are permitted according >to my reading of the POP.) A brief check of it leaves me none the wiser - i.e. I don't think that it says anything, which leaves the question up in the air. If I have missed something in it, a page reference would be useful! My analysis of this problem is that it is just beginning to bite, because I can't think of a way where IRIW can be implemented both scalably and efficiently. Atomicity is easy; global consistency is hard. Regards, Nick Maclaren.
[toc] | [prev] | [next] | [standalone]
| From | "Paul A. Clayton" <paaronclayton@gmail.com> |
|---|---|
| Date | 2012-03-09 11:05 -0800 |
| Message-ID | <cfa29bc2-261d-42bb-b928-a2f4430f2425@do4g2000vbb.googlegroups.com> |
| In reply to | #6288 |
On Mar 7, 10:00 am, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote: > On 3/7/2012 5:43 AM, Paul A. Clayton wrote: > > > On Mar 6, 11:20 pm, "Andy (Super) Glew"<a...@SPAM.comp-arch.net> > > wrote: > > [snip] > >> It is a bit unusual to be able to extend an existing instruction in this > >> way. If these existing instructions were in wide use, one would expect > >> existing code to be somewhat slowed down. > >> However (1) I suspect the trend towards simpler OOO instructions has > >> already made these "add to memory" instructions slower, > >> while on the other hand (2) advanced implementations make such atomic > >> instructions neglibly slower than their non-atomic versions. > > > Why would guaranteeing the atomicity of aligned add immediate > > to memory make such noticeably slower in the uncontended case > > which would be safe without atomicity? > > It doesn't need to, as I think I noted. Ideally, an uncontended atomic > RMW should have exactly the same performance as a non-atomic RMW. I guess I was interpreting "advanced implementations" as a higher barrier. "Advanced" is usually dependent on the advancement of generally available knowledge (and tools) and on the intellectual and other resources of the implementers. Advanced for IBM (or Intel) in 2012 is different from advanced for a college freshman in 2012 (much less in 1985). I was thinking the former (although in context, it would be more IBM in about 2008--guessing that such would have been about the cut off for reasonably committing to the atomic implementation of such for a 2010 product). (Of course, IBM is also more constrained by reliability demands for zSeries (and perhaps even POWER) than Intel is for even server x86. > However, Intel has only recently begun approaching that ideal, even > though I started down that path circa 1991. > > Hmm... as I write this I may have realized that IBM z-Series's strong > memory ordering model, which some people say is stronger[*] than Intel's > ostensible TSO, may have made it easier for IBM to acheive this. Or, > rather, perhaps just as hard, but necessary to be IBM mainframe compatible. I was thinking that a stronger memory consistency model might have allowed such to require less _additional_ effort, but I was not certain. (David Kanter's somewhat recent article about the z196 processor [ http://www.realworldtech.com/page.cfm?ArticleID=RWT010312153140 ] revealed something that surprised me: it-and the previous zSeries processor used a *write-through* L2. It is not clear that such an implementation detail might not have also made implementing relatively low-additional- overhead atomic RMW operations less difficult.) [snip] > Prior to P6, locked operations (a) bypassed the cache, (b) asserted a > bus lock signal. For that matter, (c) were done non-speculatively, > although most things were at that time. (d) drained store buffers. All > of these things made a locked RMW considerably more expensive. On the other hand, only locked RMW operations required such overhead (whereas even ordinary zSeries LOAD AND operate instructions would suffer the penalty, so the extra effort could be more justified [aside from the relative riches of transistors, knowledge, and tools available now versus then]). [snip lots of interesting implementation information] > Probably the best thing is a combination of prefetch, doing the > load-locked speculatively, combined with an efficient redo/replay > mechanism. (Not "nuke the machine and start over"). Note that it is not > uncommon for there to be operations dependent on the load-locked value. And the operations are probably often related to control dependencies (and these control dependencies might(??) be less predictable in the case of contention). [snip] > Anyway, while it is possible to make a locked atomic RMW as efficient as > an unlocked RMW, particularly in the no-contention case, it has taken > Intel two decades to get there. I would be a bit surprised if IBM got > there right away, on what appears to be their first OOO mainframe in a > long time. But, I know, I know - IBM has lots and lots of history. > Still, I am willing to place a small bet that my conjecture is correct. I am inclined to think that IBM did not add a substantial penalty for the no-contention case (even if such has more overhead than earlier, non-atomic implementations), but even using the prefetch and non-speculative RWM method might not have extreme overhead and might make reliability easier. I.e., you are probably correct (not "as efficient as an unlocked RMW"), but the penalty might not be that bad (and may be hidden by the other performance-improving features like OoO execution and on-chip L3 cache).
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <"terje.mathisen at tmsw.no"> |
|---|---|
| Date | 2012-03-07 20:00 +0100 |
| Message-ID | <5rnk29-riq.ln1@ntp6.tmsw.no> |
| In reply to | #6282 |
Andy (Super) Glew wrote: > http://semipublic.comp-arch.net/wiki/Interlocked-access_facility > > = New IBM z196 synchronization instructions = > > The IBM z196 adds new [[synchronization instructions]] to the [[IBM > System z Mainframe ISA]]. > > Augmenting the legacy instructions > * [[COMPARE AND SWAP]] > * [[PERFORM LOCKED OPERATION]] > * [[TEST AND SET]] > > The reference [share] comments "there is no need for a COMPARE AND SWAP > loop to perform these operations!" > - their exclamation mark! > This suggests the motivation for these instructions > - while [[compare and swap]] is one of the most powerful synchronization > instructions, > it is not necessarily efficient. > Atomic operations such as [[atomic add to memory]] can perform in one > instruction, > without a loop, things that would require looping for [[compare and > swap]], [[test-and-set]], and [[load-linked store-conditional]]. > Looping that may require special mechanisms to guarantee forward progress. As you've seen here many times, I really like fetch&add (XADD), it is pretty much the best I can do without specialized hw that both guarantees forward progress and automatically prioritizes the threads/processes that didn't "win" on the first attempt. (Hello Mitch!) > The z196 [[interlocked-access facility]] instructions include > > New atomic instructions: > * [[LOAD AND ADD]] > * [[LOAD AND ADD LOGICAL]] > * [[LOAD AND AND]] > * [[LOAD AND EXCLUSIVE OR]] > * [[LOAD AND OR]] I have some problem figuring out the intended usage for the AND/OR/XOR versions of this operation: The key benefit of XADD is that N threads can use it at once, each of them getting a separate return value that they can use to select what to do next, i.e. pick entry N from the work-to-do list. Doing the same with a logical operation would seem to negate this advantage, you lose the ability to get unique results for each process? Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-03-07 13:43 -0800 |
| Message-ID | <1432328.1752.1331156586117.JavaMail.geo-discussion-forums@ynkz21> |
| In reply to | #6292 |
On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote:
> I have some problem figuring out the intended usage for the AND/OR/XOR
> versions of this operation:
XOR is useful for setting bits and resetting bits (from the same constant).
So one can "check in their request" and "checkout their request" using the same value. Otherwise one has to ADD a bit to check in a request and then ADD in the 2's complement to check it back out (or OR on a bit and AND-NOT it back off).
Most of the statemachine logic in my simulators uses XOR as in:
if( p->state & STATE1 )
p->state ^= STATE1 | STATE2;
else
// where it is known STATE1 and STATE2 have no common bits
// but may have more than one bit set.
Mitch
[toc] | [prev] | [next] | [standalone]
| From | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
|---|---|
| Date | 2012-03-08 09:08 -0800 |
| Message-ID | <4F58E783.2080205@SPAM.comp-arch.net> |
| In reply to | #6295 |
On 3/7/2012 1:43 PM, MitchAlsup wrote:
> On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote:
>> I have some problem figuring out the intended usage for the AND/OR/XOR
>> versions of this operation:
>
> XOR is useful for setting bits and resetting bits (from the same constant).
>
> So one can "check in their request" and "checkout their request" using the same value. Otherwise one has to ADD a bit to check in a request and then ADD in the 2's complement to check it back out (or OR on a bit and AND-NOT it back off).
>
> Most of the statemachine logic in my simulators uses XOR as in:
>
> if( p->state& STATE1 )
> p->state ^= STATE1 | STATE2;
> else
>
> // where it is known STATE1 and STATE2 have no common bits
> // but may have more than one bit set.
>
> Mitch
Right. But do you need that to be atomic?
Can you have multiple threads using the same p->state at the same time?
Setting the same bits?
Or possibly setting disjoint bits in the same word?
To get XOR the safely set and clear different bits, you have to know the
prior state. Perhaps and AND-OR instruction?
AND-OR
tmp := mem
tmp &= and_mask (or ~and_mask)
tmp |= or_mask
mem := tmp
A nice instruction, but it really wants two immediates.
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <"terje.mathisen at tmsw.no"> |
|---|---|
| Date | 2012-03-08 18:39 +0100 |
| Message-ID | <2f7n29-nmt.ln1@ntp6.tmsw.no> |
| In reply to | #6304 |
Andy (Super) Glew wrote: > On 3/7/2012 1:43 PM, MitchAlsup wrote: >> On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote: >>> I have some problem figuring out the intended usage for the AND/OR/XOR >>> versions of this operation: >> >> XOR is useful for setting bits and resetting bits (from the same >> constant). >> >> So one can "check in their request" and "checkout their request" using >> the same value. Otherwise one has to ADD a bit to check in a request >> and then ADD in the 2's complement to check it back out (or OR on a >> bit and AND-NOT it back off). >> >> Most of the statemachine logic in my simulators uses XOR as in: >> >> if( p->state& STATE1 ) >> p->state ^= STATE1 | STATE2; >> else >> >> // where it is known STATE1 and STATE2 have no common bits >> // but may have more than one bit set. >> >> Mitch > > > Right. But do you need that to be atomic? > > Can you have multiple threads using the same p->state at the same time? > Setting the same bits? > > Or possibly setting disjoint bits in the same word? > > To get XOR the safely set and clear different bits, you have to know the > prior state. Perhaps and AND-OR instruction? > > AND-OR > tmp := mem > tmp &= and_mask (or ~and_mask) > tmp |= or_mask > mem := tmp > > A nice instruction, but it really wants two immediates. > Exactly, it is hard to figure out a nice usage scenario for the existing logical versions. If you want to share a bitmask between multiple processes, simultaneously updating disjoint sits of bits, it seems like a classic CAS is more natural. OTOH, if those bits really are disjoint, then it would be far better to store them in separate cache lines! A common mask only makes sense if there are many readers of the full mask. The intended use would seem to require two such operations for each update, one to turn off bits and another to turn on some other bits, making the combined update non-atomic. The alternative if each bit can only have a single writer would be to use the atomic XOR on the bits to be modified. (BTW, can you always turn such an XOR (when the current value of those bits are known) into an XADD of a magic value?) Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | Nomen Nescio <nobody@dizum.com> |
|---|---|
| Date | 2012-03-08 22:33 +0100 |
| Message-ID | <a6f6e132ebfe56cbb374b441865c7c9f@dizum.com> |
| In reply to | #6305 |
> > To get XOR the safely set and clear different bits, you have to know the > > prior state. Yes, this is it. Until now setting flag bits atomically required a macro like OIL, NIL (see IBM z/OS Assembler Services Reference). The macros required a work area and wrapped a CS around the logical operation. If I understand the issue (haven't gone through this thread carefully and haven't reviewed the z196 POPS but I have used OIL, NIL etc) then this allows the assembler coder to do this more compactly and simply than by invoking a macro. This functionality is required in a typical multiprocessing scenario where multiple tasks are testing and setting flags and where illogical combinations of flags must be avoided.
[toc] | [prev] | [next] | [standalone]
| From | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
|---|---|
| Date | 2012-03-08 17:17 -0800 |
| Message-ID | <4F595A1B.9050100@SPAM.comp-arch.net> |
| In reply to | #6305 |
On 3/8/2012 9:39 AM, Terje Mathisen wrote: > Andy (Super) Glew wrote: >> On 3/7/2012 1:43 PM, MitchAlsup wrote: >>> On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote: >>>> I have some problem figuring out the intended usage for the AND/OR/XOR >>>> versions of this operation: >>> >>> XOR is useful for setting bits and resetting bits (from the same >>> constant). >> To get XOR the safely set and clear different bits, you have to know the >> prior state. > > Exactly, it is hard to figure out a nice usage scenario for the existing > logical versions. Some simple checksums are simple XORs. Without shifting. This can be parallelized. Hmm, a one time pad may simply XOR. Or perhaps a step in breaking some crypto is applying lots of XORs...
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-03-09 08:19 -0800 |
| Message-ID | <15229834.761.1331309987212.JavaMail.geo-discussion-forums@ynbq18> |
| In reply to | #6305 |
On Thursday, March 8, 2012 11:39:13 AM UTC-6, Terje Mathisen wrote: > Andy (Super) Glew wrote: > > On 3/7/2012 1:43 PM, MitchAlsup wrote: > >> On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote: > >>> I have some problem figuring out the intended usage for the AND/OR/XOR > >>> versions of this operation: > >> > >> XOR is useful for setting bits and resetting bits (from the same > >> constant). > >> > >> So one can "check in their request" and "checkout their request" using > >> the same value. Otherwise one has to ADD a bit to check in a request > >> and then ADD in the 2's complement to check it back out (or OR on a > >> bit and AND-NOT it back off). > >> > >> Most of the statemachine logic in my simulators uses XOR as in: > >> > >> if( p->state& STATE1 ) > >> p->state ^= STATE1 | STATE2; > >> else > >> > >> // where it is known STATE1 and STATE2 have no common bits > >> // but may have more than one bit set. > >> > >> Mitch > > > > > > Right. But do you need that to be atomic? > > > > Can you have multiple threads using the same p->state at the same time? > > Setting the same bits? > > > > Or possibly setting disjoint bits in the same word? > > > > To get XOR the safely set and clear different bits, you have to know the > > prior state. Perhaps and AND-OR instruction? > > > > AND-OR > > tmp := mem > > tmp &= and_mask (or ~and_mask) > > tmp |= or_mask > > mem := tmp > > > > A nice instruction, but it really wants two immediates. > > > > Exactly, it is hard to figure out a nice usage scenario for the existing > logical versions. I have seen a parallel processing implementation that used the bit to denote which processor was requesting/waiting-for the lock (kernel only). As long as the number of processors is smaller than wordsize this works well. Mitch
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <"terje.mathisen at tmsw.no"> |
|---|---|
| Date | 2012-03-09 17:45 +0100 |
| Message-ID | <jlop29-ll21.ln1@ntp6.tmsw.no> |
| In reply to | #6314 |
MitchAlsup wrote: > On Thursday, March 8, 2012 11:39:13 AM UTC-6, Terje Mathisen wrote: >> Exactly, it is hard to figure out a nice usage scenario for the >> existing logical versions. > > I have seen a parallel processing implementation that used the bit to > denote which processor was requesting/waiting-for the lock (kernel > only). As long as the number of processors is smaller than wordsize > this works well. This dovetails nicely with the one scenario I did find: When you have lots of processes sharing a bitmap, but each individual bit can only have a single writer: In this case said owner can use AND to clear it, ORto set it and XOR to toggle it. In fact, in this scenario the owner will always have a local copy, so even without reading it first, the XOR version is sufficient for all cases where the bit must be modified. How is this (bus transaction-wise) different from a simple LOCK XOR [bitmask], my_bits_to_toggle ? You can also use XADD with +/- my_bits_to_toggle to get the same effect, can't you? Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | jgk@panix.com (Joe keane) |
|---|---|
| Date | 2012-03-09 20:30 +0000 |
| Message-ID | <jjdp9u$l77$1@reader1.panix.com> |
| In reply to | #6292 |
In article <5rnk29-riq.ln1@ntp6.tmsw.no>, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote: >I have some problem figuring out the intended usage for the AND/OR/XOR >versions of this operation: Maybe it takes more transistors to not implement them.
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-03-09 13:10 -0800 |
| Message-ID | <32946367.1627.1331327445478.JavaMail.geo-discussion-forums@ynne2> |
| In reply to | #6318 |
On Friday, March 9, 2012 2:30:54 PM UTC-6, Joe keane wrote: > Maybe it takes more transistors to not implement them. It probably takes more transistors in the decode stage of the pipeline to decode the new instructions that all the transistors used in the "performance" of the instruction at the memory. Mitch
[toc] | [prev] | [standalone]
Back to top | Article view | comp.arch
csiph-web