Groups > comp.arch > #6282 > unrolled thread

Ne(ish) IBM z196 synchronization instructions

Started by	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
First post	2012-03-06 20:20 -0800
Last post	2012-03-09 13:10 -0800
Articles	15 — 7 participants

Back to article view | Back to comp.arch

  Ne(ish) IBM z196 synchronization instructions "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-03-06 20:20 -0800
    Re: Ne(ish) IBM z196 synchronization instructions "Paul A. Clayton" <paaronclayton@gmail.com> - 2012-03-07 05:43 -0800
      Re: Ne(ish) IBM z196 synchronization instructions "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-03-07 07:00 -0800
        Re: Ne(ish) IBM z196 synchronization instructions nmm1@cam.ac.uk - 2012-03-07 15:25 +0000
        Re: Ne(ish) IBM z196 synchronization instructions "Paul A. Clayton" <paaronclayton@gmail.com> - 2012-03-09 11:05 -0800
    Re: Ne(ish) IBM z196 synchronization instructions Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-03-07 20:00 +0100
      Re: Ne(ish) IBM z196 synchronization instructions MitchAlsup <MitchAlsup@aol.com> - 2012-03-07 13:43 -0800
        Re: Ne(ish) IBM z196 synchronization instructions "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-03-08 09:08 -0800
          Re: Ne(ish) IBM z196 synchronization instructions Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-03-08 18:39 +0100
            Re: Ne(ish) IBM z196 synchronization instructions Nomen Nescio <nobody@dizum.com> - 2012-03-08 22:33 +0100
            Re: Ne(ish) IBM z196 synchronization instructions "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-03-08 17:17 -0800
            Re: Ne(ish) IBM z196 synchronization instructions MitchAlsup <MitchAlsup@aol.com> - 2012-03-09 08:19 -0800
              Re: Ne(ish) IBM z196 synchronization instructions Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-03-09 17:45 +0100
      Re: Ne(ish) IBM z196 synchronization instructions jgk@panix.com (Joe keane) - 2012-03-09 20:30 +0000
        Re: Ne(ish) IBM z196 synchronization instructions MitchAlsup <MitchAlsup@aol.com> - 2012-03-09 13:10 -0800

#6282 — Ne(ish) IBM z196 synchronization instructions

From	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Date	2012-03-06 20:20 -0800
Subject	Ne(ish) IBM z196 synchronization instructions
Message-ID	<4F56E21F.2010903@SPAM.comp-arch.net>

http://semipublic.comp-arch.net/wiki/Interlocked-access_facility

= New IBM z196 synchronization instructions =

The IBM z196 adds new [[synchronization instructions]] to the [[IBM 
System z Mainframe ISA]].

Augmenting the legacy instructions
* [[COMPARE AND SWAP]]
* [[PERFORM LOCKED OPERATION]]
* [[TEST AND SET]]

The reference [share] comments "there is no need for a COMPARE AND SWAP 
loop to perform these operations!"
- their exclamation mark!
This suggests the motivation for these instructions
- while [[compare and swap]] is one of the most powerful synchronization 
instructions,
it is not necessarily efficient.
Atomic operations such as [[atomic add to memory]] can perform in one 
instruction,
without a loop, things that would require looping for [[compare and 
swap]], [[test-and-set]], and [[load-linked store-conditional]].
Looping that may require special mechanisms to guarantee forward progress.

The z196 [[interlocked-access facility]] instructions include

New atomic instructions:
* [[LOAD AND ADD]]
* [[LOAD AND ADD LOGICAL]]
* [[LOAD AND AND]]
* [[LOAD AND EXCLUSIVE OR]]
* [[LOAD AND OR]]
<UL>
The "LOAD AND ..." part of these instructions' names is a bit misleading.
These are really just [[fetch-and-op]] instructions to memory.
Fetching the old value, performing the operation with a register input,
storing the new value so produced,
and returning the old value in a register.

Flavours: add signed/unsigned, logical and/or/xor, 32/64 bits wide.

</UL>

An interesting instruction that I feel must be called [[pseudo-atomic]], 
in much the same way [[LL/SC]] is [[pseudo-atomic]]:
* [[LOAD PAIR DISJOINT]]
<UL>
[[LOAD PAIR DISJOINT]] loads two separate, non-contiguous, memory 
locations (each of which must be naturally aligned),
into an [[even/odd register pair]].
It sets condition codes to indicate whether the fetch was atomic,
or whether some other CPU or channel managed to sneak in a store between 
them.
(The language in the [share] presentation suggests that the condition 
codes are only set if a store actually occurred,
not "may have occurred" - since if thye latter, a correct implementation 
might always returned "failed to be atomic".)

GLEW COMMENT: although undoubtedly most of the time such actions are 
atomic, it is not clear to me that there is any guarantee of forward 
progress,
for a loop around [[LOAD PAIR DISJOINT]] that spins until atomicity is 
observed.

I almost suspect that there may be plans to eventually provide some such 
guarantees,
but that the detail of such guarantees does not want to be cast in 
concrete architecture at the time of introduction.
</UL>

In addition, the existing instructions that perform [[add immediate to 
memory]], in signed/unsigned and 32b/64b forms, are declared to be atomic
when "the storage operand is aligned on an [[integral boundary]]" - IBM 
speak for [[natural alignment]].
* [[ADD IMMEDIATE]]
* [[ADD LOGICAL WITH SIGNED IMMEDIATE]]

<UL>
GLEW COMMENT: it is obviously desirable to be able to have a 
fetch-and-add immediate to memory, to avoid having to blow a register on 
the operand for the atomic modification.

It is a bit unusual to be able to extend an existing instruction in this 
way.  If these existing instructions were in wide use, one would expect 
existing code to be somewhat slowed down.
However (1) I suspect the trend towards simpler OOO instructions has 
already made these "add to memory" instructions slower,
while on the other hand (2) advanced implementations make such atomic 
instructions neglibly slower than their non-atomic versions.

</UL>

= [[Atomicity is always relative]] =

IBM literature says: "as observed by other CPUs and the channel 
subsystem, ... appear to occur atomically".

IBM also uses the phrase "block-concurrent interlocked update".  "Block 
concurrent" is a special IBM term related to memory ordering, that says 
that all bytes are accessed atomically as observed by other CPUs. 
However, they may be observed a byte at a time by channel programs... 
but "interlocked" means that channel program requests cannot be inserted 
between the bytes.
Bottom line: atomixc wrt CPUs and I/O channels.

= Reference =

Many references, scattered.

;[share]
: New CPU Facilities in the IBM zEnterprise 196, share.confex.com, 4 
August 2010, 
http://share.confex.com/share/115/webprogram/Handout/Session7034/New%20CPU%20Facilities%20in%20the%20z196.pdf

[toc] | [next] | [standalone]

#6287

From	"Paul A. Clayton" <paaronclayton@gmail.com>
Date	2012-03-07 05:43 -0800
Message-ID	<9a6b0bf4-9cbc-4dc9-a2dd-9aa34f078f65@d17g2000vba.googlegroups.com>
In reply to	#6282

On Mar 6, 11:20 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]
> It is a bit unusual to be able to extend an existing instruction in this
> way.  If these existing instructions were in wide use, one would expect
> existing code to be somewhat slowed down.
> However (1) I suspect the trend towards simpler OOO instructions has
> already made these "add to memory" instructions slower,
> while on the other hand (2) advanced implementations make such atomic
> instructions neglibly slower than their non-atomic versions.

Why would guaranteeing the atomicity of aligned add immediate
to memory make such noticeably slower in the uncontended case
which would be safe without atomicity?  In the contended case
(where the exclusivity is lost between issue and commit or
not acquired first [if the load was speculatively performed
with a shared state] and another agent wrote to the cache
block before exclusivity was acquired and the store commited),
replay might be used, perhaps with a NAK response (if the
commit delay is sufficiently short, possibly delaying replay
until the load can issue in-order) to hold off contending
accesses.  The contended case (which should not be present
in existing software that lacks the atomicity guarantee?)
might be a little slower (having to reload the value and
perform the addition but the store commit would already have
been delayed by the read-for-ownership operation), but even
this slow-down might not be especially bad.  (Such would
seem to add complexity, but it seems that the complexity
would not be on a major critical path/loop.)

A side-effect might be that such could facilitate porting
some single-threaded applications to multithreading?

[toc] | [prev] | [next] | [standalone]

#6288

From	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Date	2012-03-07 07:00 -0800
Message-ID	<4F5777FD.4090505@SPAM.comp-arch.net>
In reply to	#6287

On 3/7/2012 5:43 AM, Paul A. Clayton wrote:
> On Mar 6, 11:20 pm, "Andy (Super) Glew"<a...@SPAM.comp-arch.net>
> wrote:
> [snip]
>> It is a bit unusual to be able to extend an existing instruction in this
>> way.  If these existing instructions were in wide use, one would expect
>> existing code to be somewhat slowed down.
>> However (1) I suspect the trend towards simpler OOO instructions has
>> already made these "add to memory" instructions slower,
>> while on the other hand (2) advanced implementations make such atomic
>> instructions neglibly slower than their non-atomic versions.
>
> Why would guaranteeing the atomicity of aligned add immediate
> to memory make such noticeably slower in the uncontended case
> which would be safe without atomicity?

It doesn't need to, as I think I noted.  Ideally, an uncontended atomic 
RMW should have exactly the same performance as a non-atomic RMW.

However, Intel has only recently begun approaching that ideal, even 
though I started down that path circa 1991.

Hmm... as I write this I may have realized that IBM z-Series's strong 
memory ordering model, which some people say is stronger[*] than Intel's 
ostensible TSO, may have made it easier for IBM to acheive this.  Or, 
rather, perhaps just as hard, but necessary to be IBM mainframe compatible.

(Note [*]: Many people have said that IBM's memory ordering model is 
really strong ordering, but when I read the Principles of Operation I 
don't see this - I think weaker models like TSO are permitted according 
to my reading of the POP.)

Anyway...

Prior to P6, locked operations (a) bypassed the cache, (b) asserted a 
bus lock signal.  For that matter, (c) were done non-speculatively, 
although most things were at that time. (d) drained store buffers. All 
of these things made a locked RMW considerably more expensive.

P6 started doing locked RMWs in the cache, using the cache protocol. 
Instead of locking the bus, it would delay snoops while the locked RMW 
was going on.  P6 also made the locked RMW fencing - no later loads 
could pass it, the store could not be queued up. This was required 
because of Intel's memory ordering model - and if IBM has a higher 
performance implementation to cope with its stronger memory ordering 
model, they may already have solved that problem.

(Actually, it would have been correct to allow later loads to pass the 
load-locked, and rely on the MOB snooping (aka Frey rule consistency - 
Brad Frey is from IBM) to detect problems.  P6 chose not to do this 
largely because of FUD - justified, as it turned out, by some serious 
bugs that making the RMW fencing worked around. Furthermore, P6 did not 
do the load part of the atomic RMW speculatively.  These amounted to 
making locked RMWs considerably more expensive than unlocked, even if no 
contention.)

Over the years these constraints have slowly been removed. I can't give 
a full timeline, but they had not completely gone away when I left 
Intel.  I would be surprised, but happy, if they had been completely 
eliminated by SandyBridge and/or IvyBridge - or, for that matter, by 
Haswell.

Allowing later loads to pass the locked RMW is only the first step.

A more interesting question is whether to do the load-locked part early, 
speculatively, or not.

If no contention, then doing the load-locked speculatively is not a 
problem.  However, if the way you are handling things is to block snoops 
to the cache line(s) involved between the load-locked and store-unlocked 
parts of the RMW, then doing the load-locked early rather than at the 
last minute increases the window in which the line is locked - thereby 
increasing the window of time for contention.  Which can potentially be 
hundreds of cycles, rather than the 1-10 cycles it might be locked if 
done non-speculatively. Equivalently, if you cancel or redo the 
load-locked via an expensive mechanism if contention, then increasing 
the window is a bad thing.  A nice approach is to prefetch, and do the 
load-locked at the last minute... but that loses some of the benefit of 
doing the load-locked speculatively.

Probably the best thing is a combination of prefetch, doing the 
load-locked speculatively, combined with an efficient redo/replay 
mechanism. (Not "nuke the machine and start over"). Note that it is not 
uncommon for there to be operations dependent on the load-locked value.

Apart from all of that stuff, for Intel the putative requirement to 
drain the store buffer before the store-unlock is an impediment even in 
the uncontended case. However, IBM's putative strong ordering model 
requires them to behave as if this is done on all stores, and before all 
loads.  So this is the place where locked RMWs may be relatively slower 
at Intel than IBM. I say putative, because there are several techniques 
to to allow store buffering even on SC (e.g. you can buffer stores if 
you already own the lines; you can even buffer if you don't own the 
lines, so long as you can make certain guarantees about keeping other 
processors out (which is hard to do without deadlock), and/or starting 
from a checkpoint corresponding to a store in the buffer  (which amounts 
to saying that you aren't really buffering after architectural commit).

Anyway, while it is possible to make a locked atomic RMW as efficient as 
an unlocked RMW, particularly in the no-contention case, it has taken 
Intel two decades to get there.  I would be a bit surprised if IBM got 
there right away, on what appears to be their first OOO mainframe in a 
long time.  But, I know, I know - IBM has lots and lots of history. 
Still, I am willing to place a small bet that my conjecture is correct.

[toc] | [prev] | [next] | [standalone]

#6289

From	nmm1@cam.ac.uk
Date	2012-03-07 15:25 +0000
Message-ID	<jj7ukn$pie$1@gosset.csi.cam.ac.uk>
In reply to	#6288

In article <4F5777FD.4090505@SPAM.comp-arch.net>,
Andy (Super) Glew <andy@SPAM.comp-arch.net> wrote:
>>
>> Why would guaranteeing the atomicity of aligned add immediate
>> to memory make such noticeably slower in the uncontended case
>> which would be safe without atomicity?
>
>It doesn't need to, as I think I noted.  Ideally, an uncontended atomic 
>RMW should have exactly the same performance as a non-atomic RMW.
>
>However, Intel has only recently begun approaching that ideal, even 
>though I started down that path circa 1991.
>
>Hmm... as I write this I may have realized that IBM z-Series's strong 
>memory ordering model, which some people say is stronger[*] than Intel's 
>ostensible TSO, may have made it easier for IBM to acheive this.  Or, 
>rather, perhaps just as hard, but necessary to be IBM mainframe compatible.
>
>(Note [*]: Many people have said that IBM's memory ordering model is 
>really strong ordering, but when I read the Principles of Operation I 
>don't see this - I think weaker models like TSO are permitted according 
>to my reading of the POP.)

A brief check of it leaves me none the wiser - i.e. I don't think
that it says anything, which leaves the question up in the air.
If I have missed something in it, a page reference would be useful!

My analysis of this problem is that it is just beginning to bite,
because I can't think of a way where IRIW can be implemented both
scalably and efficiently.  Atomicity is easy; global consistency
is hard.


Regards,
Nick Maclaren.

[toc] | [prev] | [next] | [standalone]

#6317

From	"Paul A. Clayton" <paaronclayton@gmail.com>
Date	2012-03-09 11:05 -0800
Message-ID	<cfa29bc2-261d-42bb-b928-a2f4430f2425@do4g2000vbb.googlegroups.com>
In reply to	#6288

On Mar 7, 10:00 am, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
> On 3/7/2012 5:43 AM, Paul A. Clayton wrote:
>
> > On Mar 6, 11:20 pm, "Andy (Super) Glew"<a...@SPAM.comp-arch.net>
> > wrote:
> > [snip]
> >> It is a bit unusual to be able to extend an existing instruction in this
> >> way.  If these existing instructions were in wide use, one would expect
> >> existing code to be somewhat slowed down.
> >> However (1) I suspect the trend towards simpler OOO instructions has
> >> already made these "add to memory" instructions slower,
> >> while on the other hand (2) advanced implementations make such atomic
> >> instructions neglibly slower than their non-atomic versions.
>
> > Why would guaranteeing the atomicity of aligned add immediate
> > to memory make such noticeably slower in the uncontended case
> > which would be safe without atomicity?
>
> It doesn't need to, as I think I noted.  Ideally, an uncontended atomic
> RMW should have exactly the same performance as a non-atomic RMW.

I guess I was interpreting "advanced implementations" as a
higher barrier.  "Advanced" is usually dependent on the
advancement of generally available knowledge (and tools) and
on the intellectual and other resources of the implementers.
Advanced for IBM (or Intel) in 2012 is different from
advanced for a college freshman in 2012 (much less in 1985).
I was thinking the former (although in context, it would be
more IBM in about 2008--guessing that such would have been
about the cut off for reasonably committing to the atomic
implementation of such for a 2010 product).  (Of course,
IBM is also more constrained by reliability demands for
zSeries (and perhaps even POWER) than Intel is for even
server x86.

> However, Intel has only recently begun approaching that ideal, even
> though I started down that path circa 1991.
>
> Hmm... as I write this I may have realized that IBM z-Series's strong
> memory ordering model, which some people say is stronger[*] than Intel's
> ostensible TSO, may have made it easier for IBM to acheive this.  Or,
> rather, perhaps just as hard, but necessary to be IBM mainframe compatible.

I was thinking that a stronger memory consistency model
might have allowed such to require less _additional_
effort, but I was not certain.  (David Kanter's
somewhat recent article about the z196 processor
[ http://www.realworldtech.com/page.cfm?ArticleID=RWT010312153140 ]
revealed something that surprised me: it-and the previous
zSeries processor used a *write-through* L2.  It is not
clear that such an implementation detail might not have
also made implementing relatively low-additional-
overhead atomic RMW operations less difficult.)

[snip]
> Prior to P6, locked operations (a) bypassed the cache, (b) asserted a
> bus lock signal.  For that matter, (c) were done non-speculatively,
> although most things were at that time. (d) drained store buffers. All
> of these things made a locked RMW considerably more expensive.

On the other hand, only locked RMW operations required
such overhead (whereas even ordinary zSeries LOAD AND
operate instructions would suffer the penalty, so the
extra effort could be more justified [aside from the
relative riches of transistors, knowledge, and tools
available now versus then]).

[snip lots of interesting implementation information]

> Probably the best thing is a combination of prefetch, doing the
> load-locked speculatively, combined with an efficient redo/replay
> mechanism. (Not "nuke the machine and start over"). Note that it is not
> uncommon for there to be operations dependent on the load-locked value.

And the operations are probably often related to
control dependencies (and these control dependencies
might(??) be less predictable in the case of contention).

[snip]
> Anyway, while it is possible to make a locked atomic RMW as efficient as
> an unlocked RMW, particularly in the no-contention case, it has taken
> Intel two decades to get there.  I would be a bit surprised if IBM got
> there right away, on what appears to be their first OOO mainframe in a
> long time.  But, I know, I know - IBM has lots and lots of history.
> Still, I am willing to place a small bet that my conjecture is correct.

I am inclined to think that IBM did not add a substantial
penalty for the no-contention case (even if such has more
overhead than earlier, non-atomic implementations), but
even using the prefetch and non-speculative RWM method
might not have extreme overhead and might make reliability
easier.  I.e., you are probably correct (not "as efficient
as an unlocked RMW"), but the penalty might not be that
bad (and may be hidden by the other performance-improving
features like OoO execution and on-chip L3 cache).

[toc] | [prev] | [next] | [standalone]

#6292

From	Terje Mathisen <"terje.mathisen at tmsw.no">
Date	2012-03-07 20:00 +0100
Message-ID	<5rnk29-riq.ln1@ntp6.tmsw.no>
In reply to	#6282

Andy (Super) Glew wrote:
> http://semipublic.comp-arch.net/wiki/Interlocked-access_facility
>
> = New IBM z196 synchronization instructions =
>
> The IBM z196 adds new [[synchronization instructions]] to the [[IBM
> System z Mainframe ISA]].
>
> Augmenting the legacy instructions
> * [[COMPARE AND SWAP]]
> * [[PERFORM LOCKED OPERATION]]
> * [[TEST AND SET]]
>
> The reference [share] comments "there is no need for a COMPARE AND SWAP
> loop to perform these operations!"
> - their exclamation mark!
> This suggests the motivation for these instructions
> - while [[compare and swap]] is one of the most powerful synchronization
> instructions,
> it is not necessarily efficient.
> Atomic operations such as [[atomic add to memory]] can perform in one
> instruction,
> without a loop, things that would require looping for [[compare and
> swap]], [[test-and-set]], and [[load-linked store-conditional]].
> Looping that may require special mechanisms to guarantee forward progress.

As you've seen here many times, I really like fetch&add (XADD), it is 
pretty much the best I can do without specialized hw that both 
guarantees forward progress and automatically prioritizes the 
threads/processes that didn't "win" on the first attempt.

(Hello Mitch!)

> The z196 [[interlocked-access facility]] instructions include
>
> New atomic instructions:
> * [[LOAD AND ADD]]
> * [[LOAD AND ADD LOGICAL]]
> * [[LOAD AND AND]]
> * [[LOAD AND EXCLUSIVE OR]]
> * [[LOAD AND OR]]

I have some problem figuring out the intended usage for the AND/OR/XOR 
versions of this operation:

The key benefit of XADD is that N threads can use it at once, each of 
them getting a separate return value that they can use to select what to 
do next, i.e. pick entry N from the work-to-do list.

Doing the same with a logical operation would seem to negate this 
advantage, you lose the ability to get unique results for each process?

Terje
-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]

#6295

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-03-07 13:43 -0800
Message-ID	<1432328.1752.1331156586117.JavaMail.geo-discussion-forums@ynkz21>
In reply to	#6292

On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote:
> I have some problem figuring out the intended usage for the AND/OR/XOR 
> versions of this operation:

XOR is useful for setting bits and resetting bits (from the same constant).

So one can "check in their request" and "checkout their request" using the same value. Otherwise one has to ADD a bit to check in a request and then ADD in the 2's complement to check it back out (or OR on a bit and AND-NOT it back off).

Most of the statemachine logic in my simulators uses XOR as in:

     if( p->state & STATE1 )
         p->state ^= STATE1 | STATE2;
     else

// where it is known STATE1 and STATE2 have no common bits 
// but may have more than one bit set.

Mitch

[toc] | [prev] | [next] | [standalone]

#6304

From	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Date	2012-03-08 09:08 -0800
Message-ID	<4F58E783.2080205@SPAM.comp-arch.net>
In reply to	#6295

On 3/7/2012 1:43 PM, MitchAlsup wrote:
> On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote:
>> I have some problem figuring out the intended usage for the AND/OR/XOR
>> versions of this operation:
>
> XOR is useful for setting bits and resetting bits (from the same constant).
>
> So one can "check in their request" and "checkout their request" using the same value. Otherwise one has to ADD a bit to check in a request and then ADD in the 2's complement to check it back out (or OR on a bit and AND-NOT it back off).
>
> Most of the statemachine logic in my simulators uses XOR as in:
>
>       if( p->state&  STATE1 )
>           p->state ^= STATE1 | STATE2;
>       else
>
> // where it is known STATE1 and STATE2 have no common bits
> // but may have more than one bit set.
>
> Mitch


Right.  But do you need that to be atomic?

Can you have multiple threads using the same p->state at the same time? 
  Setting the same bits?

Or possibly setting disjoint bits in the same word?

To get XOR the safely set and clear different bits, you have to know the 
prior state.  Perhaps and AND-OR instruction?

     AND-OR
           tmp := mem
           tmp &= and_mask (or ~and_mask)
           tmp |= or_mask
           mem := tmp

A nice instruction, but it really wants two immediates.

[toc] | [prev] | [next] | [standalone]

#6305

From	Terje Mathisen <"terje.mathisen at tmsw.no">
Date	2012-03-08 18:39 +0100
Message-ID	<2f7n29-nmt.ln1@ntp6.tmsw.no>
In reply to	#6304

Andy (Super) Glew wrote:
> On 3/7/2012 1:43 PM, MitchAlsup wrote:
>> On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote:
>>> I have some problem figuring out the intended usage for the AND/OR/XOR
>>> versions of this operation:
>>
>> XOR is useful for setting bits and resetting bits (from the same
>> constant).
>>
>> So one can "check in their request" and "checkout their request" using
>> the same value. Otherwise one has to ADD a bit to check in a request
>> and then ADD in the 2's complement to check it back out (or OR on a
>> bit and AND-NOT it back off).
>>
>> Most of the statemachine logic in my simulators uses XOR as in:
>>
>> if( p->state& STATE1 )
>> p->state ^= STATE1 | STATE2;
>> else
>>
>> // where it is known STATE1 and STATE2 have no common bits
>> // but may have more than one bit set.
>>
>> Mitch
>
>
> Right. But do you need that to be atomic?
>
> Can you have multiple threads using the same p->state at the same time?
> Setting the same bits?
>
> Or possibly setting disjoint bits in the same word?
>
> To get XOR the safely set and clear different bits, you have to know the
> prior state. Perhaps and AND-OR instruction?
>
> AND-OR
> tmp := mem
> tmp &= and_mask (or ~and_mask)
> tmp |= or_mask
> mem := tmp
>
> A nice instruction, but it really wants two immediates.
>

Exactly, it is hard to figure out a nice usage scenario for the existing 
logical versions.

If you want to share a bitmask between multiple processes, 
simultaneously updating disjoint sits of bits, it seems like a classic 
CAS is more natural.

OTOH, if those bits really are disjoint, then it would be far better to 
store them in separate cache lines!

A common mask only makes sense if there are many readers of the full mask.

The intended use would seem to require two such operations for each 
update, one to turn off bits and another to turn on some other bits, 
making the combined update non-atomic.

The alternative if each bit can only have a single writer would be to 
use the atomic XOR on the bits to be modified.

(BTW, can you always turn such an XOR (when the current value of those 
bits are known) into an XADD of a magic value?)

Terje
-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]

#6311

From	Nomen Nescio <nobody@dizum.com>
Date	2012-03-08 22:33 +0100
Message-ID	<a6f6e132ebfe56cbb374b441865c7c9f@dizum.com>
In reply to	#6305

> > To get XOR the safely set and clear different bits, you have to know the
> > prior state.

Yes, this is it. Until now setting flag bits atomically required a macro
like OIL, NIL (see IBM z/OS Assembler Services Reference). The macros
required a work area and wrapped a CS around the logical operation.

If I understand the issue (haven't gone through this thread carefully and
haven't reviewed the z196 POPS but I have used OIL, NIL etc) then this
allows the assembler coder to do this more compactly and simply than by
invoking a macro. This functionality is required in a typical
multiprocessing scenario where multiple tasks are testing and setting flags
and where illogical combinations of flags must be avoided.

[toc] | [prev] | [next] | [standalone]

#6313

From	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Date	2012-03-08 17:17 -0800
Message-ID	<4F595A1B.9050100@SPAM.comp-arch.net>
In reply to	#6305

On 3/8/2012 9:39 AM, Terje Mathisen wrote:
> Andy (Super) Glew wrote:
>> On 3/7/2012 1:43 PM, MitchAlsup wrote:
>>> On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote:
>>>> I have some problem figuring out the intended usage for the AND/OR/XOR
>>>> versions of this operation:
>>>
>>> XOR is useful for setting bits and resetting bits (from the same
>>> constant).
>> To get XOR the safely set and clear different bits, you have to know the
>> prior state.
>
> Exactly, it is hard to figure out a nice usage scenario for the existing
> logical versions.

Some simple checksums are simple XORs. Without shifting.

This can be parallelized.

Hmm, a one time pad may simply XOR. Or perhaps a step in breaking some 
crypto is applying lots of XORs...

[toc] | [prev] | [next] | [standalone]

#6314

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-03-09 08:19 -0800
Message-ID	<15229834.761.1331309987212.JavaMail.geo-discussion-forums@ynbq18>
In reply to	#6305

On Thursday, March 8, 2012 11:39:13 AM UTC-6, Terje Mathisen wrote:
> Andy (Super) Glew wrote:
> > On 3/7/2012 1:43 PM, MitchAlsup wrote:
> >> On Wednesday, March 7, 2012 1:00:20 PM UTC-6, Terje Mathisen wrote:
> >>> I have some problem figuring out the intended usage for the AND/OR/XOR
> >>> versions of this operation:
> >>
> >> XOR is useful for setting bits and resetting bits (from the same
> >> constant).
> >>
> >> So one can "check in their request" and "checkout their request" using
> >> the same value. Otherwise one has to ADD a bit to check in a request
> >> and then ADD in the 2's complement to check it back out (or OR on a
> >> bit and AND-NOT it back off).
> >>
> >> Most of the statemachine logic in my simulators uses XOR as in:
> >>
> >> if( p->state& STATE1 )
> >> p->state ^= STATE1 | STATE2;
> >> else
> >>
> >> // where it is known STATE1 and STATE2 have no common bits
> >> // but may have more than one bit set.
> >>
> >> Mitch
> >
> >
> > Right. But do you need that to be atomic?
> >
> > Can you have multiple threads using the same p->state at the same time?
> > Setting the same bits?
> >
> > Or possibly setting disjoint bits in the same word?
> >
> > To get XOR the safely set and clear different bits, you have to know the
> > prior state. Perhaps and AND-OR instruction?
> >
> > AND-OR
> > tmp := mem
> > tmp &= and_mask (or ~and_mask)
> > tmp |= or_mask
> > mem := tmp
> >
> > A nice instruction, but it really wants two immediates.
> >
> 
> Exactly, it is hard to figure out a nice usage scenario for the existing 
> logical versions.

I have seen a parallel processing implementation that used the bit to denote which processor was requesting/waiting-for the lock (kernel only). As long as the number of processors is smaller than wordsize this works well.

Mitch

[toc] | [prev] | [next] | [standalone]

#6315

From	Terje Mathisen <"terje.mathisen at tmsw.no">
Date	2012-03-09 17:45 +0100
Message-ID	<jlop29-ll21.ln1@ntp6.tmsw.no>
In reply to	#6314

MitchAlsup wrote:
> On Thursday, March 8, 2012 11:39:13 AM UTC-6, Terje Mathisen wrote:
>> Exactly, it is hard to figure out a nice usage scenario for the
>> existing logical versions.
>
> I have seen a parallel processing implementation that used the bit to
> denote which processor was requesting/waiting-for the lock (kernel
> only). As long as the number of processors is smaller than wordsize
> this works well.

This dovetails nicely with the one scenario I did find: When you have 
lots of processes sharing a bitmap, but each individual bit can only 
have a single writer:

In this case said owner can use AND to clear it, ORto set it and XOR to 
toggle it.

In fact, in this scenario the owner will always have a local copy, so 
even without reading it first, the XOR version is sufficient for all 
cases where the bit must be modified.

How is this (bus transaction-wise) different from a simple

   LOCK XOR [bitmask], my_bits_to_toggle

?

You can also use XADD with +/- my_bits_to_toggle to get the same effect, 
can't you?

Terje
-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]

#6318

From	jgk@panix.com (Joe keane)
Date	2012-03-09 20:30 +0000
Message-ID	<jjdp9u$l77$1@reader1.panix.com>
In reply to	#6292

In article <5rnk29-riq.ln1@ntp6.tmsw.no>,
Terje Mathisen  <"terje.mathisen at tmsw.no"> wrote:
>I have some problem figuring out the intended usage for the AND/OR/XOR 
>versions of this operation:

Maybe it takes more transistors to not implement them.

[toc] | [prev] | [next] | [standalone]

#6320

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-03-09 13:10 -0800
Message-ID	<32946367.1627.1331327445478.JavaMail.geo-discussion-forums@ynne2>
In reply to	#6318

On Friday, March 9, 2012 2:30:54 PM UTC-6, Joe keane wrote:
> Maybe it takes more transistors to not implement them.

It probably takes more transistors in the decode stage of the pipeline to decode the new instructions that all the transistors used in the "performance" of the instruction at the memory.

Mitch

[toc] | [prev] | [standalone]

csiph-web

Ne(ish) IBM z196 synchronization instructions

Contents

#6282 — Ne(ish) IBM z196 synchronization instructions

#6287

#6288

#6289

#6317

#6292

#6295

#6304

#6305

#6311

#6313

#6314

#6315

#6318

#6320