Groups > comp.arch > #6405 > unrolled thread

Architecting a return address stack

Started by	"Paul A. Clayton" <paaronclayton@gmail.com>
First post	2012-03-27 17:35 -0700
Last post	2012-04-14 08:32 -0700
Articles	8 — 6 participants

Back to article view | Back to comp.arch

  Architecting a return address stack "Paul A. Clayton" <paaronclayton@gmail.com> - 2012-03-27 17:35 -0700
    Re: Architecting a return address stack MitchAlsup <MitchAlsup@aol.com> - 2012-03-28 08:57 -0700
      Re: Architecting a return address stack ggtgp@yahoo.com - 2012-04-04 12:03 -0700
        Re: Architecting a return address stack MitchAlsup <MitchAlsup@aol.com> - 2012-04-05 09:08 -0700
    Re: Architecting a return address stack torbenm@diku.dk (Torben Ægidius Mogensen) - 2012-04-10 12:23 +0200
      Re: Architecting a return address stack Fritz Wuehler <fritz@spamexpire-201204.rodent.frell.theremailer.net> - 2012-04-10 22:22 +0200
        Re: Architecting a return address stack torbenm@diku.dk (Torben Ægidius Mogensen) - 2012-04-12 09:52 +0200
    Re: Architecting a return address stack jacko <jackokring@gmail.com> - 2012-04-14 08:32 -0700

#6405 — Architecting a return address stack

From	"Paul A. Clayton" <paaronclayton@gmail.com>
Date	2012-03-27 17:35 -0700
Subject	Architecting a return address stack
Message-ID	<37e7fa2b-2d98-428d-bdb8-369b995bcaad@d17g2000vba.googlegroups.com>

While an architected return address stack can provide a modest
benefit from early return address determination (a predictive
return address stack provides most of this benefit but is
predictive, of limited size, and stores information redundantly)
and protection from return address overwriting, such introduces
some overhead which may make it less attractive at the very low
end (the level of, e.g., ARM Cortex-M0+) and at the higher end
(providing an additional OoO execution mechanism for such
special purpose registers and memory accesses could be a
nuisance, though being highly decoupled from ordinary
execution should constrain complexity it seems).

(It is not clear if such a stack would have any stack
unwinding or debug benefits.)

One overhead for a very low end implementation is the
requirement for an additional pointer.  (I do not know what
extra overhead would be generated by such a special case
in terms of instruction decode and execution.)

Another overhead involves the separate protection of this
stack, particularly with respect to bounding its size without
wasting memory but also (possibly) protecting the stack from
inappropriate accesses (at the very low end such protection
might not be important).

Two thoughts came to mind with respect to this overhead.
First, the return address stack might save one additional value
with the return address.  This might be a simple choice of two
fixed register names, one being the general stack pointer (to
save a frame) and the other being a caller-save register.
Saving one additional value might waste a little extra storage
and memory bandwidth but would increase utilization of the
memory space of the return address stack.  (Saving a larger
fixed number of registers would waste more space/bandwidth and
saving a variable number of registers would make management of
the stack and prefetching return addresses more complex.)

It might be reasonable to allow a fixed-per-thread number of
values to be save [e.g., 1, 2, or 4 values], but it is not
clear that the benefit of such would be significant.  There is
also an issue with supporting an address space (for code at
least) smaller than the size of registers that might be saved,
but this seems like a minor issue.

The second thought was that the return address stack pointer
could be masked to provide a thread local storage pointer
(using offsets of the opposite direction of stack growth).
(If the mask included more significant bits, it might even be
used to provide a global pointer for multiple threads.)  One
obvious issue with this use is the requirement that the return
address stack not pass an alignment boundary for which the mask
is established.  With address translation, it would be somewhat
simple to shift the stack, but without address translation
values would have to be copied (possibly many values if the
protection mechanism was not fine-grained).  One might also
use the area beyond the limit of the stack for special storage.

(On the positive side, such a mask might also help with the
bounding of the stack, so the additional overhead might not be
especially great.)

Even apart from the unlikeliness of a new ISA being introduced,
it seems there are significant issues with providing such a
separate stack, but perhaps some comp.arch readers might
enjoy sharing their thoughts.

[toc] | [next] | [standalone]

#6407

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-03-28 08:57 -0700
Message-ID	<3547937.746.1332950262513.JavaMail.geo-discussion-forums@ynjx8>
In reply to	#6405

Do yourself (and everyone else) a favor and don't.

That is, let the stack, which is an inherently software feature, be defined in a way that best fits the needs of the software, by those who have to maintain the software.

Mitch

[toc] | [prev] | [next] | [standalone]

#6442

From	ggtgp@yahoo.com
Date	2012-04-04 12:03 -0700
Message-ID	<29266476.18.1333566235966.JavaMail.geo-discussion-forums@ynll3>
In reply to	#6407

On Wednesday, March 28, 2012 10:57:42 AM UTC-5, MitchAlsup wrote:
> Do yourself (and everyone else) a favor and don't.
> 
> That is, let the stack, which is an inherently software feature, be defined in a way that best fits the needs of the software, by those who have to maintain the software.

So you prefer the Alpha approach of the LINK register being a normal register,
as opposed to the more typical special register that is mostly hidden?

You can throw some special hardware at LINK, but I think most just use a register.
Is the idea of throwing extra hardware at LINK just futile duplication?

[toc] | [prev] | [next] | [standalone]

#6453

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-04-05 09:08 -0700
Message-ID	<7282732.580.1333642090699.JavaMail.geo-discussion-forums@ynkh5>
In reply to	#6442

On Wednesday, April 4, 2012 2:03:55 PM UTC-5, gg...@yahoo.com wrote:
> On Wednesday, March 28, 2012 10:57:42 AM UTC-5, MitchAlsup wrote:
> > Do yourself (and everyone else) a favor and don't.
> > 
> > That is, let the stack, which is an inherently software feature, be defined in a way that best fits the needs of the software, by those who have to maintain the software.
> 
> So you prefer the Alpha approach of the LINK register being a normal register,
> as opposed to the more typical special register that is mostly hidden?
> 
> You can throw some special hardware at LINK, but I think most just use a register.
> Is the idea of throwing extra hardware at LINK just futile duplication?

For the extremely restircted branch-to-subroutine instruction, I accept the necessity to have a predefined place (register) to put the return address. For the similar jump-to-subroutine instruction, the instruction can provide the register bits needed. Whether the JSR does or not is at the architects discresssion.

What I was saying back to Paul was that the hardware should not define what stacks are (place, register, movement direction) as this should be (and can be) completely defined, and implemented in software. This bypasses the subtle bugs that one might get if one defines push and pop instructions with various operand sizes that come up when the stack pointer gets misaligned wrt the next item to be pushed or popped. Software can work around these issues a lot easier than hardware can. It also avoids issues when the base architecture changes size (from 32-bits to 64-bits: for example).

Mitch

[toc] | [prev] | [next] | [standalone]

#6468

From	torbenm@diku.dk (Torben Ægidius Mogensen)
Date	2012-04-10 12:23 +0200
Message-ID	<7zlim3lwbs.fsf@ask.diku.dk>
In reply to	#6405

"Paul A. Clayton" <paaronclayton@gmail.com> writes:

> While an architected return address stack can provide a modest
> benefit from early return address determination (a predictive
> return address stack provides most of this benefit but is
> predictive, of limited size, and stores information redundantly)
> and protection from return address overwriting, such introduces
> some overhead which may make it less attractive at the very low
> end (the level of, e.g., ARM Cortex-M0+) and at the higher end
> (providing an additional OoO execution mechanism for such
> special purpose registers and memory accesses could be a
> nuisance, though being highly decoupled from ordinary
> execution should constrain complexity it seems).

One possible benefit from a hardware return stack is to execute the
continuation after the return concurrently with the call.  Instead of
saving registers on the return stack, the caller gets its own copy of
the registers (through renaming) and is executed in parallel with the
continuation, synchronising through the return-value register.  If there
is not enough registers (or threads) for the new call, a thread is
spilled to a stack to make room.

In a normal memory model, the continuation can continue until either it
accesses the return value register or a memory location (as this might
potentially be overwritten by the called procedure), but you could have
a more "functional" memory model where it is assumed that the called
procedure does not modify memory reachable by the caller (except through
a pointer in the return-value register).  This would allow for more
parallelism.  You could have two kinds of memory-access instructions:
Functional and imperative, where a thread would stop on an imperative
memory access until all calls higher in the stack are done, but where
functional memory accesses would not block.  Enforcing that functional
memory accesses do, indeed, behave in a write-once-read-many way may be
expensive, but you could leave that to the compiler to ensure and simply
say that behaviour is undefined if this is not done.

	Torben

[toc] | [prev] | [next] | [standalone]

#6482

From	Fritz Wuehler <fritz@spamexpire-201204.rodent.frell.theremailer.net>
Date	2012-04-10 22:22 +0200
Message-ID	<c2ee7d980f31d0c0e7660bbbc99b7688@msgid.frell.theremailer.net>
In reply to	#6468

This doesn't sound so different from UltraSPARC. Has anybody written a LISP
or other language to optimise tail recursion in this environment?

torbenm@diku.dk (Torben Ægidius Mogensen) wrote:

> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
> 
> > While an architected return address stack can provide a modest
> > benefit from early return address determination (a predictive
> > return address stack provides most of this benefit but is
> > predictive, of limited size, and stores information redundantly)
> > and protection from return address overwriting, such introduces
> > some overhead which may make it less attractive at the very low
> > end (the level of, e.g., ARM Cortex-M0+) and at the higher end
> > (providing an additional OoO execution mechanism for such
> > special purpose registers and memory accesses could be a
> > nuisance, though being highly decoupled from ordinary
> > execution should constrain complexity it seems).
> 
> One possible benefit from a hardware return stack is to execute the
> continuation after the return concurrently with the call.  Instead of
> saving registers on the return stack, the caller gets its own copy of
> the registers (through renaming) and is executed in parallel with the
> continuation, synchronising through the return-value register.  If there
> is not enough registers (or threads) for the new call, a thread is
> spilled to a stack to make room.
> 
> In a normal memory model, the continuation can continue until either it
> accesses the return value register or a memory location (as this might
> potentially be overwritten by the called procedure), but you could have
> a more "functional" memory model where it is assumed that the called
> procedure does not modify memory reachable by the caller (except through
> a pointer in the return-value register).  This would allow for more
> parallelism.  You could have two kinds of memory-access instructions:
> Functional and imperative, where a thread would stop on an imperative
> memory access until all calls higher in the stack are done, but where
> functional memory accesses would not block.  Enforcing that functional
> memory accesses do, indeed, behave in a write-once-read-many way may be
> expensive, but you could leave that to the compiler to ensure and simply
> say that behaviour is undefined if this is not done.
> 
> 	Torben

[toc] | [prev] | [next] | [standalone]

#6500

From	torbenm@diku.dk (Torben Ægidius Mogensen)
Date	2012-04-12 09:52 +0200
Message-ID	<7zr4vtwfmy.fsf@ask.diku.dk>
In reply to	#6482

Fritz Wuehler <fritz@spamexpire-201204.rodent.frell.theremailer.net>
writes:

> torbenm@diku.dk (Torben Ægidius Mogensen) wrote:
>
>> One possible benefit from a hardware return stack is to execute the
>> continuation after the return concurrently with the call.
>
> This doesn't sound so different from UltraSPARC. Has anybody written a LISP
> or other language to optimise tail recursion in this environment?

It would not make much sense for tail recursion, as there isn't any
continuation to speak of to execute in parallel with the call (nor would
you push any return address).

	Torben

[toc] | [prev] | [next] | [standalone]

#6540

From	jacko <jackokring@gmail.com>
Date	2012-04-14 08:32 -0700
Message-ID	<15932501.218.1334417547986.JavaMail.geo-discussion-forums@vbbfj25>
In reply to	#6405

I see the biggest issue is the setting of the hidden stack pointer needing to flush the whole T-cache (for implementation ease), and the need to make the setting operation protected mode only. For a RISC pipeline, I think a hidden link register and a save link instruction, along with return instruction and a dedicated call address instruction are needed.

[toc] | [prev] | [standalone]

csiph-web

Architecting a return address stack

Contents

#6405 — Architecting a return address stack

#6407

#6442

#6453

#6468

#6482

#6500

#6540