Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.arch > #6405 > unrolled thread
| Started by | "Paul A. Clayton" <paaronclayton@gmail.com> |
|---|---|
| First post | 2012-03-27 17:35 -0700 |
| Last post | 2012-04-14 08:32 -0700 |
| Articles | 8 — 6 participants |
Back to article view | Back to comp.arch
Architecting a return address stack "Paul A. Clayton" <paaronclayton@gmail.com> - 2012-03-27 17:35 -0700
Re: Architecting a return address stack MitchAlsup <MitchAlsup@aol.com> - 2012-03-28 08:57 -0700
Re: Architecting a return address stack ggtgp@yahoo.com - 2012-04-04 12:03 -0700
Re: Architecting a return address stack MitchAlsup <MitchAlsup@aol.com> - 2012-04-05 09:08 -0700
Re: Architecting a return address stack torbenm@diku.dk (Torben Ægidius Mogensen) - 2012-04-10 12:23 +0200
Re: Architecting a return address stack Fritz Wuehler <fritz@spamexpire-201204.rodent.frell.theremailer.net> - 2012-04-10 22:22 +0200
Re: Architecting a return address stack torbenm@diku.dk (Torben Ægidius Mogensen) - 2012-04-12 09:52 +0200
Re: Architecting a return address stack jacko <jackokring@gmail.com> - 2012-04-14 08:32 -0700
| From | "Paul A. Clayton" <paaronclayton@gmail.com> |
|---|---|
| Date | 2012-03-27 17:35 -0700 |
| Subject | Architecting a return address stack |
| Message-ID | <37e7fa2b-2d98-428d-bdb8-369b995bcaad@d17g2000vba.googlegroups.com> |
While an architected return address stack can provide a modest benefit from early return address determination (a predictive return address stack provides most of this benefit but is predictive, of limited size, and stores information redundantly) and protection from return address overwriting, such introduces some overhead which may make it less attractive at the very low end (the level of, e.g., ARM Cortex-M0+) and at the higher end (providing an additional OoO execution mechanism for such special purpose registers and memory accesses could be a nuisance, though being highly decoupled from ordinary execution should constrain complexity it seems). (It is not clear if such a stack would have any stack unwinding or debug benefits.) One overhead for a very low end implementation is the requirement for an additional pointer. (I do not know what extra overhead would be generated by such a special case in terms of instruction decode and execution.) Another overhead involves the separate protection of this stack, particularly with respect to bounding its size without wasting memory but also (possibly) protecting the stack from inappropriate accesses (at the very low end such protection might not be important). Two thoughts came to mind with respect to this overhead. First, the return address stack might save one additional value with the return address. This might be a simple choice of two fixed register names, one being the general stack pointer (to save a frame) and the other being a caller-save register. Saving one additional value might waste a little extra storage and memory bandwidth but would increase utilization of the memory space of the return address stack. (Saving a larger fixed number of registers would waste more space/bandwidth and saving a variable number of registers would make management of the stack and prefetching return addresses more complex.) It might be reasonable to allow a fixed-per-thread number of values to be save [e.g., 1, 2, or 4 values], but it is not clear that the benefit of such would be significant. There is also an issue with supporting an address space (for code at least) smaller than the size of registers that might be saved, but this seems like a minor issue. The second thought was that the return address stack pointer could be masked to provide a thread local storage pointer (using offsets of the opposite direction of stack growth). (If the mask included more significant bits, it might even be used to provide a global pointer for multiple threads.) One obvious issue with this use is the requirement that the return address stack not pass an alignment boundary for which the mask is established. With address translation, it would be somewhat simple to shift the stack, but without address translation values would have to be copied (possibly many values if the protection mechanism was not fine-grained). One might also use the area beyond the limit of the stack for special storage. (On the positive side, such a mask might also help with the bounding of the stack, so the additional overhead might not be especially great.) Even apart from the unlikeliness of a new ISA being introduced, it seems there are significant issues with providing such a separate stack, but perhaps some comp.arch readers might enjoy sharing their thoughts.
[toc] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-03-28 08:57 -0700 |
| Message-ID | <3547937.746.1332950262513.JavaMail.geo-discussion-forums@ynjx8> |
| In reply to | #6405 |
Do yourself (and everyone else) a favor and don't. That is, let the stack, which is an inherently software feature, be defined in a way that best fits the needs of the software, by those who have to maintain the software. Mitch
[toc] | [prev] | [next] | [standalone]
| From | ggtgp@yahoo.com |
|---|---|
| Date | 2012-04-04 12:03 -0700 |
| Message-ID | <29266476.18.1333566235966.JavaMail.geo-discussion-forums@ynll3> |
| In reply to | #6407 |
On Wednesday, March 28, 2012 10:57:42 AM UTC-5, MitchAlsup wrote: > Do yourself (and everyone else) a favor and don't. > > That is, let the stack, which is an inherently software feature, be defined in a way that best fits the needs of the software, by those who have to maintain the software. So you prefer the Alpha approach of the LINK register being a normal register, as opposed to the more typical special register that is mostly hidden? You can throw some special hardware at LINK, but I think most just use a register. Is the idea of throwing extra hardware at LINK just futile duplication?
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-04-05 09:08 -0700 |
| Message-ID | <7282732.580.1333642090699.JavaMail.geo-discussion-forums@ynkh5> |
| In reply to | #6442 |
On Wednesday, April 4, 2012 2:03:55 PM UTC-5, gg...@yahoo.com wrote: > On Wednesday, March 28, 2012 10:57:42 AM UTC-5, MitchAlsup wrote: > > Do yourself (and everyone else) a favor and don't. > > > > That is, let the stack, which is an inherently software feature, be defined in a way that best fits the needs of the software, by those who have to maintain the software. > > So you prefer the Alpha approach of the LINK register being a normal register, > as opposed to the more typical special register that is mostly hidden? > > You can throw some special hardware at LINK, but I think most just use a register. > Is the idea of throwing extra hardware at LINK just futile duplication? For the extremely restircted branch-to-subroutine instruction, I accept the necessity to have a predefined place (register) to put the return address. For the similar jump-to-subroutine instruction, the instruction can provide the register bits needed. Whether the JSR does or not is at the architects discresssion. What I was saying back to Paul was that the hardware should not define what stacks are (place, register, movement direction) as this should be (and can be) completely defined, and implemented in software. This bypasses the subtle bugs that one might get if one defines push and pop instructions with various operand sizes that come up when the stack pointer gets misaligned wrt the next item to be pushed or popped. Software can work around these issues a lot easier than hardware can. It also avoids issues when the base architecture changes size (from 32-bits to 64-bits: for example). Mitch
[toc] | [prev] | [next] | [standalone]
| From | torbenm@diku.dk (Torben Ægidius Mogensen) |
|---|---|
| Date | 2012-04-10 12:23 +0200 |
| Message-ID | <7zlim3lwbs.fsf@ask.diku.dk> |
| In reply to | #6405 |
"Paul A. Clayton" <paaronclayton@gmail.com> writes: > While an architected return address stack can provide a modest > benefit from early return address determination (a predictive > return address stack provides most of this benefit but is > predictive, of limited size, and stores information redundantly) > and protection from return address overwriting, such introduces > some overhead which may make it less attractive at the very low > end (the level of, e.g., ARM Cortex-M0+) and at the higher end > (providing an additional OoO execution mechanism for such > special purpose registers and memory accesses could be a > nuisance, though being highly decoupled from ordinary > execution should constrain complexity it seems). One possible benefit from a hardware return stack is to execute the continuation after the return concurrently with the call. Instead of saving registers on the return stack, the caller gets its own copy of the registers (through renaming) and is executed in parallel with the continuation, synchronising through the return-value register. If there is not enough registers (or threads) for the new call, a thread is spilled to a stack to make room. In a normal memory model, the continuation can continue until either it accesses the return value register or a memory location (as this might potentially be overwritten by the called procedure), but you could have a more "functional" memory model where it is assumed that the called procedure does not modify memory reachable by the caller (except through a pointer in the return-value register). This would allow for more parallelism. You could have two kinds of memory-access instructions: Functional and imperative, where a thread would stop on an imperative memory access until all calls higher in the stack are done, but where functional memory accesses would not block. Enforcing that functional memory accesses do, indeed, behave in a write-once-read-many way may be expensive, but you could leave that to the compiler to ensure and simply say that behaviour is undefined if this is not done. Torben
[toc] | [prev] | [next] | [standalone]
| From | Fritz Wuehler <fritz@spamexpire-201204.rodent.frell.theremailer.net> |
|---|---|
| Date | 2012-04-10 22:22 +0200 |
| Message-ID | <c2ee7d980f31d0c0e7660bbbc99b7688@msgid.frell.theremailer.net> |
| In reply to | #6468 |
This doesn't sound so different from UltraSPARC. Has anybody written a LISP or other language to optimise tail recursion in this environment? torbenm@diku.dk (Torben Ægidius Mogensen) wrote: > "Paul A. Clayton" <paaronclayton@gmail.com> writes: > > > While an architected return address stack can provide a modest > > benefit from early return address determination (a predictive > > return address stack provides most of this benefit but is > > predictive, of limited size, and stores information redundantly) > > and protection from return address overwriting, such introduces > > some overhead which may make it less attractive at the very low > > end (the level of, e.g., ARM Cortex-M0+) and at the higher end > > (providing an additional OoO execution mechanism for such > > special purpose registers and memory accesses could be a > > nuisance, though being highly decoupled from ordinary > > execution should constrain complexity it seems). > > One possible benefit from a hardware return stack is to execute the > continuation after the return concurrently with the call. Instead of > saving registers on the return stack, the caller gets its own copy of > the registers (through renaming) and is executed in parallel with the > continuation, synchronising through the return-value register. If there > is not enough registers (or threads) for the new call, a thread is > spilled to a stack to make room. > > In a normal memory model, the continuation can continue until either it > accesses the return value register or a memory location (as this might > potentially be overwritten by the called procedure), but you could have > a more "functional" memory model where it is assumed that the called > procedure does not modify memory reachable by the caller (except through > a pointer in the return-value register). This would allow for more > parallelism. You could have two kinds of memory-access instructions: > Functional and imperative, where a thread would stop on an imperative > memory access until all calls higher in the stack are done, but where > functional memory accesses would not block. Enforcing that functional > memory accesses do, indeed, behave in a write-once-read-many way may be > expensive, but you could leave that to the compiler to ensure and simply > say that behaviour is undefined if this is not done. > > Torben
[toc] | [prev] | [next] | [standalone]
| From | torbenm@diku.dk (Torben Ægidius Mogensen) |
|---|---|
| Date | 2012-04-12 09:52 +0200 |
| Message-ID | <7zr4vtwfmy.fsf@ask.diku.dk> |
| In reply to | #6482 |
Fritz Wuehler <fritz@spamexpire-201204.rodent.frell.theremailer.net> writes: > torbenm@diku.dk (Torben Ægidius Mogensen) wrote: > >> One possible benefit from a hardware return stack is to execute the >> continuation after the return concurrently with the call. > > This doesn't sound so different from UltraSPARC. Has anybody written a LISP > or other language to optimise tail recursion in this environment? It would not make much sense for tail recursion, as there isn't any continuation to speak of to execute in parallel with the call (nor would you push any return address). Torben
[toc] | [prev] | [next] | [standalone]
| From | jacko <jackokring@gmail.com> |
|---|---|
| Date | 2012-04-14 08:32 -0700 |
| Message-ID | <15932501.218.1334417547986.JavaMail.geo-discussion-forums@vbbfj25> |
| In reply to | #6405 |
I see the biggest issue is the setting of the hidden stack pointer needing to flush the whole T-cache (for implementation ease), and the need to make the setting operation protected mode only. For a RISC pipeline, I think a hidden link register and a save link instruction, along with return instruction and a dedicated call address instruction are needed.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.arch
csiph-web