Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.arch > #5703 > unrolled thread
| Started by | "Unspecified" <partha@perfectvips.com> |
|---|---|
| First post | 2012-02-04 21:54 +0530 |
| Last post | 2012-02-08 01:04 -0800 |
| Articles | 20 on this page of 72 — 25 participants |
Back to article view | Back to comp.arch
Single Thread Performance "Unspecified" <partha@perfectvips.com> - 2012-02-04 21:54 +0530
Re: Single Thread Performance Brett Davis <ggtgp@yahoo.com> - 2012-02-06 05:55 -0600
Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-06 13:06 +0000
Re: Single Thread Performance Robert Myers <rbmyersusa@gmail.com> - 2012-02-06 14:12 -0500
Re: Single Thread Performance BGB <cr88192@hotmail.com> - 2012-02-06 13:36 -0700
Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-06 20:47 +0000
Re: Single Thread Performance BGB <cr88192@hotmail.com> - 2012-02-06 15:07 -0700
Re: Single Thread Performance Brett Davis <ggtgp@yahoo.com> - 2012-02-06 16:32 -0600
Re: Single Thread Performance Robert Wessel <robertwessel2@yahoo.com> - 2012-02-06 17:45 -0600
Re: Single Thread Performance Brett Davis <ggtgp@yahoo.com> - 2012-02-07 06:01 -0600
Re: Single Thread Performance BGB <cr88192@hotmail.com> - 2012-02-07 13:32 -0700
Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-09 19:08 +0000
Re: Single Thread Performance BGB <cr88192@hotmail.com> - 2012-02-10 08:56 -0700
Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-06 20:42 +0000
Re: Single Thread Performance Robert Myers <rbmyersusa@gmail.com> - 2012-02-06 19:36 -0500
Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-06 18:28 -0800
Re: Single Thread Performance Robert Myers <rbmyersusa@gmail.com> - 2012-02-06 22:23 -0500
Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-07 06:52 +0000
Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-06 12:10 -0800
Re: Single Thread Performance Thomas Womack <twomack@chiark.greenend.org.uk> - 2012-02-07 10:13 +0000
Re: Single Thread Performance Brett Davis <ggtgp@yahoo.com> - 2012-02-20 23:58 -0600
Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-07 17:33 +0000
Re: Single Thread Performance nedbrek <nedbrek@yahoo.com> - 2012-02-15 08:10 -0500
Re: Single Thread Performance Robert Myers <rbmyersusa@gmail.com> - 2012-02-06 14:17 -0500
Re: Single Thread Performance del cecchi <delcecchi@gmail.com> - 2012-02-25 22:07 -0800
Re: Single Thread Performance jgk@panix.com (Joe keane) - 2012-02-07 17:57 +0000
Re: Single Thread Performance Quadibloc <jsavard@ecn.ab.ca> - 2012-02-05 13:13 -0800
Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-05 21:35 -0800
Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-07 17:38 +0000
Re: Single Thread Performance Stephen Sprunk <stephen@sprunk.org> - 2012-02-07 14:54 -0600
Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-07 21:33 +0000
Re: Single Thread Performance Stephen Sprunk <stephen@sprunk.org> - 2012-02-07 23:13 -0600
Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-08 18:54 +0000
Re: Single Thread Performance Stephen Sprunk <stephen@sprunk.org> - 2012-02-08 15:17 -0600
Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-09 08:13 +0100
Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-09 17:08 +0000
Re: Single Thread Performance Stephen Sprunk <stephen@sprunk.org> - 2012-02-09 16:01 -0600
Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-09 07:56 +0100
Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-09 17:18 +0000
Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-08 10:17 +0100
Re: Single Thread Performance Jon <jon@beniston.com> - 2012-02-08 05:32 -0800
Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-07 16:00 -0800
Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-08 18:35 +0000
Re: Single Thread Performance Partha <parthaspanda22@gmail.com> - 2012-02-10 11:32 -0800
Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-10 20:31 +0000
Re: Single Thread Performance "Unspecified" <partha@perfectvips.com> - 2012-02-11 02:12 +0530
Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-10 21:04 +0000
Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-10 16:43 -0800
Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-10 19:48 -0800
Re: Single Thread Performance EricP <ThatWouldBeTelling@thevillage.com> - 2012-02-12 14:31 -0500
Re: Single Thread Performance Stefan Monnier <monnier@iro.umontreal.ca> - 2012-02-12 21:50 -0500
Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-12 19:45 -0800
Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-12 20:36 -0800
Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-13 06:46 -0800
Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-13 08:58 -0800
Re: Single Thread Performance "Paul A. Clayton" <paaronclayton@gmail.com> - 2012-02-13 16:19 -0800
Re: Single Thread Performance Rick Jones <rick.jones2@hp.com> - 2012-02-14 03:55 +0000
Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-14 10:30 +0100
Re: Single Thread Performance Andrew Reilly <areilly---@bigpond.net.au> - 2012-02-14 10:49 +0000
Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-14 13:21 +0100
Re: Single Thread Performance Stephen Fuld <SFuld@alumni.cmu.edu.invalid> - 2012-02-14 13:11 -0800
Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-14 09:29 -0800
Re: Single Thread Performance EricP <ThatWouldBeTelling@thevillage.com> - 2012-02-14 12:40 -0500
Re: Single Thread Performance EricP <ThatWouldBeTelling@thevillage.com> - 2012-02-14 16:12 -0500
Re: Single Thread Performance Rick Jones <rick.jones2@hp.com> - 2012-02-14 21:14 +0000
Re: Single Thread Performance Rick Jones <rick.jones2@hp.com> - 2012-02-14 21:16 +0000
Re: Single Thread Performance Rick Jones <rick.jones2@hp.com> - 2012-02-14 21:09 +0000
Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-14 09:26 -0800
Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-15 08:44 +0100
Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-15 01:07 -0800
Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-14 10:16 +0100
Re: Single Thread Performance Michael S <already5chosen@yahoo.com> - 2012-02-08 01:04 -0800
Page 3 of 4 — ← Prev page 1 2 [3] 4 Next page →
| From | Jon <jon@beniston.com> |
|---|---|
| Date | 2012-02-08 05:32 -0800 |
| Message-ID | <ba6087e6-12e2-4df3-ab75-c86662934d69@e27g2000vbu.googlegroups.com> |
| In reply to | #5803 |
> How? That would necessitate going from six (3+3) to fifteen (5+5+5) > operand bits per instruction. > > Granted, you would eliminate some instructions by reducing stack spills > and register copies, but it's hard to imagine that would be enough to > overcome increasing the size of _every_ instruction by an entire byte. You don't need to increase every instruction. Not every instruction requires 3 operands or always benefits from them and obviously not every function requires that many registers. The eSi-RISC architecture, for example, has 16-bit instructions that support 2 regs and access to the lowest 16-regs, as well as 32-bit instructions that support 3 operands and access to 32 regs. Cheers, Jon
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-02-07 16:00 -0800 |
| Message-ID | <9164712.4469.1328659228862.JavaMail.geo-discussion-forums@vbbfd4> |
| In reply to | #5795 |
On Tuesday, February 7, 2012 11:38:32 AM UTC-6, Tim McCaffrey wrote: > I seem to remember you quoting similar numbers for 32 vs 16 registers, I quote this as 3%-ish (compiled code) > and 3 operand vs 2 operand ISAs. Don't remember ever quoting a delta here. > So, design an ISA with 32 registers, 3 operand opcodes that is just as code > dense as an x86 (for I-Cache efficiency), Don't know if any RISC encoding ever got this close to x86 code density. > double the area, cost and power and > you might get anywhere from 15-50% performance increase. Given a several-wide x86 machine (like say an Opteron) in an already advanced process (like say Intel FABs): I you can double the power budget, you can probably get 10% more ILP and probably 30% more frequency. Mitch
[toc] | [prev] | [next] | [standalone]
| From | timcaffrey@aol.com (Tim McCaffrey) |
|---|---|
| Date | 2012-02-08 18:35 +0000 |
| Message-ID | <jgufaa$4l4$1@USTR-NEWS.TR.UNISYS.COM> |
| In reply to | #5809 |
In article <9164712.4469.1328659228862.JavaMail.geo-discussion-forums@vbbfd4>, MitchAlsup@aol.com says... > >On Tuesday, February 7, 2012 11:38:32 AM UTC-6, Tim McCaffrey wrote: > >> and 3 operand vs 2 operand ISAs. > >Don't remember ever quoting a delta here. > Perhaps it was an old John Mashey post I was thinking of. - Tim
[toc] | [prev] | [next] | [standalone]
| From | Partha <parthaspanda22@gmail.com> |
|---|---|
| Date | 2012-02-10 11:32 -0800 |
| Message-ID | <cc21bb69-16bf-4692-a8b4-7244bdc87a94@vd8g2000pbc.googlegroups.com> |
| In reply to | #5777 |
> Single threaded performance is pretty much tapped out. Given the current cores one might be able to get 5%-to-15% more out of a single thread at double the cost, area, and power. > > So, the economies of scale and basic limits have turned the tables, so that one can get more performance from 2,3,4,6,8,... cores than one can get from an equivalent amount of area/power/engineering into single threaded architecture. What if one looked at what's upstream? Like, programming languages for instance. It could be that a hardware issue(process technology) and an easy formula(Moore's law) could have been aimed at rather then exercising other options? Sincerely, Partha Sarathi Panda.
[toc] | [prev] | [next] | [standalone]
| From | nmm1@cam.ac.uk |
|---|---|
| Date | 2012-02-10 20:31 +0000 |
| Message-ID | <jh3uqm$c2$1@gosset.csi.cam.ac.uk> |
| In reply to | #5870 |
In article <cc21bb69-16bf-4692-a8b4-7244bdc87a94@vd8g2000pbc.googlegroups.com>, Partha <parthaspanda22@gmail.com> wrote: >> >> Single threaded performance is pretty much tapped out. Given the >> current cores one might be able to get 5%-to-15% more out of a >> single thread at double the cost, area, and power. >> >> So, the economies of scale and basic limits have turned the tables, >> so that one can get more performance from 2,3,4,6,8,... cores than >> one can get from an equivalent amount of area/power/engineering into >> single threaded architecture. > >What if one looked at what's upstream? Like, programming languages >for instance. It could be that a hardware issue(process technology) >and an easy formula(Moore's law) could have been aimed at rather >then exercising other options? Er, yes. I have been posting on that topic for a decade now! There is a lot going on in programming languages, though very little is as radical as I would favour. Regards, Nick Maclaren.
[toc] | [prev] | [next] | [standalone]
| From | "Unspecified" <partha@perfectvips.com> |
|---|---|
| Date | 2012-02-11 02:12 +0530 |
| Message-ID | <jh3veo$ojd$1@adenine.netfront.net> |
| In reply to | #5871 |
> Er, yes. I have been posting on that topic for a decade now! > There is a lot going on in programming languages, though very > little is as radical as I would favour. > Other then Transaction Memory, is there any other contribution? Also, I thought TM was a fallout of a decision taken in favour of multicores. Maybe I missed out on the fun in the programming languages. -- Sincerely, Partha Sarathi Panda. --- Posted via news://freenews.netfront.net/ - Complaints to news@netfront.net ---
[toc] | [prev] | [next] | [standalone]
| From | nmm1@cam.ac.uk |
|---|---|
| Date | 2012-02-10 21:04 +0000 |
| Message-ID | <jh40og$u5$1@gosset.csi.cam.ac.uk> |
| In reply to | #5872 |
In article <jh3veo$ojd$1@adenine.netfront.net>, Unspecified <partha@perfectvips.com> wrote: > >> Er, yes. I have been posting on that topic for a decade now! >> There is a lot going on in programming languages, though very >> little is as radical as I would favour. > >Other then Transaction Memory, is there any other contribution? >Also, I thought TM was a fallout of a decision taken in favour >of multicores. Look up OpenMP, Coarrays, C++11, Cilk, IBM X10, Cray Chapel, Sun Fortress and more. Regards, Nick Maclaren.
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-02-10 16:43 -0800 |
| Message-ID | <18985540.1742.1328921030856.JavaMail.geo-discussion-forums@yqks7> |
| In reply to | #5872 |
On Friday, February 10, 2012 2:42:03 PM UTC-6, Unspecified wrote: > Other then Transaction Memory, is there any other contribution? > Also, I thought TM was a fallout of a decision taken in favour > of multicores. Transactional Memory is an attempt to get around the inability of ATOMIC events to make forward progress in the face of contention. Where Atomic events are what one can do wiht the currently available primatives. Mitch
[toc] | [prev] | [next] | [standalone]
| From | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
|---|---|
| Date | 2012-02-10 19:48 -0800 |
| Message-ID | <4F35E51B.7010903@SPAM.comp-arch.net> |
| In reply to | #5876 |
On 2/10/2012 4:43 PM, MitchAlsup wrote: > On Friday, February 10, 2012 2:42:03 PM UTC-6, Unspecified wrote: >> Other then Transaction Memory, is there any other contribution? >> Also, I thought TM was a fallout of a decision taken in favour >> of multicores. > > Transactional Memory is an attempt to get around the inability > of ATOMIC events to make forward progress in the face of contention. > Where Atomic events are what one can do wiht the currently available > primatives. > > Mitch Hmm... AFAIK Intel's HW TM does NOT guarantee forward progress - if there is contention, i.e. if the read and write sets of simultaneously executing transactions conflict, then ALL may abort, and retry. In the lack of other mechanisms, potentially forever. Intel's HW TM has an alternate path that one must use to provide that other mechanism. Probably one should code the alternate path to use locks. Worst case, stop everyone else. HLE, of course, provides semantics equivalent to, actually slightly stronger than, locks.
[toc] | [prev] | [next] | [standalone]
| From | EricP <ThatWouldBeTelling@thevillage.com> |
|---|---|
| Date | 2012-02-12 14:31 -0500 |
| Message-ID | <2OUZq.14447$9j5.3110@newsfe09.iad> |
| In reply to | #5878 |
Andy (Super) Glew wrote: > On 2/10/2012 4:43 PM, MitchAlsup wrote: >> On Friday, February 10, 2012 2:42:03 PM UTC-6, Unspecified wrote: >>> Other then Transaction Memory, is there any other contribution? >>> Also, I thought TM was a fallout of a decision taken in favour >>> of multicores. >> >> Transactional Memory is an attempt to get around the inability >> of ATOMIC events to make forward progress in the face of contention. >> Where Atomic events are what one can do wiht the currently available >> primatives. >> >> Mitch > > Hmm... > > AFAIK Intel's HW TM does NOT guarantee forward progress - if there is > contention, i.e. if the read and write sets of simultaneously executing > transactions conflict, then ALL may abort, and retry. In the lack of > other mechanisms, potentially forever. Intels RTM looks the same as AMD's ASF in that they are really a conflict detector rather than TM as they don't do conflict resolution. AIUI conflict resolution requires either the ability to NAK a cache line read, or delay a read response for what could be a very long time, both of which are apparently bus protocol faux pas. > Intel's HW TM has an alternate path that one must use to provide that > other mechanism. Probably one should code the alternate path to use > locks. Worst case, stop everyone else. Yeah. If thread A collides with B then those two must serialize with a lock, but it doesn't mean that C and D cannot proceed. Even if A has the lock it can still collide and must retry. But if A has multiple collisions then escalate and force everyone through the lock (this sets an upper bound on the progress delay). Because RTM doesn't do conflict resolution there will be situations where it is more expensive to use RTM/ASF than a lock. For example, a double linked list where all the pushes and pops occur at the list head. You either collide or you don't, and if you do then one has to wait. So RTM/ASF are no help. Eric
[toc] | [prev] | [next] | [standalone]
| From | Stefan Monnier <monnier@iro.umontreal.ca> |
|---|---|
| Date | 2012-02-12 21:50 -0500 |
| Message-ID | <jwvk43r259i.fsf-monnier+comp.arch@gnu.org> |
| In reply to | #5876 |
> Transactional Memory is an attempt to get around the inability
> of ATOMIC events to make forward progress in the face of contention.
Another perspective is that atomicity is a higher-level concept than
locks, which might lead to fewer bugs. IOW it is hoped that more
programmers are able to use correctly atomicity than locks. Also, it is
easier to automatically check that all shared objects are accessed
atomically rather than checking that the locks are used correctly.
Whether it's sufficiently better (and whether it can be made to perform)
is still up for debate, of course,
Stefan
[toc] | [prev] | [next] | [standalone]
| From | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
|---|---|
| Date | 2012-02-12 19:45 -0800 |
| Message-ID | <4F38875A.2050407@SPAM.comp-arch.net> |
| In reply to | #5903 |
On 2/12/2012 6:50 PM, Stefan Monnier wrote: >> Transactional Memory is an attempt to get around the inability >> of ATOMIC events to make forward progress in the face of contention. > > Another perspective is that atomicity is a higher-level concept than > locks, which might lead to fewer bugs. IOW it is hoped that more > programmers are able to use correctly atomicity than locks. Also, it is > easier to automatically check that all shared objects are accessed > atomically rather than checking that the locks are used correctly. > > Whether it's sufficiently better (and whether it can be made to perform) > is still up for debate, of course, > > > Stefan What I don't like is that current hardware transactional memory doesn't scale, in the presence of both large footprints (large amounts of memory involved in the transaction), and in the presence of conflicts. Most HW TM that I am aware of - including the cache protocol I designed for it, which may or may not be related to what Ravi is now using - relies on snooping. E.g. if the transaction is too big too big to fit in the L1 cache, the transaction will never finish as a transaction. Or L2, or L3. Or even smaller data structures, before the cache. Worse, if there is thrashing in the ways, of a typically 4 way associative L1$, TM loses. I.e. the transaction will abort. Or a 16 way L2. Or ... I.e. a transaction may work as a transaction when running single threaded. But it may abort forever if running multithreaded thrashes the cache just a wee bit. So, you have to code a non-TM mechanism as a fallback. Probably using locks. -- Perhaps a nice aspect will be that transactions can act as a bug detector for locking. If you have a mutex, and the transaction always fails, then that tells you that probably not everyone is respecting the locks. (Or that the mutx is too big for the transaction.) -- What would a scalable implementation of hardware TM look like? Well, it can't depend on finite size caches, except as an accelerator. How about a log? Log all of the memory accesses in the transaction, possibly filtered and coalesced to reduce size. In the transaction commit phase, "lock" the lines in memory affected. This would require a directory - which might be scalable if there was a lock bit per line. (Which is one of the Holy Grails of computer architecture.) Once locked, commit. Possibly with a "was the value I read still there" approach to validation. Exercise for the reader in how to make this locking not prevent other transactions from going forward. And using the caches to filter as much work as possible - e.g. if it fits in the cache, don't use the log.
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-02-12 20:36 -0800 |
| Message-ID | <12562993.824.1329107776365.JavaMail.geo-discussion-forums@ynnj2> |
| In reply to | #5904 |
On Sunday, February 12, 2012 9:45:30 PM UTC-6, Andy (Super) Glew wrote: > What would a scalable implementation of hardware TM look like? Well, it > can't depend on finite size caches, except as an accelerator. May I suggest that scalability is going to require that the Software gain some understanding of the amounts and kinds of interference and then have the SW do something proactive about it. My original ASF had this feature, the current one does not. Mitch
[toc] | [prev] | [next] | [standalone]
| From | "Andy (Super) Glew" <andy@SPAM.comp-arch.net> |
|---|---|
| Date | 2012-02-13 06:46 -0800 |
| Message-ID | <4F392235.1020804@SPAM.comp-arch.net> |
| In reply to | #5906 |
On 2/12/2012 8:36 PM, MitchAlsup wrote: > On Sunday, February 12, 2012 9:45:30 PM UTC-6, Andy (Super) Glew wrote: > >> What would a scalable implementation of hardware TM look like? Well, it >> can't depend on finite size caches, except as an accelerator. > > May I suggest that scalability is going to require that the Software gain some understanding of the amounts and kinds of interference and then have the SW do something proactive about it. > > My original ASF had this feature, the current one does not. > > Mitch What does that look like? Beyond what Intel TM already does: provide a way for the transaction to fail, and for SW to resort to some other algorithm, probably lock based. The proactive part would be for SW to say "Hey, it doesn't look as if this transaction will ever pass as a transaction - because it uses too much data, and/or because there are hot spots - so I won't bother ever to use a HW transaction for it." Possibly with profiling feedback, e.g. EMON profiling feedback, so that SW can learn even when it can't predict statically. What more do you want? -- By the way, lots of people jump on the fact that Intel's current HW TM has no transactional escape - no way to do memory accesses from INSIDE the transaction that go outside the transaction. E.g. no way to call malloc() from inside a transaction without aborting (because malloc often updates shared pointers). (Yes, I know: you can code mallocs so that they only update shared stuff if private pools are exhausted.)
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <MitchAlsup@aol.com> |
|---|---|
| Date | 2012-02-13 08:58 -0800 |
| Message-ID | <16314332.739.1329152301273.JavaMail.geo-discussion-forums@ynnk21> |
| In reply to | #5911 |
On Monday, February 13, 2012 8:46:13 AM UTC-6, Andy (Super) Glew wrote: > What does that look like? We had a device in the machine (much like a TLB) that kept track of the order of synchronization events. When a conflict was detected, it provided a number (instead of yes/no) and that number could be used by SW to proactively avoid interference on the subsequent attempt. So, lets say an interrupt went off and 18 tasks simultaneously began to access a concurrent data structure. ON the first attempt at most 1 task makes forward progress and 17 others receive numbers from 1 through 17. On a subsequent access to the CDS, those tasks can use the number to index deeped intothe CDS and access an element that is not being simultaneously accessed my 16 others. Presto, on this worst case event in lock based CDS accesss O(N**2), ASF would give O(3), that is near best case. In average CDS accesses properly programmed ASF could achieve O(ln(n)) instead O(n**2) where n is the number in interfering acccessors. <snip> > What more do you want? Feedback from the HW that can be used by properly written SW to proactively avoid interference on subsequent accesses. Mitch
[toc] | [prev] | [next] | [standalone]
| From | "Paul A. Clayton" <paaronclayton@gmail.com> |
|---|---|
| Date | 2012-02-13 16:19 -0800 |
| Message-ID | <ae37e65b-43ea-42b6-80cf-d7ba5598bbdc@j14g2000vba.googlegroups.com> |
| In reply to | #5915 |
On Feb 13, 11:58 am, MitchAlsup <MitchAl...@aol.com> wrote: > On Monday, February 13, 2012 8:46:13 AM UTC-6, Andy (Super) Glew wrote: > > What does that look like? > > We had a device in the machine (much like a TLB) that kept > track of the order of synchronization events. When a conflict > was detected, it provided a number (instead of yes/no) and > that number could be used by SW to proactively avoid > interference on the subsequent attempt. > > So, lets say an interrupt went off and 18 tasks simultaneously > began to access a concurrent data structure. ON the first > attempt at most 1 task makes forward progress and 17 others The Intel RTM (and AMD ASF) specifications do not guarantee forward progress. (I am guessing that this is meant to allow simpler implementations while allowing a more complex implementation to provide such in general.) > receive numbers from 1 through 17. On a subsequent access to > the CDS, those tasks can use the number to index deeped into > the CDS and access an element that is not being simultaneously > accessed my 16 others. I suppose one could have a mildly bad case where another thread (that was busy before) wins the transaction for the first entry. (Cache block granularity of monitoring might also force a bloating of the data structure.) I wonder if something like a coallescing fetch-and-add could be useful in such cases. (It might even be possible to use something like coallescing fetch-and-add to support limited conflicting updates within transactions. Unfortunately, it seems that such would require all transactions 'later' than a failing transaction to fail [or a complex fix-up mechanism would be needed to provide the correct value and re-execute any dependent operations-- yuck!].) It seems that such a mechanism might be used to support a low overhead lock queue when the conflict is essential. Each thread could have a cache line that is monitored and written by the previous thread when it releases its hold on the lock. If the monitor was removed (e.g., by the OS scheduler), the OS would have to do something to avoid lock-up (perhaps change the next pointer in the previous entry or manage the lock queue in the OS). > Presto, on this worst case event in lock based CDS accesss > O(N**2), ASF would give O(3), that is near best case. In > average CDS accesses properly programmed ASF could achieve > O(ln(n)) instead O(n**2) where n is the number in interfering > acccessors. Did this provide an instruction or data address of the conflict or was it intended for simple transactions? > <snip> > > > What more do you want? > > Feedback from the HW that can be used by properly written SW to > proactively avoid interference on subsequent accesses. In addition to accessing a given entry in a queue, information _might_ be useful to break large transactions into separate non-conflicting pieces and use locking to handle the conflicting portion. Even known the degree of conflict could be useful, especially if one thread is (nearly) guaranteed to make forward progress. In that case, just retrying the transaction may be more appropriate (without additional delay) may be appropriate).
[toc] | [prev] | [next] | [standalone]
| From | Rick Jones <rick.jones2@hp.com> |
|---|---|
| Date | 2012-02-14 03:55 +0000 |
| Message-ID | <jhclv8$h80$1@usenet01.boi.hp.com> |
| In reply to | #5920 |
Paul A. Clayton <paaronclayton@gmail.com> wrote: > On Feb 13, 11:58 am, MitchAlsup <MitchAl...@aol.com> wrote: > > On Monday, February 13, 2012 8:46:13 AM UTC-6, Andy (Super) Glew wrote: > > > What does that look like? > > > > We had a device in the machine (much like a TLB) that kept > > track of the order of synchronization events. When a conflict > > was detected, it provided a number (instead of yes/no) and > > that number could be used by SW to proactively avoid > > interference on the subsequent attempt. > > > > So, lets say an interrupt went off and 18 tasks simultaneously > > began to access a concurrent data structure. ON the first > > attempt at most 1 task makes forward progress and 17 others > The Intel RTM (and AMD ASF) specifications do not guarantee > forward progress. (I am guessing that this is meant to allow > simpler implementations while allowing a more complex > implementation to provide such in general.) It may not really apply at all, but for some reason the lack of a guarantee of forward progress reminds me of CSMA/CD in Ethernet and "capture effect" which was prevalent before full-duplex became the norm. rick jones -- the road to hell is paved with business decisions... these opinions are mine, all mine; HP might not want them anyway... :) feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <"terje.mathisen at tmsw.no"> |
|---|---|
| Date | 2012-02-14 10:30 +0100 |
| Message-ID | <u5mp09-p191.ln1@ntp6.tmsw.no> |
| In reply to | #5923 |
Rick Jones wrote: > It may not really apply at all, but for some reason the lack of a > guarantee of forward progress reminds me of CSMA/CD in Ethernet and > "capture effect" which was prevalent before full-duplex became the > norm. Standard Ethernet has one huge advantage: Exponential randomized backoff is part of the spec. As long as everyone implements this you get stable operation and guaranteed forward progress. (This would correspond to having a similar exponential wait in every busy loop retrying a lock, everyone accessing the lock using the same algorithm. This is not going to happen. :-() This has of course let some unscrupulous vendors make "Gaming Ethernet" cards that intentionally breaks the spec, either by waiting less than the prescribed minimum time after a collisions and/or by retrying much more rapidly than allowed.) Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | Andrew Reilly <areilly---@bigpond.net.au> |
|---|---|
| Date | 2012-02-14 10:49 +0000 |
| Message-ID | <9push9Ft0mU1@mid.individual.net> |
| In reply to | #5925 |
On Tue, 14 Feb 2012 10:30:05 +0100, Terje Mathisen wrote: > This would correspond to having a similar exponential wait in every busy > loop retrying a lock, everyone accessing the lock using the same > algorithm. This is not going to happen. :-( Isn't it? I was under the impression that lock contention was a single- process phenomon, which is essentially the prerequisite for having control over that sort of thing. Surely the only thing between this happening and the current situation is a successful library that does this by default? Given the success of Ethernet, I am quite surprised that the same approach doesn't seem to be common or popular for lock contention. Perhaps because there are still many different use scenarios, and the fact that operating systems aren't famously good at reliable short time- outs. Cheers, -- Andrew
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <"terje.mathisen at tmsw.no"> |
|---|---|
| Date | 2012-02-14 13:21 +0100 |
| Message-ID | <h60q09-sd91.ln1@ntp6.tmsw.no> |
| In reply to | #5927 |
Andrew Reilly wrote: > On Tue, 14 Feb 2012 10:30:05 +0100, Terje Mathisen wrote: > >> This would correspond to having a similar exponential wait in every busy >> loop retrying a lock, everyone accessing the lock using the same >> algorithm. This is not going to happen. :-( > > Isn't it? I was under the impression that lock contention was a single- > process phenomon, which is essentially the prerequisite for having > control over that sort of thing. Surely the only thing between this > happening and the current situation is a successful library that does > this by default? > > Given the success of Ethernet, I am quite surprised that the same > approach doesn't seem to be common or popular for lock contention. > Perhaps because there are still many different use scenarios, and the > fact that operating systems aren't famously good at reliable short time- > outs. I think the main problem is that the performance goes down so very quickly, i.e. any given programmer will believe that "I can do much better, I don't need all the overhead of a random exp backoff, I can simply spin a couple of times until the current owner is finished." I.e. it works well now, on the developer's quad laptop, it won't work at all in live environment with 100+ cores and 1K+ processes. :-( Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
Page 3 of 4 — ← Prev page 1 2 [3] 4 Next page →
Back to top | Article view | comp.arch
csiph-web