Groups > comp.arch > #5703 > unrolled thread

Single Thread Performance

Started by	"Unspecified" <partha@perfectvips.com>
First post	2012-02-04 21:54 +0530
Last post	2012-02-08 01:04 -0800
Articles	20 on this page of 72 — 25 participants

Back to article view | Back to comp.arch

  Single Thread Performance "Unspecified" <partha@perfectvips.com> - 2012-02-04 21:54 +0530
    Re: Single Thread Performance Brett Davis <ggtgp@yahoo.com> - 2012-02-06 05:55 -0600
      Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-06 13:06 +0000
        Re: Single Thread Performance Robert Myers <rbmyersusa@gmail.com> - 2012-02-06 14:12 -0500
          Re: Single Thread Performance BGB <cr88192@hotmail.com> - 2012-02-06 13:36 -0700
            Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-06 20:47 +0000
              Re: Single Thread Performance BGB <cr88192@hotmail.com> - 2012-02-06 15:07 -0700
                Re: Single Thread Performance Brett Davis <ggtgp@yahoo.com> - 2012-02-06 16:32 -0600
                  Re: Single Thread Performance Robert Wessel <robertwessel2@yahoo.com> - 2012-02-06 17:45 -0600
                    Re: Single Thread Performance Brett Davis <ggtgp@yahoo.com> - 2012-02-07 06:01 -0600
                      Re: Single Thread Performance BGB <cr88192@hotmail.com> - 2012-02-07 13:32 -0700
                        Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-09 19:08 +0000
                          Re: Single Thread Performance BGB <cr88192@hotmail.com> - 2012-02-10 08:56 -0700
          Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-06 20:42 +0000
            Re: Single Thread Performance Robert Myers <rbmyersusa@gmail.com> - 2012-02-06 19:36 -0500
            Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-06 18:28 -0800
              Re: Single Thread Performance Robert Myers <rbmyersusa@gmail.com> - 2012-02-06 22:23 -0500
              Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-07 06:52 +0000
          Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-06 12:10 -0800
            Re: Single Thread Performance Thomas Womack <twomack@chiark.greenend.org.uk> - 2012-02-07 10:13 +0000
              Re: Single Thread Performance Brett Davis <ggtgp@yahoo.com> - 2012-02-20 23:58 -0600
            Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-07 17:33 +0000
              Re: Single Thread Performance nedbrek <nedbrek@yahoo.com> - 2012-02-15 08:10 -0500
      Re: Single Thread Performance Robert Myers <rbmyersusa@gmail.com> - 2012-02-06 14:17 -0500
        Re: Single Thread Performance del cecchi <delcecchi@gmail.com> - 2012-02-25 22:07 -0800
      Re: Single Thread Performance jgk@panix.com (Joe keane) - 2012-02-07 17:57 +0000
    Re: Single Thread Performance Quadibloc <jsavard@ecn.ab.ca> - 2012-02-05 13:13 -0800
    Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-05 21:35 -0800
      Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-07 17:38 +0000
        Re: Single Thread Performance Stephen Sprunk <stephen@sprunk.org> - 2012-02-07 14:54 -0600
          Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-07 21:33 +0000
            Re: Single Thread Performance Stephen Sprunk <stephen@sprunk.org> - 2012-02-07 23:13 -0600
              Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-08 18:54 +0000
                Re: Single Thread Performance Stephen Sprunk <stephen@sprunk.org> - 2012-02-08 15:17 -0600
                  Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-09 08:13 +0100
                  Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-09 17:08 +0000
                    Re: Single Thread Performance Stephen Sprunk <stephen@sprunk.org> - 2012-02-09 16:01 -0600
                Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-09 07:56 +0100
                  Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-09 17:18 +0000
            Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-08 10:17 +0100
          Re: Single Thread Performance Jon <jon@beniston.com> - 2012-02-08 05:32 -0800
        Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-07 16:00 -0800
          Re: Single Thread Performance timcaffrey@aol.com (Tim McCaffrey) - 2012-02-08 18:35 +0000
      Re: Single Thread Performance Partha <parthaspanda22@gmail.com> - 2012-02-10 11:32 -0800
        Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-10 20:31 +0000
          Re: Single Thread Performance "Unspecified" <partha@perfectvips.com> - 2012-02-11 02:12 +0530
            Re: Single Thread Performance nmm1@cam.ac.uk - 2012-02-10 21:04 +0000
            Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-10 16:43 -0800
              Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-10 19:48 -0800
                Re: Single Thread Performance EricP <ThatWouldBeTelling@thevillage.com> - 2012-02-12 14:31 -0500
              Re: Single Thread Performance Stefan Monnier <monnier@iro.umontreal.ca> - 2012-02-12 21:50 -0500
                Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-12 19:45 -0800
                  Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-12 20:36 -0800
                    Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-13 06:46 -0800
                      Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-13 08:58 -0800
                        Re: Single Thread Performance "Paul A. Clayton" <paaronclayton@gmail.com> - 2012-02-13 16:19 -0800
                          Re: Single Thread Performance Rick Jones <rick.jones2@hp.com> - 2012-02-14 03:55 +0000
                            Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-14 10:30 +0100
                              Re: Single Thread Performance Andrew Reilly <areilly---@bigpond.net.au> - 2012-02-14 10:49 +0000
                                Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-14 13:21 +0100
                                  Re: Single Thread Performance Stephen Fuld <SFuld@alumni.cmu.edu.invalid> - 2012-02-14 13:11 -0800
                              Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-14 09:29 -0800
                              Re: Single Thread Performance EricP <ThatWouldBeTelling@thevillage.com> - 2012-02-14 12:40 -0500
                                Re: Single Thread Performance EricP <ThatWouldBeTelling@thevillage.com> - 2012-02-14 16:12 -0500
                                Re: Single Thread Performance Rick Jones <rick.jones2@hp.com> - 2012-02-14 21:14 +0000
                                  Re: Single Thread Performance Rick Jones <rick.jones2@hp.com> - 2012-02-14 21:16 +0000
                              Re: Single Thread Performance Rick Jones <rick.jones2@hp.com> - 2012-02-14 21:09 +0000
                          Re: Single Thread Performance MitchAlsup <MitchAlsup@aol.com> - 2012-02-14 09:26 -0800
                            Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-15 08:44 +0100
                              Re: Single Thread Performance "Andy (Super) Glew" <andy@SPAM.comp-arch.net> - 2012-02-15 01:07 -0800
                        Re: Single Thread Performance Terje Mathisen <"terje.mathisen at tmsw.no"> - 2012-02-14 10:16 +0100
    Re: Single Thread Performance Michael S <already5chosen@yahoo.com> - 2012-02-08 01:04 -0800

Page 3 of 4 — ← Prev page 1 2 [3] 4 Next page →

#5816

From	Jon <jon@beniston.com>
Date	2012-02-08 05:32 -0800
Message-ID	<ba6087e6-12e2-4df3-ab75-c86662934d69@e27g2000vbu.googlegroups.com>
In reply to	#5803

> How?  That would necessitate going from six (3+3) to fifteen (5+5+5)
> operand bits per instruction.
>
> Granted, you would eliminate some instructions by reducing stack spills
> and register copies, but it's hard to imagine that would be enough to
> overcome increasing the size of _every_ instruction by an entire byte.

You don't need to increase every instruction. Not every instruction
requires 3 operands or always benefits from them and obviously not
every function requires that many registers.

The eSi-RISC architecture, for example, has 16-bit instructions that
support 2 regs and access to the lowest 16-regs, as well as 32-bit
instructions that support 3 operands and access to 32 regs.

Cheers,
Jon

[toc] | [prev] | [next] | [standalone]

#5809

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-02-07 16:00 -0800
Message-ID	<9164712.4469.1328659228862.JavaMail.geo-discussion-forums@vbbfd4>
In reply to	#5795

On Tuesday, February 7, 2012 11:38:32 AM UTC-6, Tim McCaffrey wrote:
> I seem to remember you quoting similar numbers for 32 vs 16 registers, 

I quote this as 3%-ish (compiled code)

> and 3 operand vs 2 operand ISAs. 

Don't remember ever quoting a delta here.

> So, design an ISA with 32 registers, 3 operand opcodes that is just as code 
> dense as an x86 (for I-Cache efficiency),

Don't know if any RISC encoding ever got this close to x86 code density.

> double the area, cost and power and 
> you might get anywhere from 15-50% performance increase. 

Given a several-wide x86 machine (like say an Opteron) in an already advanced process (like say Intel FABs): I you can double the power budget, you can probably get 10% more ILP and probably 30% more frequency.

Mitch

[toc] | [prev] | [next] | [standalone]

#5821

From	timcaffrey@aol.com (Tim McCaffrey)
Date	2012-02-08 18:35 +0000
Message-ID	<jgufaa$4l4$1@USTR-NEWS.TR.UNISYS.COM>
In reply to	#5809

In article 
<9164712.4469.1328659228862.JavaMail.geo-discussion-forums@vbbfd4>, 
MitchAlsup@aol.com says...
>
>On Tuesday, February 7, 2012 11:38:32 AM UTC-6, Tim McCaffrey wrote:
>
>> and 3 operand vs 2 operand ISAs. 
>
>Don't remember ever quoting a delta here.
>

Perhaps it was an old John Mashey post I was thinking of.

	- Tim

[toc] | [prev] | [next] | [standalone]

#5870

From	Partha <parthaspanda22@gmail.com>
Date	2012-02-10 11:32 -0800
Message-ID	<cc21bb69-16bf-4692-a8b4-7244bdc87a94@vd8g2000pbc.googlegroups.com>
In reply to	#5777

> Single threaded performance is pretty much tapped out. Given the current cores one might be able to get 5%-to-15% more out of a single thread at double the cost, area, and power.
>
> So, the economies of scale and basic limits have turned the tables, so that one can get more performance from 2,3,4,6,8,... cores than one can get from an equivalent amount of area/power/engineering into single threaded architecture.

What if one looked at what's upstream? Like, programming languages
for instance. It could be that a hardware issue(process technology)
and
an easy formula(Moore's law) could have been aimed at rather
then exercising other options?

Sincerely,
Partha Sarathi Panda.

[toc] | [prev] | [next] | [standalone]

#5871

From	nmm1@cam.ac.uk
Date	2012-02-10 20:31 +0000
Message-ID	<jh3uqm$c2$1@gosset.csi.cam.ac.uk>
In reply to	#5870

In article <cc21bb69-16bf-4692-a8b4-7244bdc87a94@vd8g2000pbc.googlegroups.com>,
Partha  <parthaspanda22@gmail.com> wrote:
>>
>> Single threaded performance is pretty much tapped out. Given the
>> current cores one might be able to get 5%-to-15% more out of a
>> single thread at double the cost, area, and power.
>>
>> So, the economies of scale and basic limits have turned the tables,
>> so that one can get more performance from 2,3,4,6,8,... cores than
>> one can get from an equivalent amount of area/power/engineering into
>> single threaded architecture.
>
>What if one looked at what's upstream? Like, programming languages
>for instance. It could be that a hardware issue(process technology)
>and an easy formula(Moore's law) could have been aimed at rather
>then exercising other options?

Er, yes.  I have been posting on that topic for a decade now!
There is a lot going on in programming languages, though very
little is as radical as I would favour.


Regards,
Nick Maclaren.

[toc] | [prev] | [next] | [standalone]

#5872

From	"Unspecified" <partha@perfectvips.com>
Date	2012-02-11 02:12 +0530
Message-ID	<jh3veo$ojd$1@adenine.netfront.net>
In reply to	#5871

> Er, yes.  I have been posting on that topic for a decade now!
> There is a lot going on in programming languages, though very
> little is as radical as I would favour.
>

Other then Transaction Memory, is there any other contribution?
Also, I thought TM was a fallout of a decision taken in favour
of multicores.

Maybe I missed out on the fun in the programming languages.

-- 

Sincerely,
Partha Sarathi Panda.

 



--- Posted via news://freenews.netfront.net/ - Complaints to news@netfront.net ---

[toc] | [prev] | [next] | [standalone]

#5874

From	nmm1@cam.ac.uk
Date	2012-02-10 21:04 +0000
Message-ID	<jh40og$u5$1@gosset.csi.cam.ac.uk>
In reply to	#5872

In article <jh3veo$ojd$1@adenine.netfront.net>,
Unspecified <partha@perfectvips.com> wrote:
>
>> Er, yes.  I have been posting on that topic for a decade now!
>> There is a lot going on in programming languages, though very
>> little is as radical as I would favour.
>
>Other then Transaction Memory, is there any other contribution?
>Also, I thought TM was a fallout of a decision taken in favour
>of multicores.

Look up OpenMP, Coarrays, C++11, Cilk, IBM X10, Cray Chapel,
Sun Fortress and more.


Regards,
Nick Maclaren.

[toc] | [prev] | [next] | [standalone]

#5876

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-02-10 16:43 -0800
Message-ID	<18985540.1742.1328921030856.JavaMail.geo-discussion-forums@yqks7>
In reply to	#5872

On Friday, February 10, 2012 2:42:03 PM UTC-6, Unspecified wrote:
> Other then Transaction Memory, is there any other contribution?
> Also, I thought TM was a fallout of a decision taken in favour
> of multicores.

Transactional Memory is an attempt to get around the inability 
of ATOMIC events to make forward progress in the face of contention.
Where Atomic events are what one can do wiht the currently available 
primatives.

Mitch

[toc] | [prev] | [next] | [standalone]

#5878

From	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Date	2012-02-10 19:48 -0800
Message-ID	<4F35E51B.7010903@SPAM.comp-arch.net>
In reply to	#5876

On 2/10/2012 4:43 PM, MitchAlsup wrote:
> On Friday, February 10, 2012 2:42:03 PM UTC-6, Unspecified wrote:
>> Other then Transaction Memory, is there any other contribution?
>> Also, I thought TM was a fallout of a decision taken in favour
>> of multicores.
>
> Transactional Memory is an attempt to get around the inability
> of ATOMIC events to make forward progress in the face of contention.
> Where Atomic events are what one can do wiht the currently available
> primatives.
>
> Mitch

Hmm...

AFAIK Intel's HW TM does NOT guarantee forward progress - if there is 
contention, i.e. if the read and write sets of simultaneously executing 
transactions conflict, then ALL may abort, and retry.  In the lack of 
other mechanisms, potentially forever.

Intel's HW TM has an alternate path that one must use to provide that 
other mechanism.  Probably one should code the alternate path to use 
locks. Worst case, stop everyone else.

HLE, of course, provides semantics equivalent to, actually slightly 
stronger than, locks.

[toc] | [prev] | [next] | [standalone]

#5897

From	EricP <ThatWouldBeTelling@thevillage.com>
Date	2012-02-12 14:31 -0500
Message-ID	<2OUZq.14447$9j5.3110@newsfe09.iad>
In reply to	#5878

Andy (Super) Glew wrote:
> On 2/10/2012 4:43 PM, MitchAlsup wrote:
>> On Friday, February 10, 2012 2:42:03 PM UTC-6, Unspecified wrote:
>>> Other then Transaction Memory, is there any other contribution?
>>> Also, I thought TM was a fallout of a decision taken in favour
>>> of multicores.
>>
>> Transactional Memory is an attempt to get around the inability
>> of ATOMIC events to make forward progress in the face of contention.
>> Where Atomic events are what one can do wiht the currently available
>> primatives.
>>
>> Mitch
> 
> Hmm...
> 
> AFAIK Intel's HW TM does NOT guarantee forward progress - if there is 
> contention, i.e. if the read and write sets of simultaneously executing 
> transactions conflict, then ALL may abort, and retry.  In the lack of 
> other mechanisms, potentially forever.

Intels RTM looks the same as AMD's ASF in that they
are really a conflict detector rather than TM as
they don't do conflict resolution.

AIUI conflict resolution requires either the ability to NAK
a cache line read, or delay a read response for what could be a very
long time, both of which are apparently bus protocol faux pas.

> Intel's HW TM has an alternate path that one must use to provide that 
> other mechanism.  Probably one should code the alternate path to use 
> locks. Worst case, stop everyone else.

Yeah. If thread A collides with B then those two must serialize
with a lock, but it doesn't mean that C and D cannot proceed.
Even if A has the lock it can still collide and must retry.
But if A has multiple collisions then escalate and force everyone
through the lock (this sets an upper bound on the progress delay).

Because RTM doesn't do conflict resolution there will be
situations where it is more expensive to use RTM/ASF than a lock.

For example, a double linked list where all the pushes and pops
occur at the list head. You either collide or you don't,
and if you do then one has to wait. So RTM/ASF are no help.

Eric

[toc] | [prev] | [next] | [standalone]

#5903

From	Stefan Monnier <monnier@iro.umontreal.ca>
Date	2012-02-12 21:50 -0500
Message-ID	<jwvk43r259i.fsf-monnier+comp.arch@gnu.org>
In reply to	#5876

> Transactional Memory is an attempt to get around the inability
> of ATOMIC events to make forward progress in the face of contention.

Another perspective is that atomicity is a higher-level concept than
locks, which might lead to fewer bugs.  IOW it is hoped that more
programmers are able to use correctly atomicity than locks.  Also, it is
easier to automatically check that all shared objects are accessed
atomically rather than checking that the locks are used correctly.

Whether it's sufficiently better (and whether it can be made to perform)
is still up for debate, of course,


        Stefan

[toc] | [prev] | [next] | [standalone]

#5904

From	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Date	2012-02-12 19:45 -0800
Message-ID	<4F38875A.2050407@SPAM.comp-arch.net>
In reply to	#5903

On 2/12/2012 6:50 PM, Stefan Monnier wrote:
>> Transactional Memory is an attempt to get around the inability
>> of ATOMIC events to make forward progress in the face of contention.
>
> Another perspective is that atomicity is a higher-level concept than
> locks, which might lead to fewer bugs.  IOW it is hoped that more
> programmers are able to use correctly atomicity than locks.  Also, it is
> easier to automatically check that all shared objects are accessed
> atomically rather than checking that the locks are used correctly.
>
> Whether it's sufficiently better (and whether it can be made to perform)
> is still up for debate, of course,
>
>
>          Stefan

What I don't like is that current hardware transactional memory doesn't 
scale, in the presence of both large footprints (large amounts of memory 
involved in the transaction), and in the presence of conflicts.

Most HW TM that I am aware of - including the cache protocol I designed 
for it, which may or may not be related to what Ravi is now using - 
relies on snooping.  E.g. if the transaction is too big too big to fit 
in the L1 cache, the transaction will never finish as a transaction. Or 
L2, or L3.  Or even smaller data structures, before the cache.

Worse, if there is thrashing in the ways, of a typically 4 way 
associative L1$, TM loses. I.e. the transaction will abort. Or a 16 way 
L2. Or ...

I.e. a transaction may work as a transaction when running single 
threaded.  But it may abort forever if running multithreaded thrashes 
the cache just a wee bit.

So, you have to code a non-TM mechanism as a fallback. Probably using 
locks.

--

Perhaps a nice aspect will be that transactions can act as a bug 
detector for locking.   If you have a mutex, and the transaction always 
fails, then that tells you that probably not everyone is respecting the 
locks. (Or that the mutx is too big for the transaction.)

--

What would a scalable implementation of hardware TM look like?  Well, it 
can't depend on finite size caches, except as an accelerator.

How about a log?  Log all of the memory accesses in the transaction, 
possibly filtered and coalesced to reduce size.

In the transaction commit phase, "lock" the lines in memory affected. 
This would require a directory - which might be scalable if there was a 
lock bit per line.  (Which is one of the Holy Grails of computer 
architecture.)

Once locked, commit.  Possibly with a "was the value I read still there" 
approach to validation.

Exercise for the reader in how to make this locking not prevent other 
transactions from going forward.  And using the caches to filter as much 
work as possible - e.g. if it fits in the cache, don't use the log.

[toc] | [prev] | [next] | [standalone]

#5906

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-02-12 20:36 -0800
Message-ID	<12562993.824.1329107776365.JavaMail.geo-discussion-forums@ynnj2>
In reply to	#5904

On Sunday, February 12, 2012 9:45:30 PM UTC-6, Andy (Super) Glew wrote:

> What would a scalable implementation of hardware TM look like?  Well, it 
> can't depend on finite size caches, except as an accelerator.

May I suggest that scalability is going to require that the Software gain some understanding of the amounts and kinds of interference and then have the SW do something proactive about it.

My original ASF had this feature, the current one does not.

Mitch

[toc] | [prev] | [next] | [standalone]

#5911

From	"Andy (Super) Glew" <andy@SPAM.comp-arch.net>
Date	2012-02-13 06:46 -0800
Message-ID	<4F392235.1020804@SPAM.comp-arch.net>
In reply to	#5906

On 2/12/2012 8:36 PM, MitchAlsup wrote:
> On Sunday, February 12, 2012 9:45:30 PM UTC-6, Andy (Super) Glew wrote:
>
>> What would a scalable implementation of hardware TM look like?  Well, it
>> can't depend on finite size caches, except as an accelerator.
>
> May I suggest that scalability is going to require that the Software gain some understanding of the amounts and kinds of interference and then have the SW do something proactive about it.
>
> My original ASF had this feature, the current one does not.
>
> Mitch

What does that look like?

Beyond what Intel TM already does: provide a way for the transaction to 
fail, and for SW to resort to some other algorithm, probably lock based.

The proactive part would be for SW to say "Hey, it doesn't look as if 
this transaction will ever pass as a transaction - because it uses too 
much data, and/or because there are hot spots - so I won't bother ever 
to use a HW transaction for it."   Possibly with profiling feedback, 
e.g. EMON profiling feedback, so that SW can learn even when it can't 
predict statically.

What more do you want?

--

By the way, lots of people jump on the fact that Intel's current HW TM 
has no transactional escape - no way to do memory accesses from INSIDE 
the transaction that go outside the transaction.  E.g. no way to call 
malloc() from inside a transaction without aborting (because malloc 
often updates shared pointers).  (Yes, I know:  you can code mallocs so 
that they only update shared stuff if private pools are exhausted.)

[toc] | [prev] | [next] | [standalone]

#5915

From	MitchAlsup <MitchAlsup@aol.com>
Date	2012-02-13 08:58 -0800
Message-ID	<16314332.739.1329152301273.JavaMail.geo-discussion-forums@ynnk21>
In reply to	#5911

On Monday, February 13, 2012 8:46:13 AM UTC-6, Andy (Super) Glew wrote:
> What does that look like?

We had a device in the machine (much like a TLB) that kept track of the order of synchronization events. When a conflict was detected, it provided a number (instead of yes/no) and that number could be used by SW to proactively avoid interference on the subsequent attempt.

So, lets say an interrupt went off and 18 tasks simultaneously began to access a concurrent data structure. ON the first attempt at most 1 task makes forward progress and 17 others receive numbers from 1 through 17. On a subsequent access to the CDS, those tasks can use the number to index deeped intothe CDS and access an element that is not being simultaneously accessed my 16 others.

Presto, on this worst case event in lock based CDS accesss O(N**2), ASF would give O(3), that is near best case. In average CDS accesses properly programmed ASF could achieve O(ln(n)) instead O(n**2) where n is the number in interfering acccessors.
<snip>
> What more do you want?

Feedback from the HW that can be used by properly written SW to proactively avoid interference on subsequent accesses.

Mitch

[toc] | [prev] | [next] | [standalone]

#5920

From	"Paul A. Clayton" <paaronclayton@gmail.com>
Date	2012-02-13 16:19 -0800
Message-ID	<ae37e65b-43ea-42b6-80cf-d7ba5598bbdc@j14g2000vba.googlegroups.com>
In reply to	#5915

On Feb 13, 11:58 am, MitchAlsup <MitchAl...@aol.com> wrote:
> On Monday, February 13, 2012 8:46:13 AM UTC-6, Andy (Super) Glew wrote:
> > What does that look like?
>
> We had a device in the machine (much like a TLB) that kept
> track of the order of synchronization events. When a conflict
> was detected, it provided a number (instead of yes/no) and
> that number could be used by SW to proactively avoid
> interference on the subsequent attempt.
>
> So, lets say an interrupt went off and 18 tasks simultaneously
> began to access a concurrent data structure. ON the first
> attempt at most 1 task makes forward progress and 17 others

The Intel RTM (and AMD ASF) specifications do not guarantee
forward progress.  (I am guessing that this is meant to allow
simpler implementations while allowing a more complex
implementation to provide such in general.)

> receive numbers from 1 through 17. On a subsequent access to
> the CDS, those tasks can use the number to index deeped into
> the CDS and access an element that is not being simultaneously
> accessed my 16 others.

I suppose one could have a mildly bad case where another thread
(that was busy before) wins the transaction for the first
entry.  (Cache block granularity of monitoring might also
force a bloating of the data structure.)

I wonder if something like a coallescing fetch-and-add
could be useful in such cases.  (It might even be
possible to use something like coallescing fetch-and-add
to support limited conflicting updates within transactions.
Unfortunately, it seems that such would require all
transactions 'later' than a failing transaction to fail
[or a complex fix-up mechanism would be needed to provide
the correct value and re-execute any dependent operations--
yuck!].)

It seems that such a mechanism might be used to support a
low overhead lock queue when the conflict is essential.
Each thread could have a cache line that is monitored and
written by the previous thread when it releases its hold
on the lock.  If the monitor was removed (e.g., by the OS
scheduler), the OS would have to do something to avoid
lock-up (perhaps change the next pointer in the previous
entry or manage the lock queue in the OS).

> Presto, on this worst case event in lock based CDS accesss
> O(N**2), ASF would give O(3), that is near best case. In
> average CDS accesses properly programmed ASF could achieve
> O(ln(n)) instead O(n**2) where n is the number in interfering
> acccessors.

Did this provide an instruction or data address of the conflict
or was it intended for simple transactions?

> <snip>
>
> > What more do you want?
>
> Feedback from the HW that can be used by properly written SW to
> proactively avoid interference on subsequent accesses.

In addition to accessing a given entry in a queue, information
_might_ be useful to break large transactions into separate
non-conflicting pieces and use locking to handle the conflicting
portion.

Even known the degree of conflict could be useful, especially
if one thread is (nearly) guaranteed to make forward progress.
In that case, just retrying the transaction may be more
appropriate (without additional delay) may be appropriate).

[toc] | [prev] | [next] | [standalone]

#5923

From	Rick Jones <rick.jones2@hp.com>
Date	2012-02-14 03:55 +0000
Message-ID	<jhclv8$h80$1@usenet01.boi.hp.com>
In reply to	#5920

Paul A. Clayton <paaronclayton@gmail.com> wrote:
> On Feb 13, 11:58 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > On Monday, February 13, 2012 8:46:13 AM UTC-6, Andy (Super) Glew wrote:
> > > What does that look like?
> >
> > We had a device in the machine (much like a TLB) that kept
> > track of the order of synchronization events. When a conflict
> > was detected, it provided a number (instead of yes/no) and
> > that number could be used by SW to proactively avoid
> > interference on the subsequent attempt.
> >
> > So, lets say an interrupt went off and 18 tasks simultaneously
> > began to access a concurrent data structure. ON the first
> > attempt at most 1 task makes forward progress and 17 others

> The Intel RTM (and AMD ASF) specifications do not guarantee
> forward progress.  (I am guessing that this is meant to allow
> simpler implementations while allowing a more complex
> implementation to provide such in general.)

It may not really apply at all, but for some reason the lack of a
guarantee of forward progress reminds me of CSMA/CD in Ethernet and
"capture effect" which was prevalent before full-duplex became the
norm.

rick jones
-- 
the road to hell is paved with business decisions...
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com  but NOT BOTH...

[toc] | [prev] | [next] | [standalone]

#5925

From	Terje Mathisen <"terje.mathisen at tmsw.no">
Date	2012-02-14 10:30 +0100
Message-ID	<u5mp09-p191.ln1@ntp6.tmsw.no>
In reply to	#5923

Rick Jones wrote:
> It may not really apply at all, but for some reason the lack of a
> guarantee of forward progress reminds me of CSMA/CD in Ethernet and
> "capture effect" which was prevalent before full-duplex became the
> norm.

Standard Ethernet has one huge advantage:

Exponential randomized backoff is part of the spec. As long as everyone 
implements this you get stable operation and guaranteed forward progress.

(This would correspond to having a similar exponential wait in every 
busy loop retrying a lock, everyone accessing the lock using the same 
algorithm. This is not going to happen. :-()

This has of course let some unscrupulous vendors make "Gaming Ethernet" 
cards that intentionally breaks the spec, either by waiting less than 
the prescribed minimum time after a collisions and/or by retrying much 
more rapidly than allowed.)

Terje
-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]

#5927

From	Andrew Reilly <areilly---@bigpond.net.au>
Date	2012-02-14 10:49 +0000
Message-ID	<9push9Ft0mU1@mid.individual.net>
In reply to	#5925

On Tue, 14 Feb 2012 10:30:05 +0100, Terje Mathisen wrote:

> This would correspond to having a similar exponential wait in every busy
> loop retrying a lock, everyone accessing the lock using the same
> algorithm. This is not going to happen. :-(

Isn't it?  I was under the impression that lock contention was a single-
process phenomon, which is essentially the prerequisite for having 
control over that sort of thing.  Surely the only thing between this 
happening and the current situation is a successful library that does 
this by default?

Given the success of Ethernet, I am quite surprised that the same 
approach doesn't seem to be common or popular for lock contention.  
Perhaps because there are still many different use scenarios, and the 
fact that operating systems aren't famously good at reliable short time-
outs.

Cheers,

-- 
Andrew

[toc] | [prev] | [next] | [standalone]

#5930

From	Terje Mathisen <"terje.mathisen at tmsw.no">
Date	2012-02-14 13:21 +0100
Message-ID	<h60q09-sd91.ln1@ntp6.tmsw.no>
In reply to	#5927

Andrew Reilly wrote:
> On Tue, 14 Feb 2012 10:30:05 +0100, Terje Mathisen wrote:
>
>> This would correspond to having a similar exponential wait in every busy
>> loop retrying a lock, everyone accessing the lock using the same
>> algorithm. This is not going to happen. :-(
>
> Isn't it?  I was under the impression that lock contention was a single-
> process phenomon, which is essentially the prerequisite for having
> control over that sort of thing.  Surely the only thing between this
> happening and the current situation is a successful library that does
> this by default?
>
> Given the success of Ethernet, I am quite surprised that the same
> approach doesn't seem to be common or popular for lock contention.
> Perhaps because there are still many different use scenarios, and the
> fact that operating systems aren't famously good at reliable short time-
> outs.

I think the main problem is that the performance goes down so very 
quickly, i.e. any given programmer will believe that "I can do much 
better, I don't need all the overhead of a random exp backoff, I can 
simply spin a couple of times until the current owner is finished."

I.e. it works well now, on the developer's quad laptop, it won't work at 
all in live environment with 100+ cores and 1K+ processes. :-(

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]

Page 3 of 4 — ← Prev page 1 2 [3] 4 Next page →

csiph-web

Single Thread Performance

Contents

#5816

#5809

#5821

#5870

#5871

#5872

#5874

#5876

#5878

#5897

#5903

#5904

#5906

#5911

#5915

#5920

#5923

#5925

#5927

#5930