We have to be smarter than that

From	aminer <aminer@toto.net>
Newsgroups	comp.lang.pascal.misc
Subject	We have to be smarter than that
Date	2014-04-27 22:11 -0700
Organization	albasani.net
Message-ID	<ljkdcf$eso$3@news.albasani.net> (permalink)

Show all headers | View raw

Hello all,

We have to be smarter than that...

Today i have isolated the cause have caused the 33% less
throughput in my algorithm than the Chriss Thomasson algorithm that
uses the bakery algorithm...

So follow with me cause it's very important to understand what
i will explain to you.. as i have told you before, the PH.Ds papers 
about scalable algorithms  have a kind of weakness in them, cause when 
they analyze there algorithms from the point of view of the 
"performance"they don't do it efficiently, by efficient i mean they have 
to spot all the variables that generate data movements between
the caches or from the memory subsystem and the local caches, this
is how i have understood what is the cause of the 33% less throughput
in my concurrent FIFO Queue compared to the Chriss Thomasson concurrent 
Queue.. so follow with me please...

I have said yesterday that it was not the "factor" that we call 
"contention" that caused the drop in the throughput, and i have 
explained to you this by saying that i was using an efficient backoff 
mechanism , but i have thought more and more about this subject, and i 
think that even though i am using a backoff mechanism there is still 
contention there when we are using a few number of threads on a few 
number of cores... so follow with me i will explain to you my findings...


Here is my algorithm:

http://pages.videotron.com/aminer/CQueue2.htm


I have said that we have to spot all the variables that generate
data movements between the cache and between the memory subsystem
and the cache, why ? cause those data movement can cause contention
that can higher the waiting time and lower the throughput...


So look at the following code of my pop() method:

==

function TWQueue.pop(var obj:long):boolean;

var lastTail,newtemp,tail1,head1,count: long;
     i:integer;
begin

result:=true;

newTemp:=LockedIncLong(temp);
lastTail:=newtemp-1;

repeat
if fcount1^[lastTail and fMask].flag=1 then break;
sleep(0);
until (false);

obj:=getObject(lastTail);

repeat

head1:=lastTail;
tail1:=tail;
   if head1 < tail1
        then count:= (High(long)-tail1)+(1+head1)
        else count:=(head1-tail1);TTicketspinlock

if count>0 then for i:=0 to count*40 do asm pause end;

if tail=lasttail
    then
       begin
        fcount1^[lastTail and fMask].flag:=0;
        tail:=newtemp;
        exit;
       end;
sleep(0);
until false;

end;
==


So the variables that generate data movements between the caches
and the memory system and the local caches are:


The "temp" variable in the "LockedIncLong(temp)" do generate data 
movements also...

The "fcount1^[lastTail and fMask].flag" variable in the
"if fcount1^[lastTail and fMask].flag=1 then break;" do generate data 
movements also...

The getObject(lastTail) do generate data movements also...

the "tail" variable in  "tail1:=tail;" generates data movements also

and the tail variable in "tail=lasttail" generates data movement...


So notice with me please that even though  i am using a backoff mechanism
that is "if count>0 then for i:=0 to count*40 do asm pause end;"
there is 4 places before the backoff that generate data movements
between the caches or between the memory subsystem and the caches,
so imagines that we are in a high contention with 4 threads on  a 
Quadcore processor, and those threads are executing at the same time
the pop() method , so since the 3 variables and the 
"getObject(lastTail)" before the backoff do generate data movements 
beween the caches or between the memory and the local caches, when the 4 
threads will enter in parallel the pop() method those data movements 
that are serialized through the  "Bus"
will step one over the other and will cause contention(on the Bus) that 
higher the waiting time, and lower the throughtput, but since we have
3 variables and "getObject(lastTail)"  that are generating contention
just before the backoff, the contention will be a factor only
for a small number of threads on a small number of cores, when
there will be more cores and more threads the contention
will be less a factor in my algorithm and the algorithm will
score a better throughput and will equal that of the Chriss
Thomasson algorithm... since the Chriss Thomasson algorithm uses
less variables than my algorithm, 3 in total, that generate data 
movements between caches or from the memory susbsystem and the caches ,
and this has been a factor cause less variables means less contention
and less waiting time on a small number of threads and on a small number 
of core,   so the Chriss Thomasson algorithm has scored 33% more 
throughput than my algorithm , this is true only when there is fewer 
threads on fewer cores, as i have just explained, but as soon
as you run my algorithm with mores threads on more and more cores
, my algorithm will score better throughput and will equal that of the 
Chriss Thomasson algorithm.



Hope you have understood what i wanted you to understand.



Thank you,
Amine Moulay Ramdane.

Back to comp.lang.pascal.misc | Previous | Next — Next in thread | Find similar

Thread

We have to be smarter than that aminer <aminer@toto.net> - 2014-04-27 22:11 -0700
  Re: We have to be smarter than that aminer <aminer@toto.net> - 2014-04-27 23:17 -0700
  Re: We have to be smarter than that aminer <aminer@toto.net> - 2014-04-27 23:22 -0700

csiph-web