Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.pascal.misc > #752
| From | aminer <aminer@toto.net> |
|---|---|
| Newsgroups | comp.lang.pascal.misc |
| Subject | We have to be smarter than that |
| Date | 2014-04-27 22:11 -0700 |
| Organization | albasani.net |
| Message-ID | <ljkdcf$eso$3@news.albasani.net> (permalink) |
Hello all,
We have to be smarter than that...
Today i have isolated the cause have caused the 33% less
throughput in my algorithm than the Chriss Thomasson algorithm that
uses the bakery algorithm...
So follow with me cause it's very important to understand what
i will explain to you.. as i have told you before, the PH.Ds papers
about scalable algorithms have a kind of weakness in them, cause when
they analyze there algorithms from the point of view of the
"performance"they don't do it efficiently, by efficient i mean they have
to spot all the variables that generate data movements between
the caches or from the memory subsystem and the local caches, this
is how i have understood what is the cause of the 33% less throughput
in my concurrent FIFO Queue compared to the Chriss Thomasson concurrent
Queue.. so follow with me please...
I have said yesterday that it was not the "factor" that we call
"contention" that caused the drop in the throughput, and i have
explained to you this by saying that i was using an efficient backoff
mechanism , but i have thought more and more about this subject, and i
think that even though i am using a backoff mechanism there is still
contention there when we are using a few number of threads on a few
number of cores... so follow with me i will explain to you my findings...
Here is my algorithm:
http://pages.videotron.com/aminer/CQueue2.htm
I have said that we have to spot all the variables that generate
data movements between the cache and between the memory subsystem
and the cache, why ? cause those data movement can cause contention
that can higher the waiting time and lower the throughput...
So look at the following code of my pop() method:
==
function TWQueue.pop(var obj:long):boolean;
var lastTail,newtemp,tail1,head1,count: long;
i:integer;
begin
result:=true;
newTemp:=LockedIncLong(temp);
lastTail:=newtemp-1;
repeat
if fcount1^[lastTail and fMask].flag=1 then break;
sleep(0);
until (false);
obj:=getObject(lastTail);
repeat
head1:=lastTail;
tail1:=tail;
if head1 < tail1
then count:= (High(long)-tail1)+(1+head1)
else count:=(head1-tail1);TTicketspinlock
if count>0 then for i:=0 to count*40 do asm pause end;
if tail=lasttail
then
begin
fcount1^[lastTail and fMask].flag:=0;
tail:=newtemp;
exit;
end;
sleep(0);
until false;
end;
==
So the variables that generate data movements between the caches
and the memory system and the local caches are:
The "temp" variable in the "LockedIncLong(temp)" do generate data
movements also...
The "fcount1^[lastTail and fMask].flag" variable in the
"if fcount1^[lastTail and fMask].flag=1 then break;" do generate data
movements also...
The getObject(lastTail) do generate data movements also...
the "tail" variable in "tail1:=tail;" generates data movements also
and the tail variable in "tail=lasttail" generates data movement...
So notice with me please that even though i am using a backoff mechanism
that is "if count>0 then for i:=0 to count*40 do asm pause end;"
there is 4 places before the backoff that generate data movements
between the caches or between the memory subsystem and the caches,
so imagines that we are in a high contention with 4 threads on a
Quadcore processor, and those threads are executing at the same time
the pop() method , so since the 3 variables and the
"getObject(lastTail)" before the backoff do generate data movements
beween the caches or between the memory and the local caches, when the 4
threads will enter in parallel the pop() method those data movements
that are serialized through the "Bus"
will step one over the other and will cause contention(on the Bus) that
higher the waiting time, and lower the throughtput, but since we have
3 variables and "getObject(lastTail)" that are generating contention
just before the backoff, the contention will be a factor only
for a small number of threads on a small number of cores, when
there will be more cores and more threads the contention
will be less a factor in my algorithm and the algorithm will
score a better throughput and will equal that of the Chriss
Thomasson algorithm... since the Chriss Thomasson algorithm uses
less variables than my algorithm, 3 in total, that generate data
movements between caches or from the memory susbsystem and the caches ,
and this has been a factor cause less variables means less contention
and less waiting time on a small number of threads and on a small number
of core, so the Chriss Thomasson algorithm has scored 33% more
throughput than my algorithm , this is true only when there is fewer
threads on fewer cores, as i have just explained, but as soon
as you run my algorithm with mores threads on more and more cores
, my algorithm will score better throughput and will equal that of the
Chriss Thomasson algorithm.
Hope you have understood what i wanted you to understand.
Thank you,
Amine Moulay Ramdane.
Back to comp.lang.pascal.misc | Previous | Next — Next in thread | Find similar
We have to be smarter than that aminer <aminer@toto.net> - 2014-04-27 22:11 -0700 Re: We have to be smarter than that aminer <aminer@toto.net> - 2014-04-27 23:17 -0700 Re: We have to be smarter than that aminer <aminer@toto.net> - 2014-04-27 23:22 -0700
csiph-web