Re: About Lockfree_mpmc and scalability ...

From	"aminer" <aminer@videotron.ca>
Newsgroups	comp.programming.threads, comp.programming, comp.arch
Subject	Re: About Lockfree_mpmc and scalability ...
Date	2012-05-29 15:23 -0500
Organization	A noiseless patient Spider
Message-ID	<jq37om$rr2$1@dont-email.me>
References	<jq0kn9$oe3$1@dont-email.me> <jq36i1$jea$1@dont-email.me>

I have corrected some typos, please read again...

Hello,


I have received benchmarks from some people whose machines
have an L3 cache, and I have noticed that lockfree_mpmc
doesn't scale with an L3 cache either.
Why doesn't this lock-free FIFO scale? Look at the
following code on the push() side:

--

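function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin
  // Queue is full: fail before reserving a slot.
  if getlength >= fsize then
  begin
    result:=false;
    exit;
  end;
  result:=true;

  // [1] Reserve a slot with an atomic increment of the shared counter temp.
  newTemp:=LockedIncLong(temp);
  lastTail:=newTemp-1;
  setObject(lastTail,tm);

  // [2] Spin until tail reaches our slot, then advance it past the slot.
  repeat
    if CAS(tail,lasttail,newtemp) then
      exit;
    asm pause end;
  until false;
end;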

---


There are two things to look at:

[1] newTemp:=LockedIncLong(temp);

[2] CAS(tail,lasttail,newtemp)

In the 4-thread scenario, as you can see in [1], temp has to be
loaded from the L3 cache on computers that have one. On my
Intel Core 2 Quad Q6600, which doesn't have an L3 cache (just
one L2 cache for every two cores), I think it has to be loaded
from memory. That makes the four-thread test a little slower than
the single-thread version (which loads the values from the L1 cache)
on a computer with an L3 cache, and much slower on a computer
without one. The same applies to [2]: tail has to be loaded the
same way.

That's why I am getting retrograde throughput with four threads
on my Intel Core 2 Quad Q6600, and almost the same throughput
as the single thread on a computer with an L3 cache.

In the two-thread scenario, you have to do a load
from the local L2 cache in [1] and [2], and these loads make
the S (serial) part of Amdahl's equation much bigger than
the P (parallel) part. That's why the two-thread version doesn't
scale either.
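
For reference, here is Amdahl's law in its usual form, with S the serial fraction, P = 1 - S the parallel fraction, and N the number of threads. The value S = 0.9 in the example is made up for illustration, it is not a measurement of lockfree_mpmc:

```latex
\[
\text{speedup}(N) = \frac{1}{S + \dfrac{P}{N}}, \qquad S + P = 1
\]
\[
S = 0.9,\ N = 2: \qquad \text{speedup}(2) = \frac{1}{0.9 + \dfrac{0.1}{2}} = \frac{1}{0.95} \approx 1.05
\]
```

Note that this formula never gives a speedup below 1, so a retrograde result would mean that each operation also becomes more expensive as threads are added, which is consistent with the cache-transfer cost described above.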

So in general I think it's not possible to make lock-free
FIFO queues scale when the lock-free code shares variables
between the cores, because sharing variables is so expensive.
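
One way to check this claim on a given machine is a small microbenchmark. The sketch below is in C++ rather than Object Pascal and is not part of lockfree_mpmc; the names `PaddedCounter`, `run_shared`, `run_private` and `report` are hypothetical, and the 64-byte cache line size is an assumption. It compares threads that all increment one shared atomic counter (as push() does with temp) against threads that each increment their own padded counter.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. The alignment keeps each private counter
// on its own line, so the private variant shares nothing between cores.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// All threads increment one shared atomic, as push() does with temp.
std::uint64_t run_shared(int threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) counter.fetch_add(1);
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}

// Each thread increments its own padded atomic; the sum is taken at the end.
std::uint64_t run_private(int threads, std::uint64_t iters) {
    std::vector<PaddedCounter> counters(static_cast<std::size_t>(threads));
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                counters[static_cast<std::size_t>(t)].value.fetch_add(1);
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}

// Prints the wall-clock time of both variants for the given thread count.
void report(int threads, std::uint64_t iters) {
    using steady = std::chrono::steady_clock;
    auto t0 = steady::now();
    std::uint64_t shared_ops = run_shared(threads, iters);
    auto t1 = steady::now();
    std::uint64_t private_ops = run_private(threads, iters);
    auto t2 = steady::now();
    std::printf("%d thread(s): shared %.3f s (%llu ops), private %.3f s (%llu ops)\n",
                threads,
                std::chrono::duration<double>(t1 - t0).count(),
                static_cast<unsigned long long>(shared_ops),
                std::chrono::duration<double>(t2 - t1).count(),
                static_cast<unsigned long long>(private_ops));
}
```

Calling `report(1, n)` and then `report(4, n)` from `main` with a large `n` gives the two timings to compare. If the explanation above is right, the shared variant should get slower per operation with more threads while the private one should not, but the actual numbers depend on the machine.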


Thank you.

Amine Moulay Ramdane.

> "aminer" <aminer@videotron.ca> wrote in message 
> news:jq0kn9$oe3$1@dont-email.me...
>>
>> Hello all,
>>
>>
>> I have finally found why lockfree_mpmc doesn't scale...
>>
>> You can get the source code of lockfree_mpmc from:
>>
>> http://pages.videotron.com/aminer/
>>
>> So please follow along with me..
>>
>> If you take a look at the lockfree_mpmc Object Pascal
>> source code, you will find this on the push side:
>>
>>
>> ---
>>
>> function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
>> var lasttail,newtemp:longword;
>> i,j:integer;
>> begin
>>   if getlength >= fsize then
>>   begin
>>     result:=false;
>>     exit;
>>   end;
>>
>>   result:=true;
>>
>>   newTemp:=LockedIncLong(temp);
>>   lastTail:=newTemp-1;
>>
>>   setObject(lastTail,tm);
>>
>>   repeat
>>     if CAS(tail,lasttail,newtemp) then
>>       exit;
>>     asm pause end;
>>   until false;
>> end;
>>
>> ---
>>
>> When I tested the push() side with 4 threads, I noticed that
>> lockfree_mpmc doesn't scale at all. In fact I got retrograde
>> throughput, meaning less throughput than on a single-thread test,
>> and I have finally found why. When you use lockfree_mpmc in a
>> single-thread test, the CAS reads and updates the variables in the
>> level 1 cache, and it's fast. With 4 threads it gets too slow,
>> because we are reading and updating from the L2 cache and from
>> memory.
>>
>> I have tried playing with the affinity mask. With two threads on
>> different cores that read and update through the same level 2 cache,
>> it scales a little: I got more throughput than with the
>> single-thread test.
>>
>>
>> I have also modified lockfree_mpmc so that it skips the CAS, and the
>> cache traffic that comes with it, while tail and lasttail are not
>> equal, by using the following code inside the repeat until loop:
>>
>> if tail <> lasttail then
>>   continue;
>>
>> This method gives better performance.
>>
>> Here is the final code of the push() side of lockfree_mpmc.
>>
>> I think I will modify the pop() side in the same way.
>>
>>
>> ---
>> function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
>> var lasttail,newtemp:longword;
>> i,j:integer;
>> begin
>>   if getlength >= fsize then
>>   begin
>>     result:=false;
>>     exit;
>>   end;
>>
>>   result:=true;
>>
>>   newTemp:=LockedIncLong(temp);
>>   lastTail:=newTemp-1;
>>
>>   setObject(lastTail,tm);
>>
>>   repeat
>>     if tail <> lasttail then
>>       continue;
>>
>>     if CAS(tail,lasttail,newtemp) then
>>       exit;
>>     asm pause end;
>>   until false;
>> end;
>> ---
>>
>> But as I have said before, lockfree_mpmc doesn't scale when we are
>> using different cores that do NOT share the same cache. That means
>> that on my Intel Core 2 Quad Q6600 it scales only when we are using
>> 2 threads on different cores that share the same level 2 cache.
>>
>>
>>
>> Thank you.
>>
>>
>> Amine Moulay Ramdane.
>>
>
>
