Re: About Lockfree_mpmc and scalability ...

From	"aminer" <aminer@videotron.ca>
Newsgroups	comp.programming.threads, comp.programming, comp.arch
Subject	Re: About Lockfree_mpmc and scalability ...
Date	2012-05-29 18:07 -0500
Organization	A noiseless patient Spider
Message-ID	<jq3hb1$osc$1@dont-email.me> (permalink)
References	<jq0kn9$oe3$1@dont-email.me> <jq37j1$qoh$1@dont-email.me> <jq391s$4j2$1@dont-email.me>

Cross-posted to 3 groups.

Show all headers | View raw

Hello,

Even the following code inside push() method:

if getlength >= fsize
 then
 begin
 result:=false;
 exit;
 end;

If you hav e noticed getlength() method is  sharing
variables between the cores and making the 4 threads
test much slower than the single thread test on
Intel Core 2 Quad Q6600, and i have tested it on my computer,
it makes it much slower cause the cache to cache tranfer is costly
on Intel Core 2 Quad Q6600, but on new architechtures that have
and L3 cache and hypertransport, it gives the same throughput
on single thread and  four threads but it doesn't scale with four threads.
 .
So the following parts are sharing variables between the cores and making
the 4 threads test much slower than the single thread test.on Intel Core 2 
Quad Q6600:

[1]  getlength()

[1] newTemp:=LockedIncLong(temp);  ...

[2] and CAS(tail,lasttail,newtemp) also...


It's why i have told you that the CAS  and sharing variables
are expensive.



Thank you.

Amine Moulay Ramdane.





"aminer" <aminer@videotron.ca> wrote in message 
news:jq391s$4j2$1@dont-email.me...
>I wrote:
>> In the two thread scenario, you have to do a load
>> from the local L2 cache in [1] and [2] and this loads makes
>> the S part of the Amadahl equation much bigger than
>> the P part, it's why the two threads version doens't scale
>> either.
>>
>> So in general i think it's not possible to make lockfree
>> fifo queues to scale when the lockfree code is sharing variables
>> between the cores, cause sharing variables is so expensive..
>
>
> I mean the CAS and the sharing of the variables make the S part of
> lockfree_mpmc much bigger than the S part and from the Amadahl equestion
> this makes lockfree_mpmc not scalable.
>
> So in general i think it's not possible to make lockfree
> fifo queues to scale when the lockfree code is sharing variables
> between the cores and you are using  CASes, cause sharing variables
> and using CAS are so expensive..
>
>
> Thank you.
>
> Amine Moulay Ramdane.
>
>
>
>
>
>
> "aminer" <aminer@videotron.ca> wrote in message 
> news:jq37j1$qoh$1@dont-email.me...
>>
>>
>> I have corrected some typos , please read again...
>>
>> Hello,
>>
>>
>> I have received the benchmarks from some persons
>> that have an L3 cache, and i have noticed that lockfree_mpmc
>> doesn't scale either on with an L3 cache.
>> Do you know why this lock free fifo doesn't scale, cause
>> look at the following code on the push() side:
>>
>> --
>>
>> function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
>> var lasttail,newtemp:longword;
>> i,j:integer;
>> begin
>>
>> if getlength >= fsize
>> then
>> begin
>> result:=false;
>> exit;
>> end;
>> result:=true;
>> newTemp:=LockedIncLong(temp);
>>
>> lastTail:=newTemp-1;
>> setObject(lastTail,tm);
>>
>> repeat
>>
>> if CAS(tail,lasttail,newtemp)
>> then
>> begin
>> exit;
>> end;
>> asm pause end;
>>
>> until false;
>>
>>
>> end;
>>
>> ---
>>
>>
>> You have two thinks:
>>
>> [1] newTemp:=LockedIncLong(temp);
>>
>> [2] CAS(tail,lasttail,newtemp)
>>
>> In the 4 threads scenario , as you can see in [1] temp has to be
>> loaded from the L3 cache on computers that have an L3 cache ,
>> but on my  Intel Core 2 Quad Q6600 that doesn't have an
>> L3 cache(just an L2 cache for every two cores)  i think it has to
>> be loaded  from memory, so that will make  the four thread test
>> with an L3 cache a little bit slower than the single thread version
>> that loads the values from the L1 cache and much slower on a computer
>> without an L3 cache. That's the same for [2] , tail has to be loaded the
>> same way.
>>
>> It's why i am getting a retrograde throughput with four threads
>> on my Intel Core 2 Quad Q6600 and almost the same thoughput
>> as the single thread on a computer with an L3 cache.
>>
>> In the two thread scenario, you have to do a load
>> from the local L2 cache in [1] and [2] and this loads makes
>> the S part of the Amadahl equation much bigger than
>> the P part, it's why the two threads version doens't scale
>> either.
>>
>> So in general i think it's not possible to make lockfree
>> fifo queues to scale when the lockfree code is sharing variables
>> between the cores, cause sharing variables is so expensive..
>>
>>
>> Thank you.
>>
>> Amine Moulay Ramdane.
>>
>>
>>
>>
>> "aminer" <aminer@videotron.ca> wrote in message 
>> news:jq0kn9$oe3$1@dont-email.me...
>>>
>>> Hello all,
>>>
>>>
>>> I have finally found why lockfree_mpmc doesn't scale...
>>>
>>> you can get the the source code of lockfree_mpmc from:
>>>
>>> http://pages.videotron.com/aminer/
>>>
>>> So please follow with me..
>>>
>>> If you take a look at lockfree_mpmc object pascal
>>> source code you will read this on the push side:
>>>
>>>
>>> ---
>>>
>>> function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
>>> var lasttail,newtemp:longword;
>>> i,j:integer;
>>> begin
>>>
>>> if getlength >= fsize
>>> then
>>>  begin
>>>    result:=false;
>>>   exit;
>>> end;
>>>
>>> result:=true;
>>>
>>> newTemp:=LockedIncLong(temp);
>>> lastTail:=newTemp-1;
>>>
>>> setObject(lastTail,tm);
>>>
>>> repeat
>>> if CAS(tail,lasttail,newtemp)
>>> then
>>>   begin
>>>    exit;
>>>   end;
>>> asm pause end;
>>> until false;
>>> end;
>>>
>>> ---
>>>
>>> When i have tested the push() side with 4 threads i have noticed that 
>>> lockfree_mpmc
>>> doesn't scale at all., in fact i have got a retrograde throughput, that 
>>> means that
>>> i got less throughput than on a single thread  test.. and i have finally 
>>> found
>>> why lockfree_mpmc doesn't scale.  When you are using a lockfree_mpmc
>>> on a single thread test the CAS does read and update the variables on 
>>> the
>>> level 1 cache, and it's fast, but when you are using 4 threads it does 
>>> get
>>> too slow cause we are reading and updating from the L2  and from the 
>>> memory.
>>>
>>> I have thried to play with the affinity mask and i have found that when 
>>> i am
>>> using two threads on my tests and reading and updating from the same 
>>> level 2 cache
>>> it does scale a little bit more and i have got more throughput with two 
>>> threads
>>> on different cores and on the same level 2 cache than the single 
>>> threadtest.
>>>
>>>
>>> I have also modified lockfree_mpmc to not touch the CAS and
>>> the cache when tail and lasttail are not equal by using the following 
>>> code inside
>>> the repeat until loop:
>>>
>>> if tail <> lasttail
>>> then
>>> begin
>>> continue;
>>> end;
>>>
>>> and it does give  better  performance with this method
>>>
>>> here is the final code of the push() side of lockfree_mpmc..
>>>
>>> i think i will modify the pop() side like that...
>>>
>>>
>>> ---
>>> function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
>>> var lasttail,newtemp:longword;
>>> i,j:integer;
>>> begin
>>>
>>> if getlength >= fsize
>>> then
>>>  begin
>>>    result:=false;
>>>   exit;
>>> end;
>>>
>>> result:=true;
>>>
>>> newTemp:=LockedIncLong(temp);
>>> lastTail:=newTemp-1;
>>>
>>> setObject(lastTail,tm);
>>>
>>> repeat
>>>
>>> if tail <> lasttail
>>> then
>>> begin
>>>  continue;
>>> end;
>>>
>>> if CAS(tail,lasttail,newtemp)
>>> then
>>>   begin
>>>    exit;
>>>   end;
>>> asm pause end;
>>> until false;
>>> end;
>>> ---
>>>
>>> But as i have said before lockfree_mpmc doesn't scale when we are
>>> using different cores and WE ARE NOT  sharing the same cache,
>>> that means that on my Intel Core 2 Quad Q6600 it does scale only
>>> when we are using 2 threads on different cores that shares the same
>>> level2 cache.
>>>
>>>
>>>
>>> Thank you.
>>>
>>>
>>> Amine Moulay Ramdane.
>>>
>>
>>
>
>

Back to comp.programming | Previous | Next — Previous in thread | Find similar

Thread

About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-28 15:46 -0500
  Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-28 16:39 -0500
    Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-28 18:54 -0500
  Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:03 -0500
    Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:06 -0500
    Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:23 -0500
  Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:06 -0500
  Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:20 -0500
    Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:45 -0500
      Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 18:07 -0500

csiph-web