Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.programming > #1661
| From | "aminer" <aminer@videotron.ca> |
|---|---|
| Newsgroups | comp.programming.threads, comp.programming, comp.arch |
| Subject | Re: About Lockfree_mpmc and scalability ... |
| Date | 2012-05-29 15:23 -0500 |
| Organization | A noiseless patient Spider |
| Message-ID | <jq37om$rr2$1@dont-email.me> (permalink) |
| References | <jq0kn9$oe3$1@dont-email.me> <jq36i1$jea$1@dont-email.me> |
Cross-posted to 3 groups.
I have corrected some typos , please read again... Hello, I have received the benchmarks from some persons that have an L3 cache, and i have noticed that lockfree_mpmc doesn't scale either on with an L3 cache. Do you know why this lock free fifo doesn't scale, cause look at the following code on the push() side: -- function TLockfree_MPMC.push(tm : tNodeQueue):boolean; var lasttail,newtemp:longword; i,j:integer; begin if getlength >= fsize then begin result:=false; exit; end; result:=true; newTemp:=LockedIncLong(temp); lastTail:=newTemp-1; setObject(lastTail,tm); repeat if CAS(tail,lasttail,newtemp) then begin exit; end; asm pause end; until false; end; --- You have two thinks: [1] newTemp:=LockedIncLong(temp); [2] CAS(tail,lasttail,newtemp) In the 4 threads scenario , as you can see in [1] temp has to be loaded from the L3 cache on computers that have an L3 cache , but on my Intel Core 2 Quad Q6600 that doesn't have an L3 cache(just an L2 cache for every two cores) i think it has to be loaded from memory, so that will make the four thread test with an L3 cache a little bit slower than the single thread version that loads the values from the L1 cache and much slower on a computer without an L3 cache. That's the same for [2] , tail has to be loaded the same way. It's why i am getting a retrograde throughput with four threads on my Intel Core 2 Quad Q6600 and almost the same thoughput as the single thread on a computer with an L3 cache. In the two thread scenario, you have to do a load from the local L2 cache in [1] and [2] and this loads makes the S part of the Amadahl equation much bigger than the P part, it's why the two threads version doens't scale either. So in general i think it's not possible to make lockfree fifo queues to scale when the lockfree code is sharing variables between the cores, cause sharing variables is so expensive.. Thank you. Amine Moulay Ramdane. "aminer" <aminer@videotron.ca> wrote in message news:jq36i1$jea$1@dont-email.me... > > Hello, > > > I have receaived the benchmarks from some persons > that have an L3 cache, and i have noticed that lockfree_mpmc > doesn't scale either on with an L3 cache. > Do you know why this lock free fifo doesn't scale, cause > look at the following code on the push() side: > > -- > > function TLockfree_MPMC.push(tm : tNodeQueue):boolean; > var lasttail,newtemp:longword; > i,j:integer; > begin > > if getlength >= fsize > then > begin > result:=false; > exit; > end; > result:=true; > newTemp:=LockedIncLong(temp); > > lastTail:=newTemp-1; > setObject(lastTail,tm); > > repeat > > if CAS(tail,lasttail,newtemp) > then > begin > exit; > end; > asm pause end; > > until false; > > > end; > > --- > > > You have two thinks: > > [1] newTemp:=LockedIncLong(temp); > > [2] CAS(tail,lasttail,newtemp) > > In the 4 threads scenario , as you can see > in [1] temp has to be loaded from the L3 cache > of the other cores on computers that have an L3 cache > but on my also from memory on my Intel Core 2 Quad Q6600 > that doesn't have an L2 cache(just an L2 cache for every two cores) , > so that will make the the four thread test with an L3 cache a little bit > slower than the single thread version and much slower without an > L3 cache compared to the single thread version that loads the values > from the L1 cache. That's the same for [2] , tail has to be loaded the > same > way. > > It's why i am getting a retrograde throughput on my > Intel Core 2 Quad Q6600 and alomost the same thoughput > as the single thread on a computer with an L3 cache. > > In the two thread scenario, you have to do a load > from the local L2 cache in [1] and [2] and this loads makes > the S part of the Amadahl equation much bigger than > the P part, it's why the two threads version doens't scale > either. > > So in general i think it's not possible to make lockfree > fifo queues to scale when the lockfree code is sharing variables > between the cores, cause sharing variables is so expensive.. > > > Thank you. > > Amine Moulay Ramdane. > > > "aminer" <aminer@videotron.ca> wrote in message > news:jq0kn9$oe3$1@dont-email.me... >> >> Hello all, >> >> >> I have finally found why lockfree_mpmc doesn't scale... >> >> you can get the the source code of lockfree_mpmc from: >> >> http://pages.videotron.com/aminer/ >> >> So please follow with me.. >> >> If you take a look at lockfree_mpmc object pascal >> source code you will read this on the push side: >> >> >> --- >> >> function TLockfree_MPMC.push(tm : tNodeQueue):boolean; >> var lasttail,newtemp:longword; >> i,j:integer; >> begin >> >> if getlength >= fsize >> then >> begin >> result:=false; >> exit; >> end; >> >> result:=true; >> >> newTemp:=LockedIncLong(temp); >> lastTail:=newTemp-1; >> >> setObject(lastTail,tm); >> >> repeat >> if CAS(tail,lasttail,newtemp) >> then >> begin >> exit; >> end; >> asm pause end; >> until false; >> end; >> >> --- >> >> When i have tested the push() side with 4 threads i have noticed that >> lockfree_mpmc >> doesn't scale at all., in fact i have got a retrograde throughput, that >> means that >> i got less throughput than on a single thread test.. and i have finally >> found >> why lockfree_mpmc doesn't scale. When you are using a lockfree_mpmc >> on a single thread test the CAS does read and update the variables on the >> level 1 cache, and it's fast, but when you are using 4 threads it does >> get >> too slow cause we are reading and updating from the L2 and from the >> memory. >> >> I have thried to play with the affinity mask and i have found that when i >> am >> using two threads on my tests and reading and updating from the same >> level 2 cache >> it does scale a little bit more and i have got more throughput with two >> threads >> on different cores and on the same level 2 cache than the single >> threadtest. >> >> >> I have also modified lockfree_mpmc to not touch the CAS and >> the cache when tail and lasttail are not equal by using the following >> code inside >> the repeat until loop: >> >> if tail <> lasttail >> then >> begin >> continue; >> end; >> >> and it does give better performance with this method >> >> here is the final code of the push() side of lockfree_mpmc.. >> >> i think i will modify the pop() side like that... >> >> >> --- >> function TLockfree_MPMC.push(tm : tNodeQueue):boolean; >> var lasttail,newtemp:longword; >> i,j:integer; >> begin >> >> if getlength >= fsize >> then >> begin >> result:=false; >> exit; >> end; >> >> result:=true; >> >> newTemp:=LockedIncLong(temp); >> lastTail:=newTemp-1; >> >> setObject(lastTail,tm); >> >> repeat >> >> if tail <> lasttail >> then >> begin >> continue; >> end; >> >> if CAS(tail,lasttail,newtemp) >> then >> begin >> exit; >> end; >> asm pause end; >> until false; >> end; >> --- >> >> But as i have said before lockfree_mpmc doesn't scale when we are >> using different cores and WE ARE NOT sharing the same cache, >> that means that on my Intel Core 2 Quad Q6600 it does scale only >> when we are using 2 threads on different cores that shares the same >> level2 cache. >> >> >> >> Thank you. >> >> >> Amine Moulay Ramdane. >> > >
Back to comp.programming | Previous | Next — Previous in thread | Next in thread | Find similar
About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-28 15:46 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-28 16:39 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-28 18:54 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:03 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:06 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:23 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:06 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:20 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 15:45 -0500
Re: About Lockfree_mpmc and scalability ... "aminer" <aminer@videotron.ca> - 2012-05-29 18:07 -0500
csiph-web