| Path | csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!eternal-september.org!feeder.eternal-september.org!mx04.eternal-september.org!.POSTED!not-for-mail |
|---|---|
| From | "aminer" <aminer@videotron.ca> |
| Newsgroups | comp.programming, comp.programming.threads |
| Subject | Re: CAS operations and scalability... |
| Date | Thu, 13 Sep 2012 19:13:40 -0500 |
| Organization | A noiseless patient Spider |
| Lines | 154 |
| Message-ID | <k2tpbb$9le$1@dont-email.me> |
| References | <k1dira$jt9$1@dont-email.me> |
| Injection-Date | Thu, 13 Sep 2012 23:13:48 +0000 (UTC) |
| Injection-Info | mx04.eternal-september.org; posting-host="c43ca82f9e8d62a602307fe9d2e9b807"; logging-data="9902"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18R900qxKc8yT2kZ4vvh01u" |
| X-MimeOLE | Produced By Microsoft MimeOLE V6.00.2900.5512 |
| X-RFC2646 | Format=Flowed; Response |
| X-Newsreader | Microsoft Outlook Express 6.00.2900.5512 |
| Cancel-Lock | sha1:13wZw1Nli/hwJF0nZzyYk7EKOTY= |
| X-Priority | 3 |
| X-MSMail-Priority | Normal |
| Xref | csiph.com comp.programming:2213 comp.programming.threads:1071 |
I wrote:

> And about lockfree_mpmc (a lockfree fifo queue), I have done some tests
> and it's not scaling, because when you are using a single thread some
> variables are updated locally in the L1 cache, but using multiple
> threads those variables are loaded from the L2 cache, and it's more
> expensive to load them from the L2 cache, and this does generate much
> more contention.

I mean that it's more expensive to load them from the L2 cache than from
the L1 cache.

And when I wrote "it does generate a lot of contention", I meant that the
CAS is a single point of access that can generate a lot of contention. In
parallelhashlist I am using lock striping and distributing the accesses
over many points (i.e. many rwlocks, many CASes), and this lowers the
contention, which is better for scalability.

What Dmitry Vyukov is doing in his distributed rwlock is also
distributing the accesses over many rwlocks, using the same method as
lock striping, and this lowers the contention, which is better for
scalability. That is why my parallelhashlist and the Distributed
Reader-Writer Mutex 1.03 are scaling very well.

And if you have noticed, there is still a weakness in Dmitry Vyukov's C++
Distributed Reader-Writer Mutex: since he is using
GetCurrentProcessorNumber(), he limits the array of rwlocks to the number
of available cores, and I think this is not good, because if you have
many more threads than available cores and there is high contention,
performance will degrade. So I have decided to change that in my
implementation of the Distributed Reader-Writer Mutex 1.03, and I have
used a variable number of rwlocks/MREWs, so that you can lower the
contention further, which is better for scalability.

Thank you,
Amine Moulay Ramdane.


"aminer" <aminer@videotron.ca> wrote in message
news:k1dira$jt9$1@dont-email.me...
>
> Hello,
>
> When the CAS operation "goes on the bus", use of CAS can impair
> scalability,
> but CAS can be accomplished locally -- that is, with no bus
> transactions -- and then it can scale.
>
> If we then change the CAS operation that goes on the bus to a normal
> store, you'll also see a similar slow-down in terms of coherency bus
> traffic; CAS isn't appreciably different from a normal store. Also, the
> LOCK prefix causes the LOCK# signal to be asserted, acquiring exclusive
> access to the bus. This doesn't scale, of course.
>
> As you have noticed, I have written parallelhashlist (a parallel
> hashtable); you can find parallelhashlist here:
>
> http://pages.videotron.com/aminer/
>
> It's a parallel hashtable with O(1) best-case and O(log(n)) worst-case
> access that uses lock striping and lightweight MREWs
> (multiple-readers-exclusive-writer); this allows multiple threads to
> write and read concurrently. parallelhashlist also maintains an
> independent counter, which counts the number of entries, for each
> segment of the hashtable, and uses a lock for each counter; this is
> also for better scalability. And parallelhashlist is scaling very well,
> but since it is a parallel hashtable, the possibility of contention is
> low, so why do I need the distributed reader-writer lock of Dmitry
> Vyukov inside my parallelhashlist?
>
> Other than that, I have done some tests with the lightweight MREW that
> I am using inside my parallelhashlist, and I have also done some tests
> with my lockfree mpmc fifo queue, and what I think is that the CAS is
> generating a lot of contention; this is why the lightweight MREW and my
> lockfree_mpmc are not scaling, but parallelhashlist is scaling very
> well because I am using lock striping, which lowers the contention.
> What Dmitry Vyukov is doing in his distributed rwlock is lowering the
> contention using the same lock-striping method that I am using inside
> parallelhashlist; that is why it is scaling. But there is still a
> possibility of contention in his distributed rwlock that can hurt
> scalability if there are too many threads and not a sufficient number
> of rwlocks in Dmitry's distributed rwlock to lower the contention.
>
> I have tested parallelhashlist (a parallel hashtable that I have
> implemented) with four threads on a quad core, and it is scaling very
> well on both reads and writes.
>
> Also, I have done some scalability tests on my parallelsort library,
> and I have come to the conclusion that parallel heapsort scales better
> than parallel quicksort, because the P part (of the Amdahl equation) is
> bigger in parallel heapsort than in parallel quicksort: parallel
> heapsort does more work in the parallel part, which is why it scales
> better than parallel quicksort. But parallel quicksort is still faster
> than parallel heapsort and parallel merge sort in my tests on a quad
> core processor.
>
> And about lockfree_mpmc (a lockfree fifo queue): I have done some
> tests, and it is not scaling, because when you are using a single
> thread some variables are updated locally in the L1 cache, but using
> multiple threads those variables are loaded from the L2 cache, and it
> is more expensive to load them from the L2 cache, and this generates
> much more contention.
>
> But even though lockfree_mpmc is not scalable, you can increase the P
> (parallel) part by doing more of the same: increase the volume of data
> processed by the P part (and therefore the percentage p of time spent
> in computing). This is Gustafson's Law, and you will get more
> scalability.
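The Gustafson's Law argument in the quoted paragraph can be made
concrete. In the scaled-speedup form, if s is the serial fraction of the
wall-clock time measured on the parallel run, the speedup on n
processors is S(n) = s + (1 - s) * n, so growing the per-item work
shrinks s and pushes S(n) toward n. A tiny C++ helper (the function name
is mine, not from any library):

```cpp
// Gustafson's scaled speedup: s is the serial fraction of the run time
// measured on the n-processor machine (0 <= s <= 1).
double gustafson_speedup(double s, int n) {
    // All-serial (s == 1) gives 1.0; perfectly parallel (s == 0) gives n.
    // e.g. s = 0.1 on a quad core: 0.1 + 0.9 * 4, approximately 3.7.
    return s + (1.0 - s) * n;
}
```

Doing more work per dequeued item raises the volume handled by the
parallel part, i.e. lowers s, which is exactly the route to scalability
the post is describing.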
> For example, I have used the IntToStr() function on each of the four
> threads (on a quad core) in my lockfree_mpmc test programs to convert
> from an integer to a string, so I have increased the P (parallel) part
> and I have got more scalability. This is Gustafson's Law, and you have
> to remember Gustafson's Law; this is very important.
>
> You can download my parallel libraries from
>
> http://pages.videotron.com/aminer/
>
> Sincerely,
> Amine Moulay Ramdane.
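The fix described at the top of this post, decoupling the number of
rwlock slots from the core count, might look roughly like this in C++.
This is a hedged sketch of the distributed reader-writer mutex idea, not
Dmitry Vyukov's code and not the actual Distributed Reader-Writer Mutex
1.03; every name in it is invented.

```cpp
#include <cstddef>
#include <functional>
#include <shared_mutex>
#include <thread>
#include <vector>

// Readers are spread over `nslots` independent rwlocks, so reader-side
// cache-line contention drops as nslots grows. Unlike a scheme keyed on
// GetCurrentProcessorNumber(), nslots is a free parameter and can
// exceed the core count when there are many more threads than cores.
class DistributedRWLock {
    struct alignas(64) Slot {   // one cache line per slot,
        std::shared_mutex m;    // to avoid false sharing
    };
    std::vector<Slot> slots_;

public:
    explicit DistributedRWLock(std::size_t nslots) : slots_(nslots) {}

    // Each reader hashes its thread id to one slot and takes only that
    // slot's lock in shared mode; a given thread always maps to the
    // same slot.
    std::shared_mutex& reader_slot() {
        std::size_t h =
            std::hash<std::thread::id>{}(std::this_thread::get_id());
        return slots_[h % slots_.size()].m;
    }

    // A writer must acquire every slot exclusively, always in the same
    // order to avoid deadlock, so the structure suits read-mostly use.
    void write_lock()   { for (auto& s : slots_) s.m.lock(); }
    void write_unlock() { for (auto& s : slots_) s.m.unlock(); }
};
```

A reader would do `std::shared_lock lk(rw.reader_slot());`, while a
writer pairs `write_lock()` with `write_unlock()`. The trade-off is the
one the thread discusses: cheap, distributed reads in exchange for
writes that touch every slot.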