About ParallelFor() and the grainsize..

From	Wisdom90 <d@d.d>
Newsgroups	comp.lang.pascal.misc
Subject	About ParallelFor() and the grainsize..
Date	2019-12-31 17:31 -0500
Organization	A noiseless patient Spider
Message-ID	<qugi8b$an5$2@dont-email.me> (permalink)

Show all headers | View raw

Hello,


About ParallelFor() and the grainsize..

I have just read the following web page:

Parallelization: Harder than it looks

https://www.jayconrod.com/posts/29/parallelization--harder-than-it-looks


Notice that MIT's Cilk is using a divide and conquer approach to 
calculate the grainsize for the Parallel For, here it is:

-------
void run_loop(first, last) {
     if (last - first < grainsize) {
         for (int i = first; i < last; ++i) LOOP_BODY;
     } else {
         int mid = (last-first)/2;
         cilk_spawn run_loop(first, mid);
         run_loop(mid, last);
     }
}
-------


But as you are noticing if i do a simulation of it by
running my following Delphi program:

----------------------

program test;


var c,d:uint64;

begin


c:=high(uint64);

d:=0;

repeat

c:=c div 2;
d:=d+1;

until c<=1000;

writeln(c);
writeln(d);

end.

-------


So as you are noticing for a grainsize of 1000 the above Delphi program 
gives 511, that means that the Cilk's divide and conquer approach to 
calculate the grainsize for the Parallel For is "not" good.


This is why you have to take a look at my Threadpool engine with 
priorities that scales very well that is really powerful because it 
scales very well on multicore and NUMA systems, also it comes with a 
ParallelFor() that scales very well on multicores and NUMA systems, and 
take a look at its source code to notice i am calculating much more 
precisely and correctly the grainsize for ParallelFor() than the Cilk's 
divide and conquer to calculate the grainsize that is "not" good.


Read the rest:

Today i will talk about data dependency and parallel loops..

For a loop to be parallelized, every iteration must be independent of 
the others, one way to be sure of it is to execute the loop
in the direction of the incremented index of the loop and in the 
direction of the decremented index of the loop and verify if the results 
are the same. A data dependency happens if memory is modified: a loop 
has a data dependency if an iteration writes a variable that is read or 
write in another iteration of the loop. There is no data dependency if 
only one iteration reads or writes a variable or if many iterations read
the same variable without modifying it. So this is the "general" "rules".

Now there remains to know that you have for example to know how to 
construct the parallel for loop if there is an induction variable or if 
there is a reduction operation, i will give an example of them:

If we have the following (the code looks like Algol or modern Object 
Pascal):


IND:=0

For I:=1 to N
  Do
   Begin
     IND := IND + 1;
     A[I]:=B[IND];
   End;


So as you are noticing since IND is an induction variable , so
to parallelize the loop you have to do the following:


For I:=1 to N
  Do
   Begin
    IND:=(I*(I+1))/2;
    A[I]:=B[IND];
   End;


Now for the reduction operation example, you will notice that my 
invention that is my Threadpool with priorities that scales very well (
read about it below) supports a Parallel For that scales very well that 
supports "grainsize", and you will notice that the grainsize can be used 
in the ParallelFor() with a reduction operation and you will notice that 
my following powerful scalable Adder is also used in this scenario, here 
it is:

https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal


So here is the example with a reduction operation in modern Object Pascal:


TOTAL:=0.0
For I := 1 to N
Do
  Begin
    TOTAL:=TOTAL+A[I]
  End;


So with my powerful scalable Adder and with my powerful invention that 
is my ParallelFor() that scales very well, you will parallelize the 
above like this:

procedure test1(j:integer;ptr:pointer);
begin

t.add(A[J]); // "t" is my scalable Adder object

end;

// Let's suppose that N is 100000
// In the following, 10000 is the grainsize

obj.ParallelFor(1,N,test1,10000,pointer(0));

TOTAL:=T.get();


And read the following to understand how to use grainsize of my Parallel 
for that scales well:


About my ParallelFor() that scales very well that uses my efficient 
Threadpool that scales very well:

With ParallelFor() you have to:

1- Ensure Sufficient Work

Each iteration of a loop involves a certain amount of work,
so you have to ensure a sufficient amount of the work,
read below about "grainsize" that i have implemented.

2- In OpenMP we have that:

Static and Dynamic Scheduling

One basic characteristic of a loop schedule is whether it is static or 
dynamic:

• In a static schedule, the choice of which thread performs a particular
iteration is purely a function of the iteration number and number of
threads. Each thread performs only the iterations assigned to it at the
beginning of the loop.

• In a dynamic schedule, the assignment of iterations to threads can
vary at runtime from one execution to another. Not all iterations are
assigned to threads at the start of the loop. Instead, each thread
requests more iterations after it has completed the work already
assigned to it.

But with my ParallelFor() that scales very well, since it is using my 
efficient Threadpool that scales very well, so it is using Round-robin 
scheduling and it uses also work stealing, so i think that this is 
sufficient.

Read the rest:

My Threadpool engine with priorities that scales very well is really 
powerful because it scales very well on multicore and NUMA systems, also 
it comes with a ParallelFor() that scales very well on multicores and 
NUMA systems.

You can download it from:

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well


Here is the explanation of my ParallelFor() that scales very well:

I have also implemented a ParallelFor() that scales very well, here is 
the method:

procedure ParallelFor(nMin, nMax:integer;aProc: 
TParallelProc;GrainSize:integer=1;Ptr:pointer=nil;pmode:TParallelMode=pmBlocking;Priority:TPriorities=NORMAL_PRIORITY);

nMin and nMax parameters of the ParallelFor() are the minimum and 
maximum integer values of the variable of the ParallelFor() loop, aProc 
parameter of ParallelFor() is the procedure to call, and GrainSize 
integer parameter of ParallelFor() is the following:

The grainsize sets a minimum threshold for parallelization.

A rule of thumb is that grainsize iterations should take at least 
100,000 clock cycles to execute.

For example, if a single iteration takes 100 clocks, then the grainsize 
needs to be at least 1000 iterations. When in doubt, do the following 
experiment:

1- Set the grainsize parameter higher than necessary. The grainsize is 
specified in units of loop iterations.

If you have no idea of how many clock cycles an iteration might take, 
start with grainsize=100,000.

The rationale is that each iteration normally requires at least one 
clock per iteration. In most cases, step 3 will guide you to a much 
smaller value.

2- Run your algorithm.

3- Iteratively halve the grainsize parameter and see how much the 
algorithm slows down or speeds up as the value decreases.

A drawback of setting a grainsize too high is that it can reduce 
parallelism. For example, if the grainsize is 1000 and the loop has 2000 
iterations, the ParallelFor() method distributes the loop across only 
two processors, even if more are available.

And you can pass a parameter in Ptr as pointer to ParallelFor(), and you 
can set pmode parameter of to pmBlocking so that ParallelFor() is 
blocking or to pmNonBlocking so that ParallelFor() is non-blocking, and 
the Priority parameter is the priority of ParallelFor(). Look inside the 
test.pas example to see how to use it.


Thank you,
Amine Moulay Ramdane.

Back to comp.lang.pascal.misc | Previous | Next | Find similar

Thread

About ParallelFor() and the grainsize.. Wisdom90 <d@d.d> - 2019-12-31 17:31 -0500

csiph-web