Re: Anonymous code/data/create-does

From mhx@iae.nl (Marcel Hendrix)
Subject Re: Anonymous code/data/create-does
Newsgroups comp.lang.forth
Message-ID <12038996998434@frunobulax.edu> (permalink)
Date 2013-04-07 14:00 +0200
References <77e6b08f-06d3-432d-90a7-ee3f6040e0ce@n2g2000yqg.googlegroups.com>
Organization Wanadoo


Alex McDonald <blog@rivadpm.com> writes Re: Anonymous code/data/create-does

>On Apr 6, 5:04 pm, m...@iae.nl (Marcel Hendrix) wrote:
>> Alex McDonald <b...@rivadpm.com> writes Re: Anonymous code/data/create-does
>[.]
>> We are talking about a 64 x 64 float array and maybe 4 .. 8K of code.
>> Would that not fit in the caches?
>
>Yes, but dirty cachelines are what can cause the problem; there's only
>32KB of L1 cache on the 920 and each line is 64bytes. For an 8byte
>float (or are you using 10 bytes?) that's 32KB exactly, and if 2
>threads dirty every cacheline, that's 64KB.

I was using 10 bytes, but changing it to 8 bytes just now made only 
a tiny difference (10..20 ms).
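
As a quick check of the cache arithmetic quoted above, the footprint of the 64 x 64 array can be computed directly. This is only a sketch in standard Forth; the word grid-bytes is a hypothetical helper for illustration and is not part of the benchmark code.

```forth
\ Bytes needed for an n x n array of floats of the given size.
\ grid-bytes is a hypothetical helper, not from the benchmark.
: grid-bytes ( n bytes/float -- bytes )  OVER * * ;

64  8 grid-bytes .   \ 8-byte floats:  32768 bytes, exactly 32 KB
64 10 grid-bytes .   \ 10-byte floats: 40960 bytes, 40 KB
```

So with 8-byte floats the array alone is the same size as the 32 KB L1 cache mentioned above, and with 10-byte floats it already exceeds it.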

Remember, there are only 4 threads and 4 cores available. Why would
there be a need to preempt a thread before it is finished? Cache thrashing
will of course happen because PAR creates 4 new threads each iteration.

>I'm suspicious of your Linux figures; the Windows figures are what I
>would expect for small time slice thread schedule & a good bit of
>context switching along with cacheline issues. The Linux figures just
>look like threads are getting long time slices, are therefore not
>dirtying cachelines for other threads, and it's consequently just like
>the non-parallel runs.

I don't mind what the OS does, as long as the code runs (4 times) faster.
Again, a longer or shorter time slice should not matter as long as there
are no more threads than CPUs?

[..]

I tried a lot of things, but none of them worked. Priority tweaks 
slightly improved matters in that the best time (2 times faster) 
happens more often.

What did help is a change of algorithm: each thread runs over its partial 
matrix a number of times, not just once. I expected doing it this way
would be less efficient (needing more iterations), but it appears to 
work well for the parallel case. 

Here are the code changes for solve4:

\ #iters: how many times each thread runs over its partial matrix
#96 VALUE #iters
: iterate4 ( F: -- sum ) 
	0e TO sum
	\ four parallel sections, one per partial matrix
	PAR
	  STARTP  #iters 0 do altxt0 EXECUTE +TO sum loop  ENDP
	  STARTP  #iters 0 do altxt1 EXECUTE +TO sum loop  ENDP
	  STARTP  #iters 0 do altxt2 EXECUTE +TO sum loop  ENDP
	  STARTP  #iters 0 do altxt3 EXECUTE +TO sum loop  ENDP
	ENDPAR sum ; PRIVATE

The results change markedly (lines marked | are the previous results):

| ( Windows 7 )
| solve0 : noname (1)       : after 4971 iterations and 0.551 seconds elapsed.
| solve1 : anons (1)        : after 4971 iterations and 0.520 seconds elapsed.
| solve2 : parallel anons   : after 4972 iterations and 1.250 seconds elapsed.
| solve3 : nonames          : after 4971 iterations and 0.172 seconds elapsed.
| solve4 : parallel nonames : after 4973 iterations and 1.243 seconds elapsed. ok

( Windows 7 )
  solve0 : noname (1)       : after 4971 iterations and 0.533 seconds elapsed.
  solve1 : anons (1)        : after 4971 iterations and 0.511 seconds elapsed.
  solve2 : parallel anons   : after 110 iterations and 0.235 seconds elapsed.
  solve3 : nonames          : after 4971 iterations and 0.174 seconds elapsed.
  solve4 : parallel nonames : after 72 iterations and 0.098 seconds elapsed. ok

| ( Linux Ubuntu 12.4 )
| solve0 : noname (1)       : after 4971 iterations and 0.431 seconds elapsed.
| solve1 : anons (1)        : after 4971 iterations and 0.412 seconds elapsed.
| solve2 : parallel anons   : after 4972 iterations and 0.542 seconds elapsed.
| solve3 : nonames          : after 4971 iterations and 0.140 seconds elapsed.
| solve4 : parallel nonames : after 4973 iterations and 0.140 seconds elapsed. ok

( Linux Ubuntu 12.4 )
  solve0 : noname (1)       : after 4971 iterations and 0.431 seconds elapsed.
  solve1 : anons (1)        : after 4971 iterations and 0.412 seconds elapsed.
  solve2 : parallel anons   : after 108 iterations and 0.181 seconds elapsed.
  solve3 : nonames          : after 4971 iterations and 0.136 seconds elapsed.
  solve4 : parallel nonames : after 105 iterations and 0.088 seconds elapsed. ok

The unexplained difference between Windows and Linux seems to be gone (it
even looks like Windows is now faster :-), and the threaded version is
about twice as fast as the serial one.

Here is the performance for a 256x256 grid (instead of 64x64). Apparently
thread switching of small tasks is more expensive than I thought. For this 
grid size the task manager shows that 4 cores are running flat out. With the 
64x64 grid there is about 60% system time used.

FORTH> bench
solve0 : noname (1)       : after 81642 iterations and 145.836 seconds elapsed.
solve1 : anons (1)        : after 81642 iterations and 122.164 seconds elapsed.
solve2 : parallel anons   : after 1744 iterations and 54.847 seconds elapsed.
solve3 : nonames          : after 81642 iterations and 45.562 seconds elapsed.
solve4 : parallel nonames : after 1479 iterations and 21.092 seconds elapsed. ok
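
To put numbers on "about twice as fast", the new solve3 (serial nonames) and solve4 (parallel nonames) timings above can be turned into speedup ratios. The following is a minimal sketch in standard Forth using integer arithmetic on milliseconds; speedup% is a hypothetical helper for illustration, not a word from the benchmark.

```forth
\ Speedup of the parallel run over the serial run, in percent.
\ speedup% is a hypothetical helper; times are in milliseconds.
: speedup% ( serial-ms parallel-ms -- percent )  >R 100 * R> / ;

  174    98 speedup% .   \ Windows 7, 64x64:  177
  136    88 speedup% .   \ Linux, 64x64:      154
45562 21092 speedup% .   \ 256x256 grid:      216
```

That is roughly 1.8x and 1.5x for the 64x64 grid and about 2.2x for the 256x256 grid, against the factor of 4 hoped for with 4 cores.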

-marcel

