| From | mhx@iae.nl (Marcel Hendrix) |
|---|---|
| Subject | Re: Anonymous code/data/create-does |
| Newsgroups | comp.lang.forth |
| Message-ID | <12038996998434@frunobulax.edu> (permalink) |
| Date | 2013-04-07 14:00 +0200 |
| References | <77e6b08f-06d3-432d-90a7-ee3f6040e0ce@n2g2000yqg.googlegroups.com> |
| Organization | Wanadoo |
Alex McDonald <blog@rivadpm.com> writes Re: Anonymous code/data/create-does

>On Apr 6, 5:04 pm, m...@iae.nl (Marcel Hendrix) wrote:
>> Alex McDonald <b...@rivadpm.com> writes Re: Anonymous code/data/create-does
>[.]
>> We are talking about a 64 x 64 float array and maybe 4 .. 8K of code.
>> Would that not fit in the caches?
>
>Yes, but dirty cachelines are what can cause the problem; there's only
>32KB of L1 cache on the 920 and each line is 64bytes. For an 8byte
>float (or are you using 10 bytes?) that's 32KB exactly, and if 2
>threads dirty every cacheline, that's 64KB.

I was using 10 bytes, but changing it to 8 bytes just now made only a
tiny difference (10..20 ms). Remember, there are only 4 threads and
4 cores available. Why would there be a need to preempt a thread before
it is finished? Cache thrashing will of course happen because PAR
creates 4 new threads each iteration.

>I'm suspicious of your Linux figures; the Windows figures are what I
>would expect for small time slice thread schedule & a good bit of
>context switching along with cacheline issues. The Linux figures just
>look like threads are getting long time slices, are therefore not
>dirtying cachelines for other threads, and it's consequently just like
>the non-parallel runs.

I don't mind what the OS does, as long as the code runs (4 times)
faster. Again, a longer/shorter time slice would not matter as long as
there are not more threads than CPUs?

[..]

I tried a lot of things, but none of them worked. Priority tweaks
slightly improved matters in that the best time (2 times faster) happens
more often.

What did help is a change of algorithm: the thread runs over the partial
matrix a number of times, not just once. I expected doing it this way
would be less efficient (needing more iterations), but it appears to
work well for the parallel case.
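As a quick check on the cache figures quoted above, the working-set arithmetic can be sketched in standard Forth. This is only an illustration: the names `grid-n`, `line-bytes`, `l1-bytes`, `grid-bytes`, `cache-lines` and `report` are hypothetical, and it assumes a float occupies exactly 8 or 10 bytes in memory with no padding.

```forth
\ Sketch: size of the float grid versus the L1 cache.
\ Figures assumed from the quoted text: 64 x 64 grid, 32 KB L1, 64-byte lines.
64        CONSTANT grid-n       \ grid is grid-n x grid-n floats
64        CONSTANT line-bytes   \ cache line size in bytes
32 1024 * CONSTANT l1-bytes     \ L1 cache size in bytes

: grid-bytes  ( float-size -- bytes )  grid-n grid-n * * ;
: cache-lines ( bytes -- n )  line-bytes 1- + line-bytes / ;  \ rounds up

: report ( float-size -- )
   DUP . ." byte floats: "
   grid-bytes DUP . ." bytes, "
   DUP cache-lines . ." cache lines, "
   l1-bytes > IF ." exceeds L1" ELSE ." fits in L1" THEN CR ;

8 report    \ 8 byte floats: 32768 bytes, 512 cache lines, fits in L1
10 report   \ 10 byte floats: 40960 bytes, 640 cache lines, exceeds L1
```

With 8-byte floats the grid alone fills the 32 KB exactly, leaving no room for anything else, which matches the "32KB exactly" remark in the quote.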
Here are the code changes for solve4:

#96 VALUE #iters

: iterate4 ( F: -- sum )
        0e TO sum
        PAR
          STARTP  #iters 0 do  altxt0 EXECUTE +TO sum  loop  ENDP
          STARTP  #iters 0 do  altxt1 EXECUTE +TO sum  loop  ENDP
          STARTP  #iters 0 do  altxt2 EXECUTE +TO sum  loop  ENDP
          STARTP  #iters 0 do  altxt3 EXECUTE +TO sum  loop  ENDP
        ENDPAR
        sum ; PRIVATE

The results change markedly:

| ( Windows 7 )
| solve0 : noname (1)       : after 4971 iterations and 0.551 seconds elapsed.
| solve1 : anons (1)        : after 4971 iterations and 0.520 seconds elapsed.
| solve2 : parallel anons   : after 4972 iterations and 1.250 seconds elapsed.
| solve3 : nonames          : after 4971 iterations and 0.172 seconds elapsed.
| solve4 : parallel nonames : after 4973 iterations and 1.243 seconds elapsed. ok

solve0 : noname (1)       : after 4971 iterations and 0.533 seconds elapsed.
solve1 : anons (1)        : after 4971 iterations and 0.511 seconds elapsed.
solve2 : parallel anons   : after 110 iterations and 0.235 seconds elapsed.
solve3 : nonames          : after 4971 iterations and 0.174 seconds elapsed.
solve4 : parallel nonames : after 72 iterations and 0.098 seconds elapsed. ok

| ( Linux Ubuntu 12.4 )
| solve0 : noname (1)       : after 4971 iterations and 0.431 seconds elapsed.
| solve1 : anons (1)        : after 4971 iterations and 0.412 seconds elapsed.
| solve2 : parallel anons   : after 4972 iterations and 0.542 seconds elapsed.
| solve3 : nonames          : after 4971 iterations and 0.140 seconds elapsed.
| solve4 : parallel nonames : after 4973 iterations and 0.140 seconds elapsed. ok

( Linux Ubuntu 12.4 )
solve0 : noname (1)       : after 4971 iterations and 0.431 seconds elapsed.
solve1 : anons (1)        : after 4971 iterations and 0.412 seconds elapsed.
solve2 : parallel anons   : after 108 iterations and 0.181 seconds elapsed.
solve3 : nonames          : after 4971 iterations and 0.136 seconds elapsed.
solve4 : parallel nonames : after 105 iterations and 0.088 seconds elapsed.
ok

The unexplained difference between Windows and Linux seems to be gone
(it even looks like Windows is now faster :-), and the threaded version
is about twice as fast as the serial one.

Here is the performance for a 256x256 grid (instead of 64x64).
Apparently thread switching of small tasks is more expensive than I
thought. For this grid size the task manager shows that 4 cores are
running flat out. With the 64x64 grid there is about 60% system time
used.

FORTH> bench
solve0 : noname (1)       : after 81642 iterations and 145.836 seconds elapsed.
solve1 : anons (1)        : after 81642 iterations and 122.164 seconds elapsed.
solve2 : parallel anons   : after 1744 iterations and 54.847 seconds elapsed.
solve3 : nonames          : after 81642 iterations and 45.562 seconds elapsed.
solve4 : parallel nonames : after 1479 iterations and 21.092 seconds elapsed. ok

-marcel
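The "about twice" figure can be checked from the posted timings. The sketch below, in standard Forth with integer arithmetic, compares solve3 (serial nonames) with solve4 (parallel nonames). The words `sweeps/iter`, `.x100`, `speedup` and `row` are hypothetical helpers. It also assumes that each reported solve4 iteration stands for one call of iterate4, and therefore 96 sweeps per thread; the post does not state how the count is kept.

```forth
\ Sketch: speed-up and sweep counts derived from the posted timings.
\ Assumption: one reported solve4 iteration = #iters (96) sweeps per thread.
\ Assumes cells of at least 32 bits.
96 CONSTANT sweeps/iter

\ Print n/100 with two decimals, e.g. 177 prints as 1.77
: .x100   ( n -- )  0 <# # # [CHAR] . HOLD #S #> TYPE ;

\ Ratio serial/parallel, scaled by 100 and truncated
: speedup ( serial-ms parallel-ms -- x100 )  100 SWAP */ ;

: row ( serial-ms parallel-ms iterations -- )
   sweeps/iter * . ." sweeps, speed-up "
   speedup .x100 CR ;

\ solve3 time (ms), solve4 time (ms), solve4 iterations
174   98    72   row   \ Windows 7, 64x64:  6912 sweeps, speed-up 1.77
136   88    105  row   \ Linux, 64x64:      10080 sweeps, speed-up 1.54
45562 21092 1479 row   \ 256x256:           141984 sweeps, speed-up 2.16
```

Under that assumption the parallel version performs more sweeps in total than the serial runs (4971 and 81642 iterations) yet still finishes sooner, which fits the expectation that the batched scheme needs more iterations.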