| From | mhx@iae.nl (Marcel Hendrix) |
|---|---|
| Subject | SSE2 |
| Newsgroups | comp.lang.forth |
| Message-ID | <56151908968435@frunobulax.edu> |
| Date | 2012-07-27 16:20 +0200 |
| Organization | Wanadoo |
I had a fresh look at SSE2. Strangely, my tests fail to show any performance difference with the good ol' FPU.

The test I did is based on an SSE2 DDOT implementation (DDOT_sse2, vector multiplication). I won't show the code, it is too long and specific. What I did is derived from the MiniBLAS sources.

As SSE2 operates on 4 doubles at a time, speedups of around 4 suggest themselves. However, I can find no trace of this. An obvious reason could be that memory throughput cannot keep up with the FP units. Strange, as one would expect this hardware problem to be fixed by now.

Does anybody know if there is (currently) a hardware limitation in i7 920 CPUs? I vaguely remember having read somewhere that some corners are cut because not everybody is using 64-bit software yet, and obviously clock speed 'sells' better than actual FP performance.

    FORTH> 100000000 TEST
    Using 64 bits floats.
    mul : 1.2400000000000000000e+0011 2.055 seconds elapsed.
    mul1 : 1.2400000000000000000e+0011 3.141 seconds elapsed.
    mul2 : 1.2400000000000000000e+0011 1.230 seconds elapsed.
    mul3 : 1.2400000000000000000e+0011 1.604 seconds elapsed.
    mmul : 4.6000000000000000000e+0010 2.376 seconds elapsed.
    mmul1 : 4.6000000000000000000e+0010 2.854 seconds elapsed.
    mmul3 : 4.6000000000000000000e+0010 3.785 seconds elapsed.
    mmul4 : 4.6000000000000000000e+0010 2.933 seconds elapsed. ok

The fastest DDOT variant, MUL2, uses direct memory addressing. In general, this can only work when Forth compiles at runtime (with quotations :-).

MUL1 uses indexed addressing. This is quite a lot slower (I hate this, because I have put a lot of effort into the iForth compiler to support it fully. After 10 years, it looks like Intel has still not found a way to improve the speed of this flexible addressing scheme.)

MUL uses a pointer instead of computing an array index. This is again faster than indexed addressing, but not as fast as an immediate address.
(This is probably because every instruction, both integer and FP, counts here.)

MUL3 uses DDOT_sse2. It is worse than MUL2, not bad, but far from 400% faster.

MMUL and MMUL1 are matrix multiplication words. They use the DDOTs defined earlier. Performance is as expected.

The performance of MMUL3, which uses DDOT_sse2, is suspiciously low. It turned out that having a normal FPU operation (here DF!) at the end of a tight loop is a bad idea. (I also noticed this for DDOT_0, and it is the reason that I unrolled the loops.) The word MMUL4 therefore unrolls everything. Not surprisingly, the performance increases, but only enough to bring it in line with MUL3.

-marcel

-- -----------

    CREATE a1 #16 DFLOATS ALLOT
    CREATE a2 #16 DFLOATS ALLOT
    CREATE a3 #16 DFLOATS ALLOT

    : filla ( -- ) a1 #16 0 DO I S>F DF!+ LOOP DROP ;
    : fillb ( -- ) a2 #16 0 DO I S>F DF!+ LOOP DROP ;

    : DDOT_0 ( a1 a2 n -- ) ( F: -- r )
        0e 2 RSHIFT
        0 ?DO DF@+ swap DF@+ F* F+
              DF@+ swap DF@+ F* F+
              DF@+ swap DF@+ F* F+
              DF@+ swap DF@+ F* F+
         LOOP 2DROP ;

    : DDOT_1 ( a1 a2 n -- ) ( F: -- r )
        0e
        0 ?DO DUP I     DFLOAT[] DF@  OVER I     DFLOAT[] DF@ F* F+
              DUP I 1+  DFLOAT[] DF@  OVER I 1+  DFLOAT[] DF@ F* F+
              DUP I 2+  DFLOAT[] DF@  OVER I 2+  DFLOAT[] DF@ F* F+
              DUP I 3 + DFLOAT[] DF@  OVER I 3 + DFLOAT[] DF@ F* F+
        4 +LOOP 2DROP ;

    : mul  ( F: -- r ) a1 a2 #16 DDOT_0 ;
    : mul1 ( F: -- r ) a1 a2 #16 DDOT_1 ;

    : mul2 ( F: -- r )
        0e
        a1  0 DFLOAT[] DF@  a2  0 DFLOAT[] DF@ F* F+
        a1  1 DFLOAT[] DF@  a2  1 DFLOAT[] DF@ F* F+
        a1  2 DFLOAT[] DF@  a2  2 DFLOAT[] DF@ F* F+
        a1  3 DFLOAT[] DF@  a2  3 DFLOAT[] DF@ F* F+
        a1  4 DFLOAT[] DF@  a2  4 DFLOAT[] DF@ F* F+
        a1  5 DFLOAT[] DF@  a2  5 DFLOAT[] DF@ F* F+
        a1  6 DFLOAT[] DF@  a2  6 DFLOAT[] DF@ F* F+
        a1  7 DFLOAT[] DF@  a2  7 DFLOAT[] DF@ F* F+
        a1  8 DFLOAT[] DF@  a2  8 DFLOAT[] DF@ F* F+
        a1  9 DFLOAT[] DF@  a2  9 DFLOAT[] DF@ F* F+
        a1 10 DFLOAT[] DF@  a2 10 DFLOAT[] DF@ F* F+
        a1 11 DFLOAT[] DF@  a2 11 DFLOAT[] DF@ F* F+
        a1 12 DFLOAT[] DF@  a2 12 DFLOAT[] DF@ F* F+
        a1 13 DFLOAT[] DF@  a2 13 DFLOAT[] DF@ F* F+
        a1 14 DFLOAT[] DF@  a2 14 DFLOAT[] DF@ F* F+
        a1 15 DFLOAT[] DF@  a2 15 DFLOAT[] DF@ F* F+ ;

    : mul3 a1 a2 #16 DDOT_sse2 ; ( F: -- r )

    : fillm1 ( -- ) filla ;

    : fillm12 ( -- ) \ a2 = a1^T
        fillm1
        4 0 DO
          4 0 DO J 4 * I + a1 []DFLOAT DF@
                 I 4 * J + a2 []DFLOAT DF!
            LOOP
        LOOP ;

    : mmul ( F: -- r )
        0e
        4 0 DO
          4 0 DO J 4 * a1 []DFLOAT  I 4 * a2 []DFLOAT  4 DDOT_0
                 J 4 * I + a3 []DFLOAT FDUP DF! F+
            LOOP
        LOOP ;

    : mmul1 ( F: -- r )
        0e
        4 0 DO
          4 0 DO J 4 * a1 []DFLOAT  I 4 * a2 []DFLOAT  4 DDOT_1
                 J 4 * I + a3 []DFLOAT FDUP DF! F+
            LOOP
        LOOP ;

    : mmul3 ( F: -- r )
        0e
        4 0 DO
          4 0 DO J 4 * a1 []DFLOAT  I 4 * a2 []DFLOAT  4 DDOT_sse2
                 J 4 * I + a3 []DFLOAT FDUP DF! F+
            LOOP
        LOOP ;

    : mmul4 ( F: -- r )
        0e
        4 0 DO
          I 4 * a1 []DFLOAT      a2          4 DDOT_sse2  I 4 *     a3 []DFLOAT FDUP DF! F+
          I 4 * a1 []DFLOAT    4 a2 []DFLOAT 4 DDOT_sse2  I 4 * 1+  a3 []DFLOAT FDUP DF! F+
          I 4 * a1 []DFLOAT    8 a2 []DFLOAT 4 DDOT_sse2  I 4 * 2+  a3 []DFLOAT FDUP DF! F+
          I 4 * a1 []DFLOAT  #12 a2 []DFLOAT 4 DDOT_sse2  I 4 * 3 + a3 []DFLOAT FDUP DF! F+
        LOOP ;

    : TEST ( u -- )
        1 UMAX LOCAL #tries
        #tries 3 RSHIFT 1 UMAX LOCAL #mtrys
        CR ." Using 64 bits floats."
        filla fillb
        CR ." mul : "   TIMER-RESET 0e #tries 0 DO mul   F+ LOOP +E. SPACE .ELAPSED
        CR ." mul1 : "  TIMER-RESET 0e #tries 0 DO mul1  F+ LOOP +E. SPACE .ELAPSED
        CR ." mul2 : "  TIMER-RESET 0e #tries 0 DO mul2  F+ LOOP +E. SPACE .ELAPSED
        CR ." mul3 : "  TIMER-RESET 0e #tries 0 DO mul3  F+ LOOP +E. SPACE .ELAPSED
        CR ." mmul : "  TIMER-RESET 0e #mtrys 0 DO mmul  F+ LOOP +E. SPACE .ELAPSED
        CR ." mmul1 : " TIMER-RESET 0e #mtrys 0 DO mmul1 F+ LOOP +E. SPACE .ELAPSED
        CR ." mmul3 : " TIMER-RESET 0e #mtrys 0 DO mmul3 F+ LOOP +E. SPACE .ELAPSED
        CR ." mmul4 : " TIMER-RESET 0e #mtrys 0 DO mmul4 F+ LOOP +E. SPACE .ELAPSED ;
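
Since DDOT_sse2 itself is not shown in the post, here is a minimal sketch in C of what a packed-double dot product can look like with SSE2 intrinsics. It is not the routine used for the timings above and not the MiniBLAS code; the names `ddot_sse2_sketch` and `ddot_scalar` are made up for illustration, and the `__SSE2__` guard assumes a GCC/Clang-style compiler. An XMM register is 128 bits wide, so one packed instruction handles two doubles; the sketch uses two registers per iteration to cover four elements.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_mul_pd, ... */
#endif

/* Plain scalar reference: sum of x[i]*y[i]. */
double ddot_scalar(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Hypothetical stand-in for DDOT_sse2 (assumption, not the post's code).
   Falls back to scalar code when SSE2 is not available at compile time. */
double ddot_sse2_sketch(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    size_t i = 0;
#if defined(__SSE2__)
    __m128d acc0 = _mm_setzero_pd();   /* partial sums for elements i, i+1   */
    __m128d acc1 = _mm_setzero_pd();   /* partial sums for elements i+2, i+3 */

    /* Four doubles per iteration, two per XMM register.
       Unaligned loads, so the arrays need no special alignment. */
    for (; i + 4 <= n; i += 4) {
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(x + i),
                                           _mm_loadu_pd(y + i)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(x + i + 2),
                                           _mm_loadu_pd(y + i + 2)));
    }

    /* Horizontal sum: combine both accumulators, then add high and low lanes. */
    __m128d acc  = _mm_add_pd(acc0, acc1);
    __m128d high = _mm_unpackhi_pd(acc, acc);
    sum = _mm_cvtsd_f64(_mm_add_sd(acc, high));
#endif
    /* Scalar tail (or the whole vector if SSE2 is unavailable). */
    for (; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Called on two 16-element arrays filled with 0..15, as filla and fillb do, both functions return 1240, which matches the 1.24e+11 printed for 100000000 iterations. Note that the arithmetic here is two multiplies and two adds per four elements, against four loads from each array, so this kind of loop is quickly limited by loads rather than by the FP units.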