SSE2

From	mhx@iae.nl (Marcel Hendrix)
Subject	SSE2
Newsgroups	comp.lang.forth
Message-ID	<56151908968435@frunobulax.edu>
Date	2012-07-27 16:20 +0200
Organization	Wanadoo

I had a fresh look at SSE2. Strangely, my tests fail to show any 
performance difference compared with the good ol' FPU.

The test I did is based on an SSE2 DDOT implementation (DDOT_sse2, a vector
dot product). I won't show the code; it is too long and too specific. 

What I did is derived from the MiniBLAS sources. As SSE2 operates on 
4 doubles at a time, a speedup of around 4 suggests itself. 
However, I can find no trace of this. An obvious reason could be that
memory throughput cannot keep up with the FP units. Strange, as one 
would expect this hardware problem to be fixed by now.
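
For reference, here is a small sketch in C (not part of the original test; the helper name is made up for illustration) of how many doubles fit in one SSE2 register. An XMM register is 128 bits wide, so a single packed SSE2 instruction works on two 64-bit doubles. A kernel handles four doubles per iteration only if it uses two registers or is unrolled, so the ceiling for one packed instruction stream is closer to 2x than 4x.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics, defines __m128d */
#endif

/* Hypothetical helper: number of doubles held by one 128-bit SSE2 register. */
static size_t sse2_doubles_per_register(void)
{
#if defined(__SSE2__)
    return sizeof(__m128d) / sizeof(double);
#else
    /* Same arithmetic without the header: 128 bits / 64 bits per double. */
    return 128 / (8 * sizeof(double));
#endif
}
```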

Does anybody know if there is (currently) a hardware limitation in 
the i7 920 CPU? I vaguely remember having read somewhere that some corners
are cut because not everybody is using 64-bit software yet, and obviously
clock speed 'sells' better than actual FP performance.

	FORTH> 100000000 TEST
	Using 64 bits floats.
	mul   :  1.2400000000000000000e+0011 2.055 seconds elapsed.
	mul1  :  1.2400000000000000000e+0011 3.141 seconds elapsed.
	mul2  :  1.2400000000000000000e+0011 1.230 seconds elapsed.
	mul3  :  1.2400000000000000000e+0011 1.604 seconds elapsed.
	mmul  :  4.6000000000000000000e+0010 2.376 seconds elapsed.
	mmul1 :  4.6000000000000000000e+0010 2.854 seconds elapsed.
	mmul3 :  4.6000000000000000000e+0010 3.785 seconds elapsed.
	mmul4 :  4.6000000000000000000e+0010 2.933 seconds elapsed. ok
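
To make the table easier to compare, the elapsed times can be turned into a cost per call. The following is only a sketch in C with made-up helper names; the times are copied from the run above, and the loop counts (100000000 for the DDOT words, 100000000 >> 3 = 12500000 for the matrix words) come from TEST in the listing at the end.

```c
#include <stddef.h>

/* Elapsed seconds copied from the run above (100000000 TEST). */
static const double ddot_seconds[4] = { 2.055, 3.141, 1.230, 1.604 }; /* mul mul1 mul2 mul3 */
static const double mmul_seconds[4] = { 2.376, 2.854, 3.785, 2.933 }; /* mmul mmul1 mmul3 mmul4 */

/* Average nanoseconds per call for a timing loop of `calls` iterations. */
static double ns_per_call(double seconds, double calls)
{
    return seconds / calls * 1e9;
}

/* How many times slower `seconds` is than a reference time. */
static double slowdown(double seconds, double reference)
{
    return seconds / reference;
}
```

For example, `ns_per_call(ddot_seconds[2], 1e8)` gives about 12.3 ns per 16-element dot product for mul2, and `slowdown(ddot_seconds[3], ddot_seconds[2])` gives about 1.30 for mul3 relative to mul2.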

The fastest DDOT variant, MUL2, uses direct memory addressing. In 
general, this can only work when Forth compiles at runtime (with 
quotations :-).

MUL1 is using indexed addressing. This is quite a lot slower (I hate 
this, because I have put a lot of effort into the iForth compiler to 
support it fully. After 10 years, it looks like Intel has still not 
found a way to improve the speed of this flexible addressing scheme.)

MUL is using a pointer instead of computing an array index. This is 
again faster than indexed addressing, but not as fast as an immediate 
address. (This is probably because every instruction, both integer and 
fp, counts here). 
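
The three addressing styles can be sketched in C as follows. This is an illustration only, not a translation of what iForth generates: the function names are invented, and an optimizing C compiler may well turn all three into similar machine code. The arrays are filled with 0..15 as filla and fillb do.

```c
#include <assert.h>
#include <stddef.h>

/* Global arrays, like a1 and a2 in the Forth source. */
static double a1[16], a2[16];

/* Counterpart of filla/fillb: both arrays hold 0, 1, ..., 15. */
void fill_0_to_15(void)
{
    for (size_t i = 0; i < 16; i++) { a1[i] = (double)i; a2[i] = (double)i; }
}

/* Pointer walking, roughly the DDOT_0 style (DF@+), unrolled 4x. */
double ddot_pointer(const double *x, const double *y, size_t n)
{
    assert(n % 4 == 0);
    double sum = 0.0;
    for (size_t k = n / 4; k > 0; k--) {
        sum += *x++ * *y++;
        sum += *x++ * *y++;
        sum += *x++ * *y++;
        sum += *x++ * *y++;
    }
    return sum;
}

/* Indexed addressing, roughly the DDOT_1 style: base + index for each load. */
double ddot_indexed(const double *x, const double *y, size_t n)
{
    assert(n % 4 == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        sum += x[i]     * y[i];
        sum += x[i + 1] * y[i + 1];
        sum += x[i + 2] * y[i + 2];
        sum += x[i + 3] * y[i + 3];
    }
    return sum;
}

/* Fixed addresses, roughly the mul2 style: every operand address is a constant. */
double ddot_fixed16(void)
{
    double sum = 0.0;
    sum += a1[0]  * a2[0];   sum += a1[1]  * a2[1];
    sum += a1[2]  * a2[2];   sum += a1[3]  * a2[3];
    sum += a1[4]  * a2[4];   sum += a1[5]  * a2[5];
    sum += a1[6]  * a2[6];   sum += a1[7]  * a2[7];
    sum += a1[8]  * a2[8];   sum += a1[9]  * a2[9];
    sum += a1[10] * a2[10];  sum += a1[11] * a2[11];
    sum += a1[12] * a2[12];  sum += a1[13] * a2[13];
    sum += a1[14] * a2[14];  sum += a1[15] * a2[15];
    return sum;
}
```

All three compute the same sum; only the way the operand addresses are formed differs.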

MUL3 is using DDOT_sse2. It is slower than MUL2 (1.604 s versus 1.230 s): 
not bad, but far from 400% faster.
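
Since DDOT_sse2 itself is not shown, here is a hedged sketch in C of what a packed SSE2 dot product typically looks like. It is an assumption about the general technique, not the MiniBLAS or iForth code: the function name is made up, two accumulators are used so that four doubles are handled per iteration, and a scalar fallback is compiled when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Sketch: dot product with two packed accumulators (2 doubles each). */
double ddot_sse2_sketch(const double *x, const double *y, size_t n)
{
    assert(n % 4 == 0);
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 4) {
        /* Unaligned loads, so the arrays need no special alignment. */
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(x + i),
                                           _mm_loadu_pd(y + i)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(x + i + 2),
                                           _mm_loadu_pd(y + i + 2)));
    }
    /* Horizontal sum: combine the accumulators, then add the two lanes. */
    double lanes[2];
    _mm_storeu_pd(lanes, _mm_add_pd(acc0, acc1));
    return lanes[0] + lanes[1];
}
#else
/* Scalar fallback with the same interface when SSE2 is unavailable. */
double ddot_sse2_sketch(const double *x, const double *y, size_t n)
{
    assert(n % 4 == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i] * y[i];
    return sum;
}
#endif
```

With 16-element vectors the loop body runs only four times, while the horizontal sum and the hand-over of the result are paid on every call. That fixed cost may be part of why the gain over scalar code is small here.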

MMUL and MMUL1 are matrix multiplication words. They use the 
DDOTs defined earlier. Performance is as expected. 

The performance of MMUL3, which uses DDOT_sse2, is suspiciously low. 
It turned out that having a normal FPU operation (here DF!) at the end 
of a tight loop is a bad idea. (I also noticed this for DDOT_0, and 
it is the reason that I unrolled the loops.) The word MMUL4 therefore 
unrolls everything. Not surprisingly, the performance increases, but 
only enough to bring it in line with MUL3.

-marcel

-- -----------

CREATE  a1 #16 DFLOATS ALLOT
CREATE  a2 #16 DFLOATS ALLOT
CREATE  a3 #16 DFLOATS ALLOT

: filla ( -- ) a1 #16 0 DO  I S>F DF!+  LOOP DROP ; 
: fillb ( -- ) a2 #16 0 DO  I S>F DF!+  LOOP DROP ; 

: DDOT_0 ( a1 a2 n -- ) ( F: -- r ) 
	0e  2 RSHIFT
	0 ?DO  	DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
	 LOOP  2DROP ;

: DDOT_1 ( a1 a2 n -- ) ( F: -- r ) 
	0e  
	  0 ?DO	DUP I     DFLOAT[] DF@  OVER I     DFLOAT[] DF@ F* F+  
		DUP I 1+  DFLOAT[] DF@  OVER I 1+  DFLOAT[] DF@ F* F+  
		DUP I 2+  DFLOAT[] DF@  OVER I 2+  DFLOAT[] DF@ F* F+  
		DUP I 3 + DFLOAT[] DF@  OVER I 3 + DFLOAT[] DF@ F* F+  
	4 +LOOP 2DROP ;

: mul   ( F: -- r ) a1 a2  #16 DDOT_0 ;
: mul1  ( F: -- r ) a1 a2  #16 DDOT_1 ;

: mul2  ( F: -- r ) 
	0e  
	a1  0 DFLOAT[] DF@  a2  0 DFLOAT[] DF@ F* F+  
	a1  1 DFLOAT[] DF@  a2  1 DFLOAT[] DF@ F* F+  
	a1  2 DFLOAT[] DF@  a2  2 DFLOAT[] DF@ F* F+  
	a1  3 DFLOAT[] DF@  a2  3 DFLOAT[] DF@ F* F+  

	a1  4 DFLOAT[] DF@  a2  4 DFLOAT[] DF@ F* F+  
	a1  5 DFLOAT[] DF@  a2  5 DFLOAT[] DF@ F* F+  
	a1  6 DFLOAT[] DF@  a2  6 DFLOAT[] DF@ F* F+  
	a1  7 DFLOAT[] DF@  a2  7 DFLOAT[] DF@ F* F+  

	a1  8 DFLOAT[] DF@  a2  8 DFLOAT[] DF@ F* F+  
	a1  9 DFLOAT[] DF@  a2  9 DFLOAT[] DF@ F* F+  
	a1 10 DFLOAT[] DF@  a2 10 DFLOAT[] DF@ F* F+  
	a1 11 DFLOAT[] DF@  a2 11 DFLOAT[] DF@ F* F+  

	a1 12 DFLOAT[] DF@  a2 12 DFLOAT[] DF@ F* F+  
	a1 13 DFLOAT[] DF@  a2 13 DFLOAT[] DF@ F* F+  
	a1 14 DFLOAT[] DF@  a2 14 DFLOAT[] DF@ F* F+  
	a1 15 DFLOAT[] DF@  a2 15 DFLOAT[] DF@ F* F+  ;

: mul3  ( F: -- r ) a1 a2 #16 DDOT_sse2 ;

: fillm1  ( -- ) filla ; 

: fillm12 ( -- ) \ a2 = a1^T
	fillm1
	4 0 DO 4 0 DO   J 4 * I + a1 []DFLOAT DF@
			I 4 * J + a2 []DFLOAT DF!
		 LOOP 
	  LOOP ;

: mmul  ( F: -- r )
	0e
	4 0 DO  4 0 DO  J 4 *     a1 []DFLOAT
			I 4 *     a2 []DFLOAT 4 DDOT_0
			J 4 * I + a3 []DFLOAT FDUP DF!  
			F+
		  LOOP 
	LOOP ;

: mmul1 ( F: -- r )
	0e
	4 0 DO  4 0 DO  J 4 *     a1 []DFLOAT
			I 4 *     a2 []DFLOAT 4 DDOT_1
			J 4 * I + a3 []DFLOAT FDUP DF!  
			F+
		  LOOP 
	LOOP ;

: mmul3 ( F: -- r )
	0e
	4 0 DO  4 0 DO  J 4 *     a1 []DFLOAT
			I 4 *     a2 []DFLOAT 4 DDOT_sse2  
			J 4 * I + a3 []DFLOAT FDUP DF!  
			F+
		  LOOP 
	 LOOP ;

: mmul4 ( F: -- r )
	0e
	4 0 DO  
			I 4 *     a1 []DFLOAT
			          a2          4 DDOT_sse2  
			I 4 *     a3 []DFLOAT FDUP DF!  F+

			I 4 *     a1 []DFLOAT
			  4       a2 []DFLOAT 4 DDOT_sse2  
			I 4 * 1+  a3 []DFLOAT FDUP DF!  F+

			I 4 *     a1 []DFLOAT
			  8       a2 []DFLOAT 4 DDOT_sse2  
			I 4 * 2+  a3 []DFLOAT FDUP DF!  F+

			I 4 *     a1 []DFLOAT
			#12       a2 []DFLOAT 4 DDOT_sse2  
			I 4 * 3 + a3 []DFLOAT FDUP DF!  F+
	 LOOP ;

: TEST ( u -- )
	1 UMAX LOCAL #tries
	#tries 3 RSHIFT 1 UMAX LOCAL #mtrys
	CR ." Using 64 bits floats."
	filla  fillb
	CR ." mul   : " TIMER-RESET 0e #tries 0 DO  mul   F+  LOOP +E. SPACE .ELAPSED
	CR ." mul1  : " TIMER-RESET 0e #tries 0 DO  mul1  F+  LOOP +E. SPACE .ELAPSED
	CR ." mul2  : " TIMER-RESET 0e #tries 0 DO  mul2  F+  LOOP +E. SPACE .ELAPSED
	CR ." mul3  : " TIMER-RESET 0e #tries 0 DO  mul3  F+  LOOP +E. SPACE .ELAPSED 

	CR ." mmul  : " TIMER-RESET 0e #mtrys 0 DO  mmul  F+  LOOP +E. SPACE .ELAPSED
	CR ." mmul1 : " TIMER-RESET 0e #mtrys 0 DO  mmul1 F+  LOOP +E. SPACE .ELAPSED 
	CR ." mmul3 : " TIMER-RESET 0e #mtrys 0 DO  mmul3 F+  LOOP +E. SPACE .ELAPSED 
	CR ." mmul4 : " TIMER-RESET 0e #mtrys 0 DO  mmul4 F+  LOOP +E. SPACE .ELAPSED ;
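
The checksums printed by TEST can be cross-checked with a plain scalar model. The following C sketch uses invented function names. It assumes, as the listing indicates, that TEST calls filla and fillb (not fillm12), so a1 and a2 both hold 0..15 and the matrix words multiply each row of a1 with each row of a2.

```c
#include <stddef.h>

/* Plain scalar dot product used as the reference. */
static double ddot_ref(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i] * y[i];
    return sum;
}

/* Value returned by one call of mul: dot product of 0..15 with itself. */
double mul_checksum(void)
{
    double a1[16], a2[16];
    for (size_t i = 0; i < 16; i++) { a1[i] = (double)i; a2[i] = (double)i; }
    return ddot_ref(a1, a2, 16);
}

/* Value returned by one call of mmul: sum of all 16 entries of a3,
   where a3[j*4+i] is row j of a1 dotted with row i of a2. */
double mmul_checksum(void)
{
    double a1[16], a2[16], a3[16];
    double sum = 0.0;
    for (size_t i = 0; i < 16; i++) { a1[i] = (double)i; a2[i] = (double)i; }
    for (size_t j = 0; j < 4; j++) {
        for (size_t i = 0; i < 4; i++) {
            a3[j * 4 + i] = ddot_ref(a1 + j * 4, a2 + i * 4, 4);
            sum += a3[j * 4 + i];
        }
    }
    return sum;
}
```

One call of mul yields 1240, and 1240 times 100000000 tries is 1.24e11. One call of mmul yields 3680, and 3680 times 12500000 tries is 4.6e10. Both agree with the output shown above.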
