Groups > comp.lang.forth > #14457 > unrolled thread

SSE2

Started by	mhx@iae.nl (Marcel Hendrix)
First post	2012-07-27 16:20 +0200
Last post	2012-07-27 16:31 +0000
Articles	14 — 5 participants

Back to article view | Back to comp.lang.forth

  SSE2 mhx@iae.nl (Marcel Hendrix) - 2012-07-27 16:20 +0200
    Re: SSE2 Paul Rubin <no.email@nospam.invalid> - 2012-07-27 08:03 -0700
      Re: SSE2 mhx@iae.nl (Marcel Hendrix) - 2012-07-28 15:28 +0200
    Re: SSE2 anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-27 14:56 +0000
      Re: SSE2 Paul Rubin <no.email@nospam.invalid> - 2012-07-27 08:56 -0700
        Re: SSE2 anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-27 16:28 +0000
        Re: SSE2 David Kuehling <dvdkhlng@gmx.de> - 2012-07-28 01:47 +0200
          Re: SSE2 mhx@iae.nl (Marcel Hendrix) - 2012-07-31 00:28 +0200
            Re: SSE2 David Kuehling <dvdkhlng@gmx.de> - 2012-08-01 11:27 +0200
              Re: SSE2 mhx@iae.nl (Marcel Hendrix) - 2012-08-01 20:23 +0200
              Re: SSE2 mhx@iae.nl (Marcel Hendrix) - 2012-08-05 21:00 +0200
      Re: SSE2 mhx@iae.nl (Marcel Hendrix) - 2012-07-28 15:28 +0200
        Re: SSE2 anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-31 14:29 +0000
    Re: SSE2 albert@cherry.spenarnc.xs4all.nl (Albert van der Horst) - 2012-07-27 16:31 +0000

#14457 — SSE2

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-07-27 16:20 +0200
Subject	SSE2
Message-ID	<56151908968435@frunobulax.edu>

I had a fresh look at SSE2. Strangely, my tests fail to show up any 
performance difference with the good ol' FPU.

The test I did is based on an SSE2 DDOT implementation (DDOT_sse2, vector
multiplication). I won't show the code, it is too long and specific. 

What I did is derived from the MiniBLAS sources. As SSE2 operates on 
4 doubles at a time, speedups of around 4 are suggesting themselves. 
However, I can find no trace of this. An obvious reason could be that
memory throughput can not keep up with the FP units. Strange, as one 
would expect this hardware problem to be fixed by now.

Does anybody know if there is (currently) a hardware limitation in 
i7 920 CPU's? I vaguely remember having read somewhere that some corners
are cut because not everybody is using 64bit software yet, and obviously
clock speed 'sells' better than actual FP performance.

	FORTH> 100000000 TEST
	Using 64 bits floats.
	mul   :  1.2400000000000000000e+0011 2.055 seconds elapsed.
	mul1  :  1.2400000000000000000e+0011 3.141 seconds elapsed.
	mul2  :  1.2400000000000000000e+0011 1.230 seconds elapsed.
	mul3  :  1.2400000000000000000e+0011 1.604 seconds elapsed.
	mmul  :  4.6000000000000000000e+0010 2.376 seconds elapsed.
	mmul1 :  4.6000000000000000000e+0010 2.854 seconds elapsed.
	mmul3 :  4.6000000000000000000e+0010 3.785 seconds elapsed.
	mmul4 :  4.6000000000000000000e+0010 2.933 seconds elapsed. ok

The fastest DDOT variant, MUL2, uses direct memory addressing. In 
general, this can only work when Forth compiles at runtime (with 
quotations :-).

MUL1 is using indexed addressing. This is quite a lot slower (I hate 
this, because I have put a lot of effort in the iForth compiler to 
support it fully. After 10 years, it looks like Intel has still not 
found a way to improve the speed of this flexible addressing scheme.)

MUL is using a pointer instead of computing an array index. This is 
again faster than indexed addressing, but not as fast as an immediate 
address. (This is probably because every instruction, both integer and 
fp, counts here). 

MUL3 is using DDOT_sse2. It is worse than MUL2, not bad, but far 
from 400% faster.

The MMUL and MMUL1 are matrix multiplication words. They use the 
DDOTs defined earlier. Performance is as expected. 

The performance of MMUL3, which uses DDOT_sse2 is suspiciously low. 
It proved that having a normal FPU operation (here DF!)  at the end 
of a tight loop is a bad idea. (I also noticed this for DDOT_0 and 
it is the reason that I unrolled the loops). The word MMUL4 therefore 
unrolls everything. Not unsurprising, the perfomance increases, but 
only so much as to be inline with MUL3.

-marcel

-- -----------

CREATE  a1 #16 DFLOATS ALLOT
CREATE  a2 #16 DFLOATS ALLOT
CREATE  a3 #16 DFLOATS ALLOT

: filla ( -- ) a1 #16 0 DO  I S>F DF!+  LOOP DROP ; 
: fillb ( -- ) a2 #16 0 DO  I S>F DF!+  LOOP DROP ; 

: DDOT_0 ( a1 a2 n -- ) ( F: -- r ) 
	0e  2 RSHIFT
	0 ?DO  	DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
	 LOOP  2DROP ;

: DDOT_1 ( a1 a2 n -- ) ( F: -- r ) 
	0e  
	  0 ?DO	DUP I     DFLOAT[] DF@  OVER I     DFLOAT[] DF@ F* F+  
		DUP I 1+  DFLOAT[] DF@  OVER I 1+  DFLOAT[] DF@ F* F+  
		DUP I 2+  DFLOAT[] DF@  OVER I 2+  DFLOAT[] DF@ F* F+  
		DUP I 3 + DFLOAT[] DF@  OVER I 3 + DFLOAT[] DF@ F* F+  
	4 +LOOP 2DROP ;

: mul   ( F: -- r ) a1 a2  #16 DDOT_0 ;
: mul1  ( F: -- r ) a1 a2  #16 DDOT_1 ;

: mul2  ( F: -- r ) 
	0e  
	a1  0 DFLOAT[] DF@  a2  0 DFLOAT[] DF@ F* F+  
	a1  1 DFLOAT[] DF@  a2  1 DFLOAT[] DF@ F* F+  
	a1  2 DFLOAT[] DF@  a2  2 DFLOAT[] DF@ F* F+  
	a1  3 DFLOAT[] DF@  a2  3 DFLOAT[] DF@ F* F+  

	a1  4 DFLOAT[] DF@  a2  4 DFLOAT[] DF@ F* F+  
	a1  5 DFLOAT[] DF@  a2  5 DFLOAT[] DF@ F* F+  
	a1  6 DFLOAT[] DF@  a2  6 DFLOAT[] DF@ F* F+  
	a1  7 DFLOAT[] DF@  a2  7 DFLOAT[] DF@ F* F+  

	a1  8 DFLOAT[] DF@  a2  8 DFLOAT[] DF@ F* F+  
	a1  9 DFLOAT[] DF@  a2  9 DFLOAT[] DF@ F* F+  
	a1 10 DFLOAT[] DF@  a2 10 DFLOAT[] DF@ F* F+  
	a1 11 DFLOAT[] DF@  a2 11 DFLOAT[] DF@ F* F+  

	a1 12 DFLOAT[] DF@  a2 12 DFLOAT[] DF@ F* F+  
	a1 13 DFLOAT[] DF@  a2 13 DFLOAT[] DF@ F* F+  
	a1 14 DFLOAT[] DF@  a2 14 DFLOAT[] DF@ F* F+  
	a1 15 DFLOAT[] DF@  a2 15 DFLOAT[] DF@ F* F+  ;

: mul3  a1 a2 #16 DDOT_sse2 ; ( F: -- r )

: fillm1  ( -- ) filla ; 

: fillm12 ( -- ) \ a2 = a1^T
	fillm1
	4 0 DO 4 0 DO   J 4 * I + a1 []DFLOAT DF@
			I 4 * J + a2 []DFLOAT DF!
		 LOOP 
	  LOOP ;

: mmul  ( F: -- r )
	0e
	4 0 DO  4 0 DO  J 4 *     a1 []DFLOAT
			I 4 *     a2 []DFLOAT 4 DDOT_0
			J 4 * I + a3 []DFLOAT FDUP DF!  
			F+
		  LOOP 
	LOOP ;

: mmul1 ( F: -- r )
	0e
	4 0 DO  4 0 DO  J 4 *     a1 []DFLOAT
			I 4 *     a2 []DFLOAT 4 DDOT_1
			J 4 * I + a3 []DFLOAT FDUP DF!  
			F+
		  LOOP 
	LOOP ;

: mmul3 ( F: -- r )
	0e
	4 0 DO  4 0 DO  J 4 *     a1 []DFLOAT
			I 4 *     a2 []DFLOAT 4 DDOT_sse2  
			J 4 * I + a3 []DFLOAT FDUP DF!  
			F+
		  LOOP 
	 LOOP ;

: mmul4 ( F: -- r )
	0e
	4 0 DO  
			I 4 *     a1 []DFLOAT
			          a2          4 DDOT_sse2  
			I 4 *     a3 []DFLOAT FDUP DF!  F+

			I 4 *     a1 []DFLOAT
			  4       a2 []DFLOAT 4 DDOT_sse2  
			I 4 * 1+  a3 []DFLOAT FDUP DF!  F+

			I 4 *     a1 []DFLOAT
			  8       a2 []DFLOAT 4 DDOT_sse2  
			I 4 * 2+  a3 []DFLOAT FDUP DF!  F+

			I 4 *     a1 []DFLOAT
			#12       a2 []DFLOAT 4 DDOT_sse2  
			I 4 * 3 + a3 []DFLOAT FDUP DF!  F+
	 LOOP ;

: TEST ( u -- )
	1 UMAX LOCAL #tries
	#tries 3 RSHIFT 1 UMAX LOCAL #mtrys
	CR ." Using 64 bits floats."
	filla  fillb
	CR ." mul   : " TIMER-RESET 0e #tries 0 DO  mul   F+  LOOP +E. SPACE .ELAPSED
	CR ." mul1  : " TIMER-RESET 0e #tries 0 DO  mul1  F+  LOOP +E. SPACE .ELAPSED
	CR ." mul2  : " TIMER-RESET 0e #tries 0 DO  mul2  F+  LOOP +E. SPACE .ELAPSED
	CR ." mul3  : " TIMER-RESET 0e #tries 0 DO  mul3  F+  LOOP +E. SPACE .ELAPSED 

	CR ." mmul  : " TIMER-RESET 0e #mtrys 0 DO  mmul  F+  LOOP +E. SPACE .ELAPSED
	CR ." mmul1 : " TIMER-RESET 0e #mtrys 0 DO  mmul1 F+  LOOP +E. SPACE .ELAPSED 
	CR ." mmul3 : " TIMER-RESET 0e #mtrys 0 DO  mmul3 F+  LOOP +E. SPACE .ELAPSED 
	CR ." mmul4 : " TIMER-RESET 0e #mtrys 0 DO  mmul4 F+  LOOP +E. SPACE .ELAPSED ;

[toc] | [next] | [standalone]

#14459

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-27 08:03 -0700
Message-ID	<7x394dnt48.fsf@ruckus.brouhaha.com>
In reply to	#14457

mhx@iae.nl (Marcel Hendrix) writes:
> What I did is derived from the MiniBLAS sources. As SSE2 operates on 
> 4 doubles at a time, speedups of around 4 are suggesting themselves. 

Scalar operations use 2 doubles at a time so that suggests speedup of 2
rather than 4.  But, a parallel 64 bit multiplier is probably about the
most expensive resource in the ALU, so there may be just one of them.
In that case it would be the bottleneck and DDOT would be about the same
as 2 scalar multiplications.  You could try a single precision version
that would give more speedup.

Are you using the DPPD (double precision dot product) instruction from
SSE4.1?  I'd expect that to be fastest.

Later i7's (Sandy Bridge) have the 256 bit AVX instructions that may be
faster still, and the forthcoming Haswell will have fused MAC.  For
maximum speed of course you want to use a GPU.

[toc] | [prev] | [next] | [standalone]

#14480

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-07-28 15:28 +0200
Message-ID	<18078807968435@frunobulax.edu>
In reply to	#14459

Paul Rubin <no.email@nospam.invalid> writes Re: SSE2

> mhx@iae.nl (Marcel Hendrix) writes:
>> What I did is derived from the MiniBLAS sources. As SSE2 operates on 
>> 4 doubles at a time, speedups of around 4 are suggesting themselves. 

> Scalar operations use 2 doubles at a time so that suggests speedup of 2
> rather than 4.  But, a parallel 64 bit multiplier is probably about the
> most expensive resource in the ALU, so there may be just one of them.
> In that case it would be the bottleneck and DDOT would be about the same
> as 2 scalar multiplications.  You could try a single precision version
> that would give more speedup.

Sorry, of course I should have said 200%, not 400%. SSE2 packs 2 doubles
in a "word".

I followed your suggestion and tried single-precision (SF@ etc.).

> Are you using the DPPD (double precision dot product) instruction from
> SSE4.1?  I'd expect that to be fastest.

It seems to me that this instruction is just more convenient, not faster. 
It still works on 2 doubles.

> Later i7's (Sandy Bridge) have the 256 bit AVX instructions that may be
> faster still, and the forthcoming Haswell will have fused MAC.  For
> maximum speed of course you want to use a GPU.

Interesting suggestions, but unfortunately it would be real work to 
update my 64 bit assembler for AVX. Going for a GPU is a very interesting 
option. I see they do double-precision, have at least 6 GB of memory and
can so (some) special functions. But would they be a 'transparent' 
resource?

The original problem was that I ran DDOT for a vector size of 16. This is too 
small for the code I used (it goes flat out with 16/32 floats / block).

Conclusions (in the context of iForth):

1) There is almost no difference between SINGLE and DOUBLE for smallish
   vectors (16). For 1024 element vectors the SSE2 speed difference is 
   about 1.5. I assume it will reach 2 eventually.

2) SSE2 works for large vectors and gives about 4x more performance for
   SDOT. It also works for DDOT and gives more than 2x the performance
   that can be had otherwise. 

3) The matrix code does not speed up as much as I expected. It should
be proportional to the S/DDOT speed, but apparently there is extra 
overhead that does not go down very far (yet) with 1024 element vectors.

Most of my work is with quite small vectors (3 - 6 elements). However, 
SPICE jobs should speed up a great deal with SSE and friends. 

-marcel

PS: In the below printout only MUL3 and MMUL3 use SSE2.

-- ------------------------------------------
(*
 * LANGUAGE    : ANS Forth with extensions
 * PROJECT     : Forth Environments
 * DESCRIPTION : Array code compiler
 * CATEGORY    : Example 
 * AUTHOR      : Marcel Hendrix 
 * LAST CHANGE : July 26, 2012, Marcel Hendrix 
 *)



	NEEDS -miscutil

	REVISION -arraymulc "--- Array test          Version 0.00 ---"

	PRIVATES

DOC
(*
	FORTH> 10000000 TEST
	Using 64 bits floats.
	Vector size = 16
	mul   :  1.2400000000000000000e+0010 0.218 seconds elapsed.
	mul1  :  1.2400000000000000000e+0010 0.314 seconds elapsed.
	mul2  :  1.2400000000000000000e+0010 0.198 seconds elapsed.
	mul3  :  1.2400000000000000000e+0010 0.161 seconds elapsed.
	mmul  :  4.6000000000000000000e+0009 0.238 seconds elapsed.
	mmul1 :  4.6000000000000000000e+0009 0.286 seconds elapsed.
	mmul3 :  4.6000000000000000000e+0009 0.358 seconds elapsed. ok

	FORTH> 10000000 TEST
	Using 32 bits floats.
	Vector size = 16
	mul   :  1.2400000000000000000e+0010 0.214 seconds elapsed.
	mul1  :  1.2400000000000000000e+0010 0.314 seconds elapsed.
	mul2  :  1.2400000000000000000e+0010 0.150 seconds elapsed.
	mul3  :  1.2400000000000000000e+0010 0.134 seconds elapsed.
	mmul  :  4.6000000000000000000e+0009 0.239 seconds elapsed.
	mmul1 :  4.6000000000000000000e+0009 0.284 seconds elapsed.
	mmul3 :  4.6000000000000000000e+0009 0.356 seconds elapsed. ok


	FORTH> 1000000 TEST
	Using 32 bits floats.
	Vector size = 1024
	mul   :  3.5738982400000000000e+0014 1.117 seconds elapsed.
	mul1  :  3.5738982400000000000e+0014 1.837 seconds elapsed.
	mul3  :  3.5738979200000000000e+0014 0.233 seconds elapsed.
	mmul  :  1.0719948800000000000e+0015 4.995 seconds elapsed.
	mmul1 :  1.0719948800000000000e+0015 7.768 seconds elapsed.
	mmul3 :  1.0719948800000000000e+0015 2.526 seconds elapsed. ok


	FORTH> 1000000 TEST
	Using 64 bits floats.
	Vector size = 1024
	mul   :  3.5738982400000000000e+0014 1.067 seconds elapsed.
	mul1  :  3.5738982400000000000e+0014 1.788 seconds elapsed.
	mul3  :  3.5738982400000000000e+0014 0.358 seconds elapsed.
	mmul  :  1.0719948800000000000e+0015 4.913 seconds elapsed.
	mmul1 :  1.0719948800000000000e+0015 7.784 seconds elapsed.
	mmul3 :  1.0719948800000000000e+0015 2.952 seconds elapsed. ok
*)
ENDDOC

4          =: /rsz
/rsz DUP * =: /size
FALSE       =: 64bitsf?

64bitsf?
  [IF]

CREATE  a1 /size DFLOATS ALLOT
CREATE  a2 /size DFLOATS ALLOT
CREATE  a3 /size DFLOATS ALLOT

: filla ( -- ) a1 /size 0 DO  I S>F DF!+  LOOP DROP ; 
: fillb ( -- ) a2 /size 0 DO  I S>F DF!+  LOOP DROP ; 

: DDOT_0 ( a1 a2 n -- ) ( F: -- r ) 
	0e  2 RSHIFT
	0 ?DO  	DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
		DF@+ swap DF@+ F* F+  
	 LOOP  2DROP ;

: DDOT_1 ( a1 a2 n -- ) ( F: -- r ) 
	0e  
	  0 ?DO	DUP I     DFLOAT[] DF@  OVER I     DFLOAT[] DF@ F* F+  
		DUP I 1+  DFLOAT[] DF@  OVER I 1+  DFLOAT[] DF@ F* F+  
		DUP I 2+  DFLOAT[] DF@  OVER I 2+  DFLOAT[] DF@ F* F+  
		DUP I 3 + DFLOAT[] DF@  OVER I 3 + DFLOAT[] DF@ F* F+  
	4 +LOOP 2DROP ;

: mul   ( F: -- r ) a1 a2  /size DDOT_0 ;
: mul1  ( F: -- r ) a1 a2  /size DDOT_1 ;

: mul2  ( F: -- r ) 
	0e  /size #16 <> ?EXIT
	a1  0 DFLOAT[] DF@  a2  0 DFLOAT[] DF@ F* F+  
	a1  1 DFLOAT[] DF@  a2  1 DFLOAT[] DF@ F* F+  
	a1  2 DFLOAT[] DF@  a2  2 DFLOAT[] DF@ F* F+  
	a1  3 DFLOAT[] DF@  a2  3 DFLOAT[] DF@ F* F+  

	a1  4 DFLOAT[] DF@  a2  4 DFLOAT[] DF@ F* F+  
	a1  5 DFLOAT[] DF@  a2  5 DFLOAT[] DF@ F* F+  
	a1  6 DFLOAT[] DF@  a2  6 DFLOAT[] DF@ F* F+  
	a1  7 DFLOAT[] DF@  a2  7 DFLOAT[] DF@ F* F+  

	a1  8 DFLOAT[] DF@  a2  8 DFLOAT[] DF@ F* F+  
	a1  9 DFLOAT[] DF@  a2  9 DFLOAT[] DF@ F* F+  
	a1 10 DFLOAT[] DF@  a2 10 DFLOAT[] DF@ F* F+  
	a1 11 DFLOAT[] DF@  a2 11 DFLOAT[] DF@ F* F+  

	a1 12 DFLOAT[] DF@  a2 12 DFLOAT[] DF@ F* F+  
	a1 13 DFLOAT[] DF@  a2 13 DFLOAT[] DF@ F* F+  
	a1 14 DFLOAT[] DF@  a2 14 DFLOAT[] DF@ F* F+  
	a1 15 DFLOAT[] DF@  a2 15 DFLOAT[] DF@ F* F+  ;

: mul3  a1 a2 /size DDOT_sse2 ; ( F: -- r )

: fillm1  ( -- ) filla ; 

: fillm12 ( -- ) \ a2 = a1^T
	fillm1
	/rsz 0 DO 
		/rsz 0 DO   
			   J /rsz * I + a1 []DFLOAT DF@
			   I /rsz * J + a2 []DFLOAT DF!
		    LOOP 
	  LOOP ;

: mmul  ( F: -- r )
	0e
	/rsz 0 DO  
		/rsz 0 DO  J /rsz *     a1 []DFLOAT
			   I /rsz *     a2 []DFLOAT /rsz DDOT_0
			   J /rsz * I + a3 []DFLOAT FDUP DF!  
			   F+
		     LOOP 
	     LOOP ;

: mmul1 ( F: -- r )
	0e
	/rsz 0 DO  
		/rsz 0 DO  J /rsz *     a1 []DFLOAT
			   I /rsz *     a2 []DFLOAT /rsz DDOT_1
			   J /rsz * I + a3 []DFLOAT FDUP DF!  
			   F+
		     LOOP 
	LOOP ;

: mmul3 ( F: -- r )
	0e
	/rsz 0 DO  
		/rsz 0 DO  J /rsz *     a1 []DFLOAT
			   I /rsz *     a2 []DFLOAT /rsz DDOT_sse2  
			   J /rsz * I + a3 []DFLOAT FDUP DF!  
			   F+
		     LOOP 
	 LOOP ;

[ELSE]

CREATE a1 /size SFLOATS ALLOT
CREATE a2 /size SFLOATS ALLOT
CREATE a3 /size SFLOATS ALLOT

: filla ( -- ) a1 /size 0 DO  I S>F SF!+  LOOP DROP ; 
: fillb ( -- ) a2 /size 0 DO  I S>F SF!+  LOOP DROP ; 

: SDOT_0 ( a1 a2 n -- ) ( F: -- r ) 
	0e  2 RSHIFT
	0 ?DO  	SF@+ swap SF@+ F* F+  
		SF@+ swap SF@+ F* F+  
		SF@+ swap SF@+ F* F+  
		SF@+ swap SF@+ F* F+  
	 LOOP  2DROP ;

: SDOT_1 ( a1 a2 n -- ) ( F: -- r ) 
	0e  
	  0 ?DO	DUP I     SFLOAT[] SF@  OVER I     SFLOAT[] SF@ F* F+  
		DUP I 1+  SFLOAT[] SF@  OVER I 1+  SFLOAT[] SF@ F* F+  
		DUP I 2+  SFLOAT[] SF@  OVER I 2+  SFLOAT[] SF@ F* F+  
		DUP I 3 + SFLOAT[] SF@  OVER I 3 + SFLOAT[] SF@ F* F+  
	4 +LOOP 2DROP ;

: mul   ( F: -- r ) a1 a2  /size SDOT_0 ;
: mul1  ( F: -- r ) a1 a2  /size SDOT_1 ;

: mul2  0e  /size #16 <> ?EXIT
	a1  0 SFLOAT[] SF@  a2  0 SFLOAT[] SF@ F* F+  
	a1  1 SFLOAT[] SF@  a2  1 SFLOAT[] SF@ F* F+  
	a1  2 SFLOAT[] SF@  a2  2 SFLOAT[] SF@ F* F+  
	a1  3 SFLOAT[] SF@  a2  3 SFLOAT[] SF@ F* F+  

	a1  4 SFLOAT[] SF@  a2  4 SFLOAT[] SF@ F* F+  
	a1  5 SFLOAT[] SF@  a2  5 SFLOAT[] SF@ F* F+  
	a1  6 SFLOAT[] SF@  a2  6 SFLOAT[] SF@ F* F+  
	a1  7 SFLOAT[] SF@  a2  7 SFLOAT[] SF@ F* F+  

	a1  8 SFLOAT[] SF@  a2  8 SFLOAT[] SF@ F* F+  
	a1  9 SFLOAT[] SF@  a2  9 SFLOAT[] SF@ F* F+  
	a1 10 SFLOAT[] SF@  a2 10 SFLOAT[] SF@ F* F+  
	a1 11 SFLOAT[] SF@  a2 11 SFLOAT[] SF@ F* F+  

	a1 12 SFLOAT[] SF@  a2 12 SFLOAT[] SF@ F* F+  
	a1 13 SFLOAT[] SF@  a2 13 SFLOAT[] SF@ F* F+  
	a1 14 SFLOAT[] SF@  a2 14 SFLOAT[] SF@ F* F+  
	a1 15 SFLOAT[] SF@  a2 15 SFLOAT[] SF@ F* F+ ;

: mul3  a1 a2 /size SDOT_sse2 ; ( F: -- r )

: fillm1  ( -- ) filla ; 

: fillm12 ( -- ) \ a2 = a1^T
	fillm1
	/rsz 0 DO  /rsz 0 DO   	J /rsz * I + a1 []SFLOAT SF@
				I /rsz * J + a2 []SFLOAT SF!
		        LOOP 
	  LOOP ;

: mmul  ( F: -- r )
	0e
	/rsz 0 DO  /rsz 0 DO  	J /rsz *     a1 []SFLOAT
				I /rsz *     a2 []SFLOAT /rsz SDOT_0
				J /rsz * I + a3 []SFLOAT FDUP SF!  
				F+
		        LOOP 
	     LOOP ;

: mmul1 ( F: -- r )
	0e
	/rsz 0 DO  /rsz 0 DO  	J /rsz *     a1 []SFLOAT
				I /rsz *     a2 []SFLOAT /rsz SDOT_1
				J /rsz * I + a3 []SFLOAT FDUP SF!  
				F+
		  	LOOP 
	     LOOP ;

: mmul3 ( F: -- r )
	0e
	/rsz 0 DO  /rsz 0 DO  	J /rsz *     a1 []SFLOAT
				I /rsz *     a2 []SFLOAT /rsz SDOT_sse2  
				J /rsz * I + a3 []SFLOAT FDUP SF!  
				F+
		  	LOOP 
	     LOOP ;

[THEN]

: TEST ( u -- )
	1 UMAX LOCAL #tries
	#tries 3 RSHIFT 1 UMAX LOCAL #mtrys
	CR ." Using " 64bitsf? IF #64 ELSE #32 ENDIF DEC. ." bits floats."
	CR ." Vector size = " /size DEC.
	filla  fillb
	CR ." mul   : " TIMER-RESET 0e #tries 0 DO  mul   F+  LOOP +E. SPACE .ELAPSED
	CR ." mul1  : " TIMER-RESET 0e #tries 0 DO  mul1  F+  LOOP +E. SPACE .ELAPSED
	/size #16 = IF   CR ." mul2  : " TIMER-RESET 0e #tries 0 DO  mul2  F+  LOOP +E. SPACE .ELAPSED  ENDIF
	CR ." mul3  : " TIMER-RESET 0e #tries 0 DO  mul3  F+  LOOP +E. SPACE .ELAPSED 

	CR ." mmul  : " TIMER-RESET 0e #mtrys 0 DO  mmul  F+  LOOP +E. SPACE .ELAPSED
	CR ." mmul1 : " TIMER-RESET 0e #mtrys 0 DO  mmul1 F+  LOOP +E. SPACE .ELAPSED 
	CR ." mmul3 : " TIMER-RESET 0e #mtrys 0 DO  mmul3 F+  LOOP +E. SPACE .ELAPSED ;

:ABOUT	CR ." Try: ( u -- ) TEST " ;

		.ABOUT -arraymulc CR
		DEPRIVE

                              (* End of Source *)

[toc] | [prev] | [next] | [standalone]

#14460

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2012-07-27 14:56 +0000
Message-ID	<2012Jul27.165605@mips.complang.tuwien.ac.at>
In reply to	#14457

mhx@iae.nl (Marcel Hendrix) writes:
>What I did is derived from the MiniBLAS sources. As SSE2 operates on 
>4 doubles at a time, speedups of around 4 are suggesting themselves. 
>However, I can find no trace of this. An obvious reason could be that
>memory throughput can not keep up with the FP units. Strange, as one 
>would expect this hardware problem to be fixed by now.

If the problem in your benchmark is that it is memory bandwidth
limited, no, that problem is not "fixed".  Much software run on PCs is
not bandwidth limited, so it would not be economical to add an
expensive memory system for more bandwidth, certainly not for every
PC.  There are "PCs" with more memory bandwidth (typically servers),
and supercomputers often have more memory bandwidth relative to FLOPS,
but even there the trend is to lower that ratio (because additional
FPUs are cheaper than additional bandwidth, and additional FPUs are
useful for some supercomputer software, and because many
supercomputers are based on PC CPUs nowadays).  GPUs tend to have a
high memory bandwidth, but also high FLOPS.

>	FORTH> 100000000 TEST
>	Using 64 bits floats.
>	mul   :  1.2400000000000000000e+0011 2.055 seconds elapsed.
>	mul1  :  1.2400000000000000000e+0011 3.141 seconds elapsed.
>	mul2  :  1.2400000000000000000e+0011 1.230 seconds elapsed.
>	mul3  :  1.2400000000000000000e+0011 1.604 seconds elapsed.
>	mmul  :  4.6000000000000000000e+0010 2.376 seconds elapsed.
>	mmul1 :  4.6000000000000000000e+0010 2.854 seconds elapsed.
>	mmul3 :  4.6000000000000000000e+0010 3.785 seconds elapsed.
>	mmul4 :  4.6000000000000000000e+0010 2.933 seconds elapsed. ok
>
>The fastest DDOT variant, MUL2, uses direct memory addressing. In 
>general, this can only work when Forth compiles at runtime (with 
>quotations :-).

Hmm, if the problem is memory bandwidth, I would expect all variants
to have the same performance (unless you use additional indirection
vectors or varying memory layouts).

>MUL1 is using indexed addressing. This is quite a lot slower (I hate 
>this, because I have put a lot of effort in the iForth compiler to 
>support it fully. After 10 years, it looks like Intel has still not 
>found a way to improve the speed of this flexible addressing scheme.)

If the memory accesses are the same, then your problem is not memory
bandwidth.

My _guess_ is that your compiler produces some stack memory stores for
some of the variants, with some stack memory fetches soon after, and
these cost quie a bit of performance.  At least they used to.  You
don't have a data-near-code problem, do you?

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2012: http://www.euroforth.org/ef12/

[toc] | [prev] | [next] | [standalone]

#14461

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-27 08:56 -0700
Message-ID	<7xy5m5mc4j.fsf@ruckus.brouhaha.com>
In reply to	#14460

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> If the problem in your benchmark is that it is memory bandwidth
> limited, no, that problem is not "fixed". 

Yeah I wouldn't have expected memory bandwidth to be the problem either,
in a sequential-access problem like dot product.  One simple test is
access all the data before beginning DDOT, to get it all into cache.
There are also some cache prefetch instructions in the x86 that let you
get the data into cache ahead of time, when you know what the access
pattern is going to be.

[toc] | [prev] | [next] | [standalone]

#14463

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2012-07-27 16:28 +0000
Message-ID	<2012Jul27.182801@mips.complang.tuwien.ac.at>
In reply to	#14461

Paul Rubin <no.email@nospam.invalid> writes:
>anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> If the problem in your benchmark is that it is memory bandwidth
>> limited, no, that problem is not "fixed". 
>
>Yeah I wouldn't have expected memory bandwidth to be the problem either,
>in a sequential-access problem like dot product.

It can be, if your data does not fit into the cache.

E.g., if you have 2 channels with DDR3-12800 memory, you get at best
25600MB/s of memory bandwidth, if your CPU core is capable of fetching
2 8-byte floats per cycle and of doing one F+ and one F* per cycle and
has 3GHz clock speed, that would be 2*8*3000=48000MB/s bandwidth
requirement; and that's just one core.

>One simple test is
>access all the data before beginning DDOT, to get it all into cache.
>There are also some cache prefetch instructions in the x86 that let you
>get the data into cache ahead of time, when you know what the access
>pattern is going to be.

For a fixed-stride access pattern the hardware prefetcher should be
able to cover that nicely, no software prefetching needed.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2012: http://www.euroforth.org/ef12/

[toc] | [prev] | [next] | [standalone]

#14476

From	David Kuehling <dvdkhlng@gmx.de>
Date	2012-07-28 01:47 +0200
Message-ID	<87k3xohile.fsf@snail.Pool>
In reply to	#14461

>>>>> "Paul" == Paul Rubin <no.email@nospam.invalid> writes:

> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> If the problem in your benchmark is that it is memory bandwidth
>> limited, no, that problem is not "fixed".

> Yeah I wouldn't have expected memory bandwidth to be the problem
> either, in a sequential-access problem like dot product.  One simple
> test is access all the data before beginning DDOT, to get it all into
> cache.  There are also some cache prefetch instructions in the x86
> that let you get the data into cache ahead of time, when you know what
> the access pattern is going to be.

I'd suggest to use n*m dot-products to compute a product of a kxn with a
mxk matrix.  This will be closer to some real-world use with (given that
k is small enough) most data accesses going to the caches.

David
-- 
GnuPG public key: http://dvdkhlng.users.sourceforge.net/dk.gpg
Fingerprint: B17A DC95 D293 657B 4205 D016 7DEF 5323 C174 7D40

[toc] | [prev] | [next] | [standalone]

#14558

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-07-31 00:28 +0200
Message-ID	<62070304968435@frunobulax.edu>
In reply to	#14476

David Kuehling <dvdkhlng@gmx.de> writes Re: SSE2

[..]

> I'd suggest to use n*m dot-products to compute a product of a kxn with a
> mxk matrix.  This will be closer to some real-world use with (given that
> k is small enough) most data accesses going to the caches.

I suppose you mean "gaxpy," not "dot," do you?

Golub and Van Loan's 'Matrix Computations' has some gems once you know what
to look for. It is possible to specify matrix multiplication in dot and
gaxpy operations (six equivalent possibilities). There is one that 
accesses data strictly row by row (!), and it is based on gaxpy. I used it
for the below iForth results.

I have now implemented DAXPY and SAXPY for (64-bit mode) SSE2, and use it 
to do inproducts and matrix multiplications for small and large vectors. 
The result is shown below. 

Summary: SSE2 does not pay off for small single-precision matrix 
multiplication (a simple pointer may win). In all other cases, it
wins by sometimes a very wide margin.

I changed the code of a previous posting to be fully general, that is, all
routines (mul, mul1, ... ) now work for any size (not only for multiples 
of 4 or 16). Therefore the shown timings are different.

Thanks for the trigger!

-marcel

-- --------------------------------------------------------------
	FORTH> 1000000 TEST-DOT
	DOT using 32 bits floats.
	Vector size = 1024
	mul          :  5.2377600000000000000e+0011 1.100 seconds elapsed.
	mul1         :  5.2377600000000000000e+0011 1.820 seconds elapsed.
	mul2         :  5.2377600000000000000e+0011 1.612 seconds elapsed.
	mul3  (SSE2) :  5.2377600000000000000e+0011 0.187 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout for mul3 may be wrong. ok

	FORTH> 100000 TEST-AXPY
	AXPY using 32 bits floats.
	Vector size = 1024
	mmul         :  1.0828195840000000000e+0014 0.569 seconds elapsed.
	mmul1        :  1.0828195840000000000e+0014 0.874 seconds elapsed.
	mmul2        :  1.0828195840000000000e+0014 1.001 seconds elapsed.
	mmul3 (SSE2) :  1.0828195840000000000e+0014 0.287 seconds elapsed.
	mmul4 (SSE2) :  6.7019493475742391600e+0017 0.230 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout for mmul3 may be wrong. ok

	FORTH> 1000000 TEST-DOT
	DOT using 64 bits floats.
	Vector size = 1024
	mul          :  5.2377600000000000000e+0011 1.089 seconds elapsed.
	mul1         :  5.2377600000000000000e+0011 1.813 seconds elapsed.
	mul2         :  5.2377600000000000000e+0011 1.627 seconds elapsed.
	mul3  (SSE2) :  5.2377600000000000000e+0011 0.365 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout for mul3 may be wrong. ok

	FORTH> 100000 TEST-AXPY
	AXPY using 64 bits floats.
	Vector size = 1024
	mmul         :  1.0828195840000000000e+0014 0.569 seconds elapsed.
	mmul1        :  1.0828195840000000000e+0014 0.869 seconds elapsed.
	mmul2        :  1.0828195840000000000e+0014 1.000 seconds elapsed.
	mmul3 (SSE2) :  1.0828195840000000000e+0014 0.343 seconds elapsed.
	mmul4 (SSE2) :  6.7015868170240000000e+0017 0.283 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout for mmul3 may be wrong. ok

	FORTH> 100000000 TEST-DOT
	DOT using 64 bits floats.
	Vector size = 16
	mul          :  1.2000000000000000000e+0010 2.121 seconds elapsed.
	mul1         :  1.2000000000000000000e+0010 3.330 seconds elapsed.
	mul2         :  1.2000000000000000000e+0010 4.689 seconds elapsed.
	mul3  (SSE2) :  1.2000000000000000000e+0010 1.602 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout for mul3 may be wrong. ok

	FORTH> 100000000 TEST-AXPY
	AXPY using 64 bits floats.
	Vector size = 16
	mmul         :  4.9000000000000000000e+0010 3.708 seconds elapsed.
	mmul1        :  4.9000000000000000000e+0010 4.332 seconds elapsed.
	mmul2        :  4.9000000000000000000e+0010 8.950 seconds elapsed.
	mmul3 (SSE2) :  4.9000000000000000000e+0010 4.304 seconds elapsed.
	mmul4 (SSE2) :  2.8750007200000000000e+0017 3.104 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout for mmul3 may be wrong. ok

	FORTH> 10000000 TEST-DOT
	DOT using 32 bits floats.
	Vector size = 16
	mul          :  1.2000000000000000000e+0009 0.230 seconds elapsed.
	mul1         :  1.2000000000000000000e+0009 0.337 seconds elapsed.
	mul2         :  1.2000000000000000000e+0009 0.514 seconds elapsed.
	mul3  (SSE2) :  1.2000000000000000000e+0009 0.144 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout for mul3 may be wrong. ok

	FORTH> 10000000 TEST-AXPY
	AXPY using 32 bits floats.
	Vector size = 16
	mmul         :  4.9000000000000000000e+0009 0.391 seconds elapsed.
	mmul1        :  4.9000000000000000000e+0009 0.430 seconds elapsed.
	mmul2        :  4.9000000000000000000e+0009 0.965 seconds elapsed.
	mmul3 (SSE2) :  4.9000000000000000000e+0009 0.445 seconds elapsed.
	mmul4 (SSE2) :  2.8791162680142440000e+0015 0.316 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout for mmul3 may be wrong. ok

[toc] | [prev] | [next] | [standalone]

#14604

From	David Kuehling <dvdkhlng@gmx.de>
Date	2012-08-01 11:27 +0200
Message-ID	<87k3xj2c9m.fsf@mosquito.pool>
In reply to	#14558

>>>>> "Marcel" == Marcel Hendrix <mhx@iae.nl> writes:

> David Kuehling <dvdkhlng@gmx.de> writes Re: SSE2 [..]

>> I'd suggest to use n*m dot-products to compute a product of a kxn
>> with a mxk matrix.  This will be closer to some real-world use with
>> (given that k is small enough) most data accesses going to the
>> caches.

> I suppose you mean "gaxpy," not "dot," do you?

I didn't really think this through to the end I guss.  Last time I did
SIMD, it was on the Cell CPU and I just made sure that data was
formatted in a way that a 4-way parallel (single-precision) dot product
made sense (i.e. everything, all matrices etc. were stored 4-way
parallel).

> Golub and Van Loan's 'Matrix Computations' has some gems once you know
> what to look for. It is possible to specify matrix multiplication in
> dot and gaxpy operations (six equivalent possibilities). There is one
> that accesses data strictly row by row (!), and it is based on
> gaxpy. I used it for the below iForth results.

Quite some time since I last had a look at the Goblub.  Looks like I
completely missed the part you meniton...

> I have now implemented DAXPY and SAXPY for (64-bit mode) SSE2, and use
> it to do inproducts and matrix multiplications for small and large
> vectors.  The result is shown below.

> Summary: SSE2 does not pay off for small single-precision matrix
> multiplication (a simple pointer may win). In all other cases, it wins
> by sometimes a very wide margin.

For the Cell CPU I got quite some performance gains by implementing loop
pipelining, i.e. fetching the data for the next loop iteration already
one loop iteration ahead, with first and last iteration peeled off the
loop to handle the special cases, and the loop body duplicated to
implement the double-buffering prefetch via two sets of registers.  I
guess with speculatively executing modern CPUs that may not bring any
performance, though (unless you plan to run your code on an Intel Atom
CPU :)

Actually I did some more optimizations to manage the excessive floating
point latencies on the cell.  On cell the floating point add is 6 cycles
latency, so however fast your multiplication step is, you won't be
faster than one iteration per 6 cycles when adding to the same
accumulator once per iteration.  I countered that by computing 8 dot
products in parallel, you may also just use N accumulators that are
added together when done (and unrolling N iterations of the loop body).

> I changed the code of a previous posting to be fully general, that is,
> all routines (mul, mul1, ... ) now work for any size (not only for
> multiples of 4 or 16). Therefore the shown timings are different.

> Thanks for the trigger!

BTW, just wondering, why aren't you just linking with libatlas and use
the BLAS api?

cheers,

David
-- 
GnuPG public key: http://dvdkhlng.users.sourceforge.net/dk.gpg
Fingerprint: B17A DC95 D293 657B 4205  D016 7DEF 5323 C174 7D40

[toc] | [prev] | [next] | [standalone]

#14622

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-08-01 20:23 +0200
Message-ID	<60121502958435@frunobulax.edu>
In reply to	#14604

David Kuehling <dvdkhlng@gmx.de> writes Re: SSE2

>>>>>> "Marcel" == Marcel Hendrix <mhx@iae.nl> writes:

>> David Kuehling <dvdkhlng@gmx.de> writes Re: SSE2 [..]
[..]
>> Summary: SSE2 does not pay off for small single-precision matrix
>> multiplication (a simple pointer may win). In all other cases, it wins
>> by sometimes a very wide margin.

> For the Cell CPU I got quite some performance gains by implementing loop
> pipelining, i.e. fetching the data for the next loop iteration already
> one loop iteration ahead, with first and last iteration peeled off the
> loop to handle the special cases, and the loop body duplicated to
> implement the double-buffering prefetch via two sets of registers.  I
> guess with speculatively executing modern CPUs that may not bring any
> performance, though (unless you plan to run your code on an Intel Atom
> CPU :)

This SSE2 code uses 8 XMMM registers, with the aligned accesses to memory  
interleaved between the addpd and mulpd instructions to prevent stalls.
The intermediate results go to 6 accumulators which are picked apart and
added outside of the main loop. The back jump takes insignificant time 
compared to the work done. There should not be many wasted cycles
there.

The most difficult thing to do was handling the cases where the number
of vector elements does not match the unrolled loop size. (Of course,
that is trivial in high-level Forth.)

Hmm, I see my matrix code does not yet work for odd rowsize (because
then a row address may become a multiple of 8, not 16, as required)

> Actually I did some more optimizations to manage the excessive floating
> point latencies on the cell.  On cell the floating point add is 6 cycles
> latency, so however fast your multiplication step is, you won't be
> faster than one iteration per 6 cycles when adding to the same
> accumulator once per iteration.  I countered that by computing 8 dot
> products in parallel, you may also just use N accumulators that are
> added together when done (and unrolling N iterations of the loop body).

Yes, that seems very similar to SSE2, although the numbers differ.

[..]

> BTW, just wondering, why aren't you just linking with libatlas and use
> the BLAS api?

I have that too (the ACML)! Somewhat to my initial surprise, that code
runs again 2* faster for midsize vectors (N=120) and even 4 times faster
for N=500 (about 21 GFLOPS). Apparently the 4 cores are all put to work.

The reason for my own tries is 

  a) curiosity 

  b) it is convenient not having to maintain too many 32/64bit MinGW projects 
     that should work on Windows, Linux and OSX, 

  c) For the very small vectors that I use everyday:

FORTH> SPECIAL-TEST4
 4x4  mm - iForth DAXPY based   474.24 MFlops,   6.32 ticks/flop,   0.000 us
 6x6  mm - iForth DAXPY based   927.40 MFlops,   3.23 ticks/flop,   0.000 us
 8x8  mm - iForth DAXPY based  1089.43 MFlops,   2.75 ticks/flop,   0.000 us
10x10 mm - iForth DAXPY based  1272.48 MFlops,   2.35 ticks/flop,   1.000 us
12x12 mm - iForth DAXPY based  1520.02 MFlops,   1.97 ticks/flop,   2.000 us
14x14 mm - iForth DAXPY based  1683.61 MFlops,   1.78 ticks/flop,   3.000 us
16x16 mm - iForth DAXPY based  1897.78 MFlops,   1.58 ticks/flop,   4.000 us
32x32 mm - iForth DAXPY based  2881.72 MFlops,   1.04 ticks/flop,  22.000 us

where the best ACML results are:

FORTH> SPECIAL-TEST2
 4x4  mm - iForth DGEMM1        411.91 MFlops,   7.28 ticks/flop,   0.000 us
 6x6  mm - iForth DGEMM1        814.29 MFlops,   3.68 ticks/flop,   0.000 us
 8x8  mm - iForth DGEMM1        193.67 MFlops,  15.48 ticks/flop,   5.000 us
10x10 mm - iForth DGEMM1        353.35 MFlops,   8.49 ticks/flop,   5.000 us
12x12 mm - iForth DGEMM1        600.39 MFlops,   4.99 ticks/flop,   5.000 us
14x14 mm - iForth DGEMM1        845.41 MFlops,   3.54 ticks/flop,   6.000 us
16x16 mm - iForth DGEMM1       1208.30 MFlops,   2.48 ticks/flop,   6.000 us
32x32 mm - iForth DGEMM1       3552.92 MFlops,   0.84 ticks/flop,  18.000 us ok

-marcel

[toc] | [prev] | [next] | [standalone]

#14748

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-08-05 21:00 +0200
Message-ID	<01031498958435@frunobulax.edu>
In reply to	#14604

mhx@iae.nl (Marcel Hendrix) writes Re: SSE2
[..]

> Hmm, I see my matrix code does not yet work for odd rowsize (because
> then a row address may become a multiple of 8, not 16, as required)

I have now fixed this, and guess what: although I use movups (unaligned 
fetch) instead of movaps (aligned fetch) everywhere, the speed drops 
less than 5 - 10%. Interesting. This so close to optimal that I didn't
special case aligned access for the iForth kernel words (see the final
results below). 

For the new tests, I only compare general purpose words -- the 
high-level Forth and SSE replacements are fully general (work 
aligned / unaligned, for any size, and no buffer is used to keep 
transposed matrices).

The main conclusion is again that SSE2 has advantages, but for short 
double vectors you won't see it.

-marcel

	=== ( singles ) ===

	FORTH> TESTS
	DOT/AXPY using 32 bits floats.
	Vector size = 16
	mul0 (dot)        :  1.2000000000000000000e+0009 0.226 seconds elapsed.
	mul1 (dot_sse2)   :  1.2000000000000000000e+0009 0.136 seconds elapsed.
	mmul0 (axpy)      :  6.0000000000000000000e+0008 0.358 seconds elapsed.
	mmul1 (axpy_sse2) :  6.0000000000000000000e+0008 0.341 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout may be wrong. ok

	FORTH> TESTS
	DOT/AXPY using 32 bits floats.
	Vector size = 1024
	mul0 (dot)        :  5.2377600000000000000e+0011 1.088 seconds elapsed.
	mul1 (dot_sse2)   :  5.2377600000000000000e+0011 0.189 seconds elapsed.
	mmul0 (axpy)      :  2.0951040000000000000e+0012 5.648 seconds elapsed.
	mmul1 (axpy_sse2) :  2.0951040000000000000e+0012 2.329 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout may be wrong. ok

	=== ( doubles ) ===

	FORTH> TESTS
	DOT/AXPY using 64 bits floats.
	Vector size = 16
	mul0 (dot)        :  1.2000000000000000000e+0009 0.237 seconds elapsed.
	mul1 (dot_sse2)   :  1.2000000000000000000e+0009 0.146 seconds elapsed.
	mmul0 (axpy)      :  6.0000000000000000000e+0008 0.435 seconds elapsed.
	mmul1 (axpy_sse2) :  6.0000000000000000000e+0008 0.349 seconds elapsed. ok

	FORTH> TESTS
	DOT/AXPY using 64 bits floats.
	Vector size = 1024
	mul0 (dot)        :  5.2377600000000000000e+0011 1.095 seconds elapsed.
	mul1 (dot_sse2)   :  5.2377600000000000000e+0011 0.370 seconds elapsed.
	mmul0 (axpy)      :  2.0951040000000000000e+0012 5.718 seconds elapsed.
	mmul1 (axpy_sse2) :  2.0951040000000000000e+0012 3.080 seconds elapsed. ok

	=== mm_old.frt ===

	500x500 mm - normal algorithm                       0.290 secs.
	500x500 mm - temporary variable in loop             0.439 secs.
	500x500 mm - unrolled inner loop, factor of 4       0.321 secs.
	500x500 mm - unrolled inner loop, factor of 8       0.294 secs.
	500x500 mm - unrolled inner loop, factor of 16      0.280 secs.
	500x500 mm - pointers used to access matrices       0.350 secs.
	500x500 mm - pointers used, unrolled by 8           0.258 secs.
	500x500 mm - transposed B matrix                    0.400 secs.
	500x500 mm - interchanged inner loops               0.423 secs.
	500x500 mm - blocking, step size of 20              0.457 secs.
	500x500 mm - Robert's algorithm                     0.064 secs.
	500x500 mm - T. Maeno's algorithm, subarray 20x20   0.372 secs.
	500x500 mm - Generic Maeno, subarray 20x20          0.392 secs.
	500x500 mm - D. Warner's algorithm, subarray 20x20  0.372 secs.
	500x500 mm - SSE2                                   0.064 secs.
	========================================================= =====
	Total using no extensions and using no hackery      4.776 secs. ok

[toc] | [prev] | [next] | [standalone]

#14481

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-07-28 15:28 +0200
Message-ID	<81078807968435@frunobulax.edu>
In reply to	#14460

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: SSE2

> mhx@iae.nl (Marcel Hendrix) writes:
>>What I did is derived from the MiniBLAS sources. As SSE2 operates on 
>>4 doubles at a time, speedups of around 4 are suggesting themselves. 
>>However, I can find no trace of this. An obvious reason could be that
>>memory throughput can not keep up with the FP units. Strange, as one 
>>would expect this hardware problem to be fixed by now.

> If the problem in your benchmark is that it is memory bandwidth
> limited, no, that problem is not "fixed". 

See my answer to Paul. The problem was that the S/DDOT code needs
large vectors to become effective. For small sizes it does almost
nothing. 

[,,]

> Hmm, if the problem is memory bandwidth, I would expect all variants
> to have the same performance (unless you use additional indirection
> vectors or varying memory layouts).

That's a good argument.

[..]
> My _guess_ is that your compiler produces some stack memory stores for
> some of the variants, with some stack memory fetches soon after, and
> these cost quie a bit of performance.  At least they used to.

I would be interested to read more about this. What is the problem here,
exactly? Is a stackframe better than push/pop?

>                                                                You
> don't have a data-near-code problem, do you?

No, I allways check for these, but I have not come across anything 
like it since quite a long time.

-marcel

[toc] | [prev] | [next] | [standalone]

#14577

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2012-07-31 14:29 +0000
Message-ID	<2012Jul31.162943@mips.complang.tuwien.ac.at>
In reply to	#14481

mhx@iae.nl (Marcel Hendrix) writes:
>anton@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: SSE2
>> My _guess_ is that your compiler produces some stack memory stores for
>> some of the variants, with some stack memory fetches soon after, and
>> these cost quie a bit of performance.  At least they used to.
>
>I would be interested to read more about this. What is the problem here,
>exactly? Is a stackframe better than push/pop?

At least in earlier CPUs a dependency through memory had a long
latency, like 15 cycles or so (and all stuff that went through memory
was affected): the data went in the store buffer and eventually went
out to cache, and only then the load could access it for real.  I
think newer CPUs have added a bypass (store-to-load-forwarding) to
make such cases faster; or maybe they just made the case faster where
the address of the store is different from the address of the load
(i.e., no bypass).

A benchmark I used was a simple counted loop where the counter was in
memory.  IIRC this needed about 15 cycles/iteration or so on Athlon
and Pentium 3.

Forth systems that are not very sophisticated (or for sophisticated
compilers, a fallback option) might storing stack items to the
stack-in memory and soon after load that iterm from the stack, and
so might run into this problem.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2012: http://www.euroforth.org/ef12/

[toc] | [prev] | [next] | [standalone]

#14462

From	albert@cherry.spenarnc.xs4all.nl (Albert van der Horst)
Date	2012-07-27 16:31 +0000
Message-ID	<5012c253$0$6982$e4fe514c@news2.news.xs4all.nl>
In reply to	#14457

In article <56151908968435@frunobulax.edu>, Marcel Hendrix <mhx@iae.nl> wrote:
>I had a fresh look at SSE2. Strangely, my tests fail to show up any 
>performance difference with the good ol' FPU.
>
>The test I did is based on an SSE2 DDOT implementation (DDOT_sse2, vector
>multiplication). I won't show the code, it is too long and specific. 
>
>What I did is derived from the MiniBLAS sources. As SSE2 operates on 
>4 doubles at a time, speedups of around 4 are suggesting themselves. 
>However, I can find no trace of this. An obvious reason could be that
>memory throughput can not keep up with the FP units. Strange, as one 
>would expect this hardware problem to be fixed by now.
>
>Does anybody know if there is (currently) a hardware limitation in 
>i7 920 CPU's? I vaguely remember having read somewhere that some corners
>are cut because not everybody is using 64bit software yet, and obviously
>clock speed 'sells' better than actual FP performance.
>
>	FORTH> 100000000 TEST
>	Using 64 bits floats.
>	mul   :  1.2400000000000000000e+0011 2.055 seconds elapsed.
>	mul1  :  1.2400000000000000000e+0011 3.141 seconds elapsed.
>	mul2  :  1.2400000000000000000e+0011 1.230 seconds elapsed.
>	mul3  :  1.2400000000000000000e+0011 1.604 seconds elapsed.
>	mmul  :  4.6000000000000000000e+0010 2.376 seconds elapsed.
>	mmul1 :  4.6000000000000000000e+0010 2.854 seconds elapsed.
>	mmul3 :  4.6000000000000000000e+0010 3.785 seconds elapsed.
>	mmul4 :  4.6000000000000000000e+0010 2.933 seconds elapsed. ok

I miss something here. What of this is FPU/stack based and what is
SSE-based?

<SNIP>

Groetjes Albert

-- 

-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

[toc] | [prev] | [standalone]

csiph-web

SSE2

Contents

#14457 — SSE2

#14459

#14480

#14460

#14461

#14463

#14476

#14558

#14604

#14622

#14748

#14481

#14577

#14462