


Re: SSE2

From: mhx@iae.nl (Marcel Hendrix)
Subject: Re: SSE2
Newsgroups: comp.lang.forth
Message-ID: <01031498958435@frunobulax.edu>
Date: 2012-08-05 21:00 +0200
References: <87k3xj2c9m.fsf@mosquito.pool>
Organization: Wanadoo



mhx@iae.nl (Marcel Hendrix) writes Re: SSE2
[..]

> Hmm, I see my matrix code does not yet work for odd rowsize (because
> then a row address may become a multiple of 8, not 16, as required)

I have now fixed this, and guess what: although I use movups (unaligned
fetch) instead of movaps (aligned fetch) everywhere, the speed drops by
less than 5 - 10%. Interesting. This is so close to optimal that I didn't
special-case aligned access for the iForth kernel words (see the final
results below).
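
To make the movups point concrete, here is a minimal C sketch of a
single-precision dot product that uses only unaligned loads. It is an
illustration under assumptions, not the iForth kernel word: the name
dot_sse_unaligned is made up, and _mm_loadu_ps is the intrinsic that
compilers normally translate to movups (the aligned counterpart,
_mm_load_ps, corresponds to movaps).

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE/SSE2 intrinsics; x86 or x86-64 only */

/* Hypothetical helper (not from iForth): dot product of two float
   vectors of any length, with no alignment requirement on x or y. */
float dot_sse_unaligned(const float *x, const float *y, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    /* Four floats per step, fetched with unaligned loads. */
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(x + i);
        __m128 b = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
    }

    /* Add up the four partial sums held in the accumulator. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for lengths that are not a multiple of 4. */
    for (; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

Swapping _mm_loadu_ps for _mm_load_ps would require every address
passed in to be a multiple of 16, which is exactly the restriction
the fix above removes.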

For the new tests, I only compare general-purpose words -- the
high-level Forth and SSE replacements are fully general (they work
on aligned or unaligned data, for any size, and use no buffer to
hold transposed matrices).
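
What "fully general" can look like for the double-precision case is
sketched below in C: an in-place axpy (y := a*x + y) that accepts any
length and any alignment and needs no scratch buffer. This is an
assumed shape for such a routine, not the iForth implementation; the
name axpy_sse2_unaligned and its argument order are hypothetical.

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics; x86 or x86-64 only */

/* Hypothetical helper (not from iForth): y[i] += a * x[i] for
   i = 0 .. n-1. No alignment is assumed and no buffer is used. */
void axpy_sse2_unaligned(double a, const double *x, double *y, size_t n)
{
    const __m128d va = _mm_set1_pd(a);   /* a in both lanes */
    size_t i = 0;

    /* Two doubles per step, unaligned load and store. */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
        _mm_storeu_pd(y + i, vy);
    }

    /* Scalar tail handles an odd length. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

Calling it on x + 1 and y + 1 of 16-byte-aligned arrays gives the
"multiple of 8, not 16" addresses that an odd row size produces.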

The main conclusion is again that SSE2 has advantages, but for short
double vectors you won't see them.
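
To put numbers on that conclusion, the small C sketch below divides
each high-level time by the matching SSE2 time from the TESTS output
in this post. The timings are copied from that output; the struct,
the function names and the reading of mul0/mmul0 as the high-level
Forth versions are assumptions made for this illustration.

```c
#include <stdio.h>

struct timing {
    const char *test;
    double t_forth;   /* seconds, mul0 / mmul0 (assumed high-level Forth) */
    double t_sse2;    /* seconds, mul1 / mmul1 (SSE2 version) */
};

/* Timings copied from the TESTS output in this post. */
static const struct timing timings[] = {
    { "dot,  32-bit, n=16",   0.226, 0.136 },
    { "axpy, 32-bit, n=16",   0.358, 0.341 },
    { "dot,  32-bit, n=1024", 1.088, 0.189 },
    { "axpy, 32-bit, n=1024", 5.648, 2.329 },
    { "dot,  64-bit, n=16",   0.237, 0.146 },
    { "axpy, 64-bit, n=16",   0.435, 0.349 },
    { "dot,  64-bit, n=1024", 1.095, 0.370 },
    { "axpy, 64-bit, n=1024", 5.718, 3.080 },
};

/* Speedup = high-level time divided by SSE2 time. */
static double speedup(const struct timing *t)
{
    return t->t_forth / t->t_sse2;
}

/* Print one line per test with its speedup factor. */
void print_speedups(void)
{
    size_t n = sizeof timings / sizeof timings[0];
    for (size_t i = 0; i < n; i++)
        printf("%-22s %.2fx\n", timings[i].test, speedup(&timings[i]));
}
```

By this measure the gain ranges from about 1.05x (32-bit axpy,
n=16) to about 5.8x (32-bit dot, n=1024).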

-marcel

	=== ( singles ) ===

	FORTH> TESTS
	DOT/AXPY using 32 bits floats.
	Vector size = 16
	mul0 (dot)        :  1.2000000000000000000e+0009 0.226 seconds elapsed.
	mul1 (dot_sse2)   :  1.2000000000000000000e+0009 0.136 seconds elapsed.
	mmul0 (axpy)      :  6.0000000000000000000e+0008 0.358 seconds elapsed.
	mmul1 (axpy_sse2) :  6.0000000000000000000e+0008 0.341 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout may be wrong. ok

	FORTH> TESTS
	DOT/AXPY using 32 bits floats.
	Vector size = 1024
	mul0 (dot)        :  5.2377600000000000000e+0011 1.088 seconds elapsed.
	mul1 (dot_sse2)   :  5.2377600000000000000e+0011 0.189 seconds elapsed.
	mmul0 (axpy)      :  2.0951040000000000000e+0012 5.648 seconds elapsed.
	mmul1 (axpy_sse2) :  2.0951040000000000000e+0012 2.329 seconds elapsed.
	Note: SINGLE maxint == 16777217, printout may be wrong. ok

	=== ( doubles ) ===

	FORTH> TESTS
	DOT/AXPY using 64 bits floats.
	Vector size = 16
	mul0 (dot)        :  1.2000000000000000000e+0009 0.237 seconds elapsed.
	mul1 (dot_sse2)   :  1.2000000000000000000e+0009 0.146 seconds elapsed.
	mmul0 (axpy)      :  6.0000000000000000000e+0008 0.435 seconds elapsed.
	mmul1 (axpy_sse2) :  6.0000000000000000000e+0008 0.349 seconds elapsed. ok

	FORTH> TESTS
	DOT/AXPY using 64 bits floats.
	Vector size = 1024
	mul0 (dot)        :  5.2377600000000000000e+0011 1.095 seconds elapsed.
	mul1 (dot_sse2)   :  5.2377600000000000000e+0011 0.370 seconds elapsed.
	mmul0 (axpy)      :  2.0951040000000000000e+0012 5.718 seconds elapsed.
	mmul1 (axpy_sse2) :  2.0951040000000000000e+0012 3.080 seconds elapsed. ok

	=== mm_old.frt ===

	500x500 mm - normal algorithm                       0.290 secs.
	500x500 mm - temporary variable in loop             0.439 secs.
	500x500 mm - unrolled inner loop, factor of 4       0.321 secs.
	500x500 mm - unrolled inner loop, factor of 8       0.294 secs.
	500x500 mm - unrolled inner loop, factor of 16      0.280 secs.
	500x500 mm - pointers used to access matrices       0.350 secs.
	500x500 mm - pointers used, unrolled by 8           0.258 secs.
	500x500 mm - transposed B matrix                    0.400 secs.
	500x500 mm - interchanged inner loops               0.423 secs.
	500x500 mm - blocking, step size of 20              0.457 secs.
	500x500 mm - Robert's algorithm                     0.064 secs.
	500x500 mm - T. Maeno's algorithm, subarray 20x20   0.372 secs.
	500x500 mm - Generic Maeno, subarray 20x20          0.392 secs.
	500x500 mm - D. Warner's algorithm, subarray 20x20  0.372 secs.
	500x500 mm - SSE2                                   0.064 secs.
	========================================================= =====
	Total using no extensions and using no hackery      4.776 secs. ok



