Re: SSE2

From	David Kuehling <dvdkhlng@gmx.de>
Newsgroups	comp.lang.forth
Subject	Re: SSE2
Date	2012-08-01 11:27 +0200
Message-ID	<87k3xj2c9m.fsf@mosquito.pool>
References	<87k3xohile.fsf@snail.Pool> <62070304968435@frunobulax.edu>

>>>>> "Marcel" == Marcel Hendrix <mhx@iae.nl> writes:

> David Kuehling <dvdkhlng@gmx.de> writes Re: SSE2 [..]

>> I'd suggest to use n*m dot-products to compute a product of a kxn
>> with a mxk matrix.  This will be closer to some real-world use with
>> (given that k is small enough) most data accesses going to the
>> caches.

> I suppose you mean "gaxpy," not "dot," do you?

I didn't really think this through to the end, I guess.  Last time I did
SIMD, it was on the Cell CPU and I just made sure that data was
formatted in a way that a 4-way parallel (single-precision) dot product
made sense (i.e. everything, all matrices etc. were stored 4-way
parallel).

> Golub and Van Loan's 'Matrix Computations' has some gems once you know
> what to look for. It is possible to specify matrix multiplication in
> dot and gaxpy operations (six equivalent possibilities). There is one
> that accesses data strictly row by row (!), and it is based on
> gaxpy. I used it for the below iForth results.

It's been quite some time since I last had a look at Golub.  Looks like
I completely missed the part you mention...
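
For readers without the book at hand, here is a minimal sketch of one
row-oriented ordering: the i-k-j loop order, where the innermost loop is
an axpy over rows.  It assumes row-major storage.  The function name
matmul_row_axpy and its argument layout are made up for illustration,
and this is not necessarily the exact variant used for the iForth
timings.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n) = A (m x k) * B (k x n), all row-major.
 * Hypothetical helper: i-k-j loop order, so A, B and C are each
 * traversed along rows only; the inner loop is an axpy
 * (row of C += scalar * row of B). */
void matmul_row_axpy(size_t m, size_t k, size_t n,
                     const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        double *ci = c + i * n;            /* row i of C */
        for (size_t j = 0; j < n; j++)
            ci[j] = 0.0;
        for (size_t p = 0; p < k; p++) {
            double aip = a[i * k + p];     /* scalar A[i][p] */
            const double *bp = b + p * n;  /* row p of B */
            for (size_t j = 0; j < n; j++)
                ci[j] += aip * bp[j];      /* unit-stride axpy */
        }
    }
}
```

The inner loop runs with unit stride over both operands, which is the
access pattern a SIMD axpy routine wants.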

> I have now implemented DAXPY and SAXPY for (64-bit mode) SSE2, and use
> it to do inproducts and matrix multiplications for small and large
> vectors.  The result is shown below.

> Summary: SSE2 does not pay off for small single-precision matrix
> multiplication (a simple pointer may win). In all other cases, it wins
> by sometimes a very wide margin.

For the Cell CPU I got quite some performance gains by implementing loop
pipelining, i.e. fetching the data for the next loop iteration one
iteration ahead, with the first and last iterations peeled off the loop
to handle the special cases, and the loop body duplicated to implement
the double-buffering prefetch via two sets of registers.  I guess that
on modern, speculatively executing CPUs this may not gain any
performance, though (unless you plan to run your code on an Intel Atom
CPU :)
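
The structure of such a pipelined loop can be sketched in plain C as
follows.  This is only an illustration of the shape (peeled first and
last iterations, duplicated body, two alternating sets of "registers"),
not the Cell code: the function name dot_pipelined is made up, and a C
compiler is free to reschedule the loads anyway, so on a real in-order
target one would write this with intrinsics or in assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with the loads issued one iteration ahead of the
 * arithmetic.  Hypothetical helper for illustration only. */
float dot_pipelined(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    if (n == 0)
        return 0.0f;

    float sum = 0.0f;
    float xa = x[0], ya = y[0];   /* prologue: fetch iteration 0 (set A) */
    float xb, yb;                 /* second register set (set B) */
    size_t i = 1;

    /* Steady state, body duplicated: on entry set A holds element i-1. */
    for (; i + 1 < n; i += 2) {
        xb = x[i];     yb = y[i];       /* fetch element i into set B   */
        sum += xa * ya;                 /* compute element i-1 (set A)  */
        xa = x[i + 1]; ya = y[i + 1];   /* fetch element i+1 into set A */
        sum += xb * yb;                 /* compute element i (set B)    */
    }

    /* Epilogue: drain what is still in flight. */
    if (i < n) {                  /* one element not yet fetched */
        xb = x[i]; yb = y[i];
        sum += xa * ya;
        sum += xb * yb;
    } else {                      /* only set A is pending */
        sum += xa * ya;
    }
    return sum;
}
```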

Actually I did some more optimizations to manage the excessive floating
point latencies on the Cell.  On the Cell the floating point add has a
latency of 6 cycles, so however fast your multiplication step is, you
won't be faster than one iteration per 6 cycles when adding to the same
accumulator once per iteration.  I countered that by computing 8 dot
products in parallel.  You may also just use N accumulators that are
added together when done (unrolling N iterations of the loop body).
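
A minimal sketch of the N-accumulator variant, here with N = 4 in scalar
C.  The name dot4acc and the choice of four accumulators are just for
illustration; to fully hide a 6-cycle add latency one would want at
least as many independent accumulators as the latency in cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent accumulators, so consecutive adds
 * do not wait on each other's result.  Hypothetical helper. */
float dot4acc(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    /* Unrolled by 4: each accumulator has its own dependency chain. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i]     * y[i];
        a1 += x[i + 1] * y[i + 1];
        a2 += x[i + 2] * y[i + 2];
        a3 += x[i + 3] * y[i + 3];
    }

    /* Combine the partial sums once at the end. */
    float sum = (a0 + a1) + (a2 + a3);

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Note that this changes the order of the additions, so the result can
differ from the single-accumulator loop in the last bits.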

> I changed the code of a previous posting to be fully general, that is,
> all routines (mul, mul1, ... ) now work for any size (not only for
> multiples of 4 or 16). Therefore the shown timings are different.

> Thanks for the trigger!

BTW, just wondering, why aren't you just linking with libatlas and
using the BLAS API?

cheers,

David
-- 
GnuPG public key: http://dvdkhlng.users.sourceforge.net/dk.gpg
Fingerprint: B17A DC95 D293 657B 4205  D016 7DEF 5323 C174 7D40
