| From | David Kuehling <dvdkhlng@gmx.de> |
|---|---|
| Newsgroups | comp.lang.forth |
| Subject | Re: SSE2 |
| Date | 2012-08-01 11:27 +0200 |
| Message-ID | <87k3xj2c9m.fsf@mosquito.pool> |
| References | <87k3xohile.fsf@snail.Pool> <62070304968435@frunobulax.edu> |
>>>>> "Marcel" == Marcel Hendrix <mhx@iae.nl> writes:

> David Kuehling <dvdkhlng@gmx.de> writes Re: SSE2 [..]
>> I'd suggest to use n*m dot-products to compute a product of a kxn
>> with a mxk matrix. This will be closer to some real-world use with
>> (given that k is small enough) most data accesses going to the
>> caches.

> I suppose you mean "gaxpy," not "dot," do you?

I didn't really think this through to the end, I guess. The last time I did SIMD was on the Cell CPU, and I just made sure that the data was laid out so that a 4-way parallel (single-precision) dot product made sense (i.e. everything, all matrices etc., was stored 4-way parallel).

> Golub and Van Loan's 'Matrix Computations' has some gems once you know
> what to look for. It is possible to specify matrix multiplication in
> dot and gaxpy operations (six equivalent possibilities). There is one
> that accesses data strictly row by row (!), and it is based on
> gaxpy. I used it for the below iForth results.

It has been quite some time since I last had a look at Golub. Looks like I completely missed the part you mention...

> I have now implemented DAXPY and SAXPY for (64-bit mode) SSE2, and use
> it to do inproducts and matrix multiplications for small and large
> vectors. The result is shown below.

> Summary: SSE2 does not pay off for small single-precision matrix
> multiplication (a simple pointer may win). In all other cases, it wins
> by sometimes a very wide margin.

For the Cell CPU I got quite some performance gains by implementing loop pipelining, i.e. fetching the data for the next loop iteration one iteration ahead, with the first and last iterations peeled off the loop to handle the special cases, and the loop body duplicated to implement the double-buffering prefetch via two sets of registers.
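The loop pipelining scheme described above can be sketched in plain C. This is only an illustration under stated assumptions, not the Cell code referred to in the post: it uses a scalar single-precision dot product, the function name `dot_pipelined` is made up for the example, and the "two sets of registers" are modelled by the local variable pairs `a0`/`b0` and `a1`/`b1`. Whether a compiler keeps the loads ahead of the arithmetic as written depends on the compiler and target.

```c
#include <stddef.h>

/* Hypothetical example: dot product with a software-pipelined loop.
   The operands for the next multiply-add are loaded before the current
   multiply-add is performed. */
float dot_pipelined(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    if (n == 0)
        return sum;

    /* Prologue (peeled first iteration): fetch the first operand pair. */
    float a0 = a[0], b0 = b[0];
    size_t i = 1;

    /* Steady state: the body is duplicated so that two register sets
       alternate between being loaded and being consumed. */
    for (; i + 1 < n; i += 2) {
        float a1 = a[i], b1 = b[i];   /* load set 1 (element i)     */
        sum += a0 * b0;               /* consume set 0              */
        a0 = a[i + 1];                /* load set 0 (element i + 1) */
        b0 = b[i + 1];
        sum += a1 * b1;               /* consume set 1              */
    }

    /* Epilogue (peeled last iterations): at most one element is still
       unloaded, and set 0 always holds one pending operand pair. */
    if (i < n) {
        float a1 = a[i], b1 = b[i];
        sum += a0 * b0;
        a0 = a1;
        b0 = b1;
    }
    sum += a0 * b0;
    return sum;
}
```

The products are added in the same order as in a straightforward loop, so the result is bit-for-bit the same; only the placement of the loads changes.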
I guess that with modern, speculatively executing CPUs such loop pipelining may not bring any performance gain, though (unless you plan to run your code on an Intel Atom CPU :)

Actually, I did some more optimizations to manage the excessive floating-point latencies on the Cell. On the Cell, a floating-point add has a latency of 6 cycles, so however fast your multiplication step is, you won't be faster than one iteration per 6 cycles when adding to the same accumulator once per iteration. I countered that by computing 8 dot products in parallel. You may also just use N accumulators that are added together when done (unrolling N iterations of the loop body).

> I changed the code of a previous posting to be fully general, that is,
> all routines (mul, mul1, ... ) now work for any size (not only for
> multiples of 4 or 16). Therefore the shown timings are different.

> Thanks for the trigger!

BTW, just wondering, why aren't you just linking with libatlas and using the BLAS API?

cheers,

David

-- 
GnuPG public key: http://dvdkhlng.users.sourceforge.net/dk.gpg
Fingerprint: B17A DC95 D293 657B 4205 D016 7DEF 5323 C174 7D40
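As an illustration of the N-accumulator idea mentioned in the message above, here is a minimal C sketch. It is not code from the thread: the function name `dot_multi_acc` and the choice N = 4 (`NACC`) are assumptions made for the example. With N independent accumulators, each one is updated only once every N elements, so an add latency of 6 cycles bounds the loop at roughly 6/N cycles per element instead of 6.

```c
#include <stddef.h>

#define NACC 4  /* number of independent accumulators (illustrative choice) */

/* Hypothetical example: dot product that spreads the additions over NACC
   accumulators so that consecutive adds do not depend on each other. */
float dot_multi_acc(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Loop body unrolled NACC times; each accumulator forms its own
       dependency chain. */
    for (; i + NACC <= n; i += NACC) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Remainder: sizes that are not a multiple of NACC. */
    for (; i < n; i++)
        acc0 += a[i] * b[i];

    /* Add the partial sums together when done. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the partial sums are combined in a different order than in a single-accumulator loop, the floating-point result can differ in the last bits for general inputs.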