Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.forth > #13587 > unrolled thread
| Started by | Arnold Doray <invalid@invalid.com> |
|---|---|
| First post | 2012-07-05 07:13 +0000 |
| Last post | 2012-07-25 14:08 +0000 |
| Articles | 20 on this page of 53 — 10 participants |
Back to article view | Back to comp.lang.forth
Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-05 07:13 +0000
Re: Permutation Tensor "A. K." <akk@nospam.org> - 2012-07-05 09:25 +0200
Re: Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-05 08:42 +0000
Re: Permutation Tensor "A. K." <akk@nospam.org> - 2012-07-05 11:25 +0200
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-06 18:49 -0700
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 08:32 +0200
Re: Permutation Tensor "A. K." <akk@nospam.org> - 2012-07-07 09:49 +0200
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 05:35 -0700
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 15:45 +0200
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 07:32 -0700
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 17:53 +0200
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 09:10 -0700
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 18:41 +0200
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 12:24 -0700
Re: Permutation Tensor Gerry Jackson <gerry@jackson9000.fsnet.co.uk> - 2012-07-08 07:48 +0100
Re: Permutation Tensor Gerry Jackson <gerry@jackson9000.fsnet.co.uk> - 2012-07-08 08:09 +0100
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-08 09:23 +0200
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-08 05:55 -0700
Re: Permutation Tensor Gerry Jackson <gerry@jackson9000.fsnet.co.uk> - 2012-07-08 20:14 +0100
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-08 13:49 -0700
Re: Permutation Tensor Paul Rubin <no.email@nospam.invalid> - 2012-07-07 13:03 -0700
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 23:54 +0200
Re: Permutation Tensor Paul Rubin <no.email@nospam.invalid> - 2012-07-07 16:43 -0700
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-08 09:07 +0200
Re: Permutation Tensor Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-07-08 13:03 +0000
Re: Permutation Tensor anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-09 14:50 +0000
Re: Permutation Tensor m.a.m.hendrix@tue.nl - 2012-07-10 00:28 -0700
Re: Permutation Tensor anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-10 10:11 +0000
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-10 21:37 +0200
Re: Permutation Tensor anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-11 13:50 +0000
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-10 05:10 -0700
Re: Permutation Tensor Paul Rubin <no.email@nospam.invalid> - 2012-07-06 20:46 -0700
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 05:36 -0700
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-06 18:18 -0700
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-06 18:33 -0700
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 07:28 -0700
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 08:02 -0700
Re: Permutation Tensor Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-07-07 19:58 +0000
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-08 05:45 -0700
Re: Permutation Tensor Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-07-08 20:05 +0000
Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-08 14:24 +0200
Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-11 07:09 +0200
Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-11 18:42 +0200
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-11 20:18 +0200
Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-12 12:43 +0200
Re: Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-25 14:07 +0000
Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-27 13:42 +0200
Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-27 15:39 +0200
Re: Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-30 04:52 +0000
Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-08 23:44 +0200
Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-10 17:06 -0700
Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-11 07:09 +0200
Re: Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-25 14:08 +0000
Page 2 of 3 — ← Prev page 1 [2] 3 Next page →
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-07-07 13:03 -0700 |
| Message-ID | <7xzk7bxrsy.fsf@ruckus.brouhaha.com> |
| In reply to | #13674 |
mhx@iae.nl (Marcel Hendrix) writes: > Nice register based code (when using the iForth PARAMS|), but the > multiplications apparently slow it down (eps). With Standard LOCALS| the > code is prepostorously slow (eps2). That's kind of interesting about the multiplications, on a modern machine. Can you change them to additions (so it will give the wrong answer) and see if it's faster? Thanks.
[toc] | [prev] | [next] | [standalone]
| From | mhx@iae.nl (Marcel Hendrix) |
|---|---|
| Date | 2012-07-07 23:54 +0200 |
| Message-ID | <60811296968435@frunobulax.edu> |
| In reply to | #13680 |
Paul Rubin <no.email@nospam.invalid> writes Re: Permutation Tensor
> mhx@iae.nl (Marcel Hendrix) writes:
> > Nice register based code (when using the iForth PARAMS|), but the
> > multiplications apparently slow it down (eps). With Standard LOCALS| the
> > code is prepostorously slow (eps2).
> That's kind of interesting about the multiplications, on a modern
> machine. Can you change them to additions (so it will give the wrong
> answer) and see if it's faster? Thanks.
My hypothesis was wrong, the multiplication isn't important.
I completely rewrote the test because the loop overhead was too high
with respect to the work done by the lcsymbol code. With the new test,
all non-local implementations have approximately the same speed, and the
locals add a sizable overhead (because iForth doesn't inline words that
use locals.)
Also interesting: lcsymbol4 is very fast, but only for 64 bit code where
more free registers are available to the code generator.
-marcel
-- ----------
ANEW -lcsymbol
\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1
: lcsymbol ( i j k -- n )
>r >r 10 * r> + 10 * r> +
case
123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof
0 swap
endcase
;
4 BASE !
: lcsymbol2 ( i j k -- n )
swap 2 lshift or swap 10 lshift or
case
123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof
0 swap
endcase ;
DECIMAL
1 VALUE a
2 VALUE b
3 VALUE c
CREATE tab #64 CHARS ALLOT tab #64 CONST-DATA
MARKER -tab
: cases ( ix -- val )
case [ 4 base ! ]
012 of 2 endof
120 of 2 endof
201 of 2 endof
021 of 0 endof
102 of 0 endof
210 of 0 endof
[ decimal ]
1 swap
endcase ;
: lcsymbolX ( i j k -- n )
1-
swap 1- 2 lshift or
swap 1- 4 lshift or cases 1- ;
:NONAME tab #64 0 DO I cases OVER C! CHAR+ LOOP drop ; EXECUTE -tab
\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1
: lcsymbol3 ( i j k -- n ) 1- swap 1- 2 lshift or swap 1- 4 lshift or tab + C@ 1- ;
create lctable 1639701 , 1316133 , 1381905 , lctable 3 cells CONST-DATA
\ i, j, k are zero based!
: lcsymbol4 ( i j k -- n )
cells lctable + @ \ return the k^th tensor plane
>r
3 lshift over 2* + nip \ compute the bit offset
r> swap rshift 3 and 1-
;
: eps params| i j k | ( i j k -- n ) i j - j k - * k i - * 2/ ;
: eps2 LOCALS| k j i | ( i j k -- n ) i j - j k - * k i - * 2/ ;
: weps params| i j k | ( i j k -- n ) i j - j k - + k i - + 2/ ;
: weps2 LOCALS| k j i | ( i j k -- n ) i j - j k - + k i - + 2/ ;
: TEST ( n1 -- n2 ) 0 SWAP 0 DO a b c lcsymbol + a b c lcsymbol + a b c lcsymbol + a b c lcsymbol + LOOP drop ;
: TEST2 ( n1 -- n2 ) 0 SWAP 0 DO a b c lcsymbol2 + a b c lcsymbol2 + a b c lcsymbol2 + a b c lcsymbol2 + LOOP drop ;
: TEST3 ( n1 -- n2 ) 0 SWAP 0 DO a b c lcsymbol3 + a b c lcsymbol3 + a b c lcsymbol3 + a b c lcsymbol3 + LOOP drop ;
: TEST4 ( n1 -- n2 ) 0 SWAP 0 DO a b c lcsymbol4 + a b c lcsymbol4 + a b c lcsymbol4 + a b c lcsymbol4 + LOOP drop ;
: TESTe ( n1 -- n2 ) 0 SWAP 0 DO a b c eps + a b c eps + a b c eps + a b c eps + LOOP drop ;
: TESTe2 ( n1 -- n2 ) 0 SWAP 0 DO a b c eps2 + a b c eps2 + a b c eps2 + a b c eps2 + LOOP drop ;
: TESTwe ( n1 -- n2 ) 0 SWAP 0 DO a b c weps + a b c weps + a b c weps + a b c weps + LOOP drop ;
: TESTwe2 ( n1 -- n2 ) 0 SWAP 0 DO a b c weps2 + a b c weps2 + a b c weps2 + a b c weps2 + LOOP drop ;
: TOPTEST ( u -- )
2/ 2/ 1 UMAX LOCAL #times
CR ." \ lcsymbol : " timer-reset #times TEST .elapsed
CR ." \ lcsymbol2 : " timer-reset #times TEST2 .elapsed
CR ." \ lcsymbol3 : " timer-reset #times TEST3 .elapsed
CR ." \ lcsymbol4 : " timer-reset #times TEST4 .elapsed
CR ." \ eps : " timer-reset #times TESTe .elapsed
CR ." \ eps2 : " timer-reset #times TESTe2 .elapsed
CR ." \ weps : " timer-reset #times TESTwe .elapsed
CR ." \ weps2 : " timer-reset #times TESTwe2 .elapsed ;
\ 1,000,000,000 toptest ( 64bit code)
\ lcsymbol : 2.495 seconds elapsed.
\ lcsymbol2 : 2.217 seconds elapsed.
\ lcsymbol3 : 2.075 seconds elapsed.
\ lcsymbol4 : 1.778 seconds elapsed.
\ eps : 2.281 seconds elapsed.
\ eps2 : 7.390 seconds elapsed.
\ weps : 2.307 seconds elapsed.
\ weps2 : 7.264 seconds elapsed. ok
\ 1,000,000,000 toptest ( 32bit code)
\ lcsymbol : 2.400 seconds elapsed.
\ lcsymbol2 : 2.093 seconds elapsed.
\ lcsymbol3 : 2.069 seconds elapsed.
\ lcsymbol4 : 4.320 seconds elapsed.
\ eps : 2.265 seconds elapsed.
\ eps2 : 7.352 seconds elapsed.
\ weps : 2.298 seconds elapsed.
\ weps2 : 7.234 seconds elapsed. ok
[toc] | [prev] | [next] | [standalone]
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-07-07 16:43 -0700 |
| Message-ID | <7xpq87b0i2.fsf@ruckus.brouhaha.com> |
| In reply to | #13681 |
mhx@iae.nl (Marcel Hendrix) writes: > all non-local implementations have approximately the same speed, and the > locals add a sizable overhead (because iForth doesn't inline words that > use locals.) Thanks, how about this? : eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;
[toc] | [prev] | [next] | [standalone]
| From | mhx@iae.nl (Marcel Hendrix) |
|---|---|
| Date | 2012-07-08 09:07 +0200 |
| Message-ID | <17969495968435@frunobulax.edu> |
| In reply to | #13685 |
Paul Rubin <no.email@nospam.invalid> writes Re: Permutation Tensor > mhx@iae.nl (Marcel Hendrix) writes: >> all non-local implementations have approximately the same speed, and the >> locals add a sizable overhead (because iForth doesn't inline words that >> use locals.) > Thanks, how about this? > : eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ; FORTH> 1000000000 toptest \ lcsymbol : 2.409 seconds elapsed. \ lcsymbol2 : 2.136 seconds elapsed. \ lcsymbol3 : 2.016 seconds elapsed. \ lcsymbol4 : 1.711 seconds elapsed. \ eps : 2.221 seconds elapsed. \ eps2 : 7.174 seconds elapsed. \ eps3 : 2.221 seconds elapsed. \ weps : 2.225 seconds elapsed. \ weps2 : 7.008 seconds elapsed. ok As you can see, eps and eps3 differ insignificantly. Thanks to your questions, I looked into it a little further. Instead of doing only a b c eps3, I tried a b c eps3 1 b c eps3a, 1 2 c eps3ab, and 1 2 3 eps3abc (i.e. some arguments are constants instead of variables). The 1 2 3 eps3abc will calculate lcsymbol at compile time, therefore the loop overhead time comes out at 0.5 seconds. The other three implementations cause the word to fetch up to three cells from memory: \ eps3 : 2.221 seconds elapsed. ( fetch 3 cells ) \ eps3a : 1.579 seconds elapsed. ( fetch 2 cells ) \ eps3ab : 1.399 seconds elapsed. ( fetch 1 cell ) \ eps3abc : 0.513 seconds elapsed. ( compute at compile-time) The run-time difference for these implementations appears to be caused by their number of memory accesses. -marcel
[toc] | [prev] | [next] | [standalone]
| From | Albert van der Horst <albert@spenarnc.xs4all.nl> |
|---|---|
| Date | 2012-07-08 13:03 +0000 |
| Message-ID | <m6uexj.fjr@spenarnc.xs4all.nl> |
| In reply to | #13685 |
In article <7xpq87b0i2.fsf@ruckus.brouhaha.com>,
Paul Rubin <no.email@nospam.invalid> wrote:
>mhx@iae.nl (Marcel Hendrix) writes:
>> all non-local implementations have approximately the same speed, and the
>> locals add a sizable overhead (because iForth doesn't inline words that
>> use locals.)
>
>Thanks, how about this?
>
> : eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;
>
I come up with this:
(now tested)
The idea is once you established that the modulo 3 sum,
you need only inspect two to find out what the eps is.
"
REQUIRE CASE-INSENSITIVE
CASE-INSENSITIVE
REQUIRE 2>R
: -ROT POSTPONE ROT POSTPONE ROT ; IMMEDIATE
: eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;
\ Ensure proper mathematical mod.
: MOD' >R S>D R> FM/MOD DROP ;
: eps2 >R 2DUP + R> + 3 MOD IF 2DROP 0 ELSE SWAP - 1+ 3 MOD' 1- THEN ;
\ Less elegant works with floored mod too.
: eps2' >R 2DUP + R> + 3 MOD IF 2DROP 0 ELSE SWAP - 4 + 3 MOD 1- THEN ;
: test 4 1 DO I 4 1 DO 4 1 DO
DUP . J . I .
DUP J I EPS3 .
DUP J I EPS2 .
DUP J I EPS2' .
CR
LOOP LOOP DROP LOOP ;
"
In the FORTRAN era I hated the trick to use multiplication
to combine cases. But I'm sure your code will win hands down
on a Pentium for lack of branching.
The mod could be a table lookup, as the range is only -2..6
Groetjes Albert
--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2012-07-09 14:50 +0000 |
| Message-ID | <2012Jul9.165012@mips.complang.tuwien.ac.at> |
| In reply to | #13672 |
mhx@iae.nl (Marcel Hendrix) writes:
>The difference between branching and table lookup is unexpectedly large
>- I thought that memory was supposed to be slow. Apparently branching is
>becoming very expensive.
Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
Mispredicted branches are much slower (10-20 cycles).
Of course, if you miss all caches (several hundred cycles) and compare
that with a correctly predicted branch (~1 cycle) things look
differently.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2012: http://www.euroforth.org/ef12/
[toc] | [prev] | [next] | [standalone]
| From | m.a.m.hendrix@tue.nl |
|---|---|
| Date | 2012-07-10 00:28 -0700 |
| Message-ID | <63d9a45c-0796-4b49-ae16-c5420efdae71@googlegroups.com> |
| In reply to | #13732 |
On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote:
> mhx@iae.nl (Marcel Hendrix) writes:
>> The difference between branching and table lookup is unexpectedly large
>> - I thought that memory was supposed to be slow. Apparently branching is
>> becoming very expensive.
>
> Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
> Mispredicted branches are much slower (10-20 cycles).
It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
4 BASE !
: lcsymbol2 ( i j k -- n )
swap 2 lshift or swap 10 lshift or
case
\ 123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof
123 of 1 endof
0 swap
endcase ;
DECIMAL
Note that the test harness in effect performs "1 2 3 lcsymbol2."
FORTH> 1000000000 TOPTEST ( original)
\ lcsymbol : 1.959 seconds elapsed.
\ lcsymbol2 : 1.750 seconds elapsed.
FORTH> 1000000000 TOPTEST ( swapped test cases)
\ lcsymbol : 1.957 seconds elapsed.
\ lcsymbol2 : 4.478 seconds elapsed.
This makes the jumptable approach a clear favorite.
-marcel
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2012-07-10 10:11 +0000 |
| Message-ID | <2012Jul10.121108@mips.complang.tuwien.ac.at> |
| In reply to | #13754 |
m.a.m.hendrix@tue.nl writes:
>On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote:
>> mhx@iae.nl (Marcel Hendrix) writes:
>>> The difference between branching and table lookup is unexpectedly large
>>> - I thought that memory was supposed to be slow. Apparently branching is
>>> becoming very expensive.
>>
>> Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
>> Mispredicted branches are much slower (10-20 cycles).
>
>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
>
>4 BASE !
>: lcsymbol2 ( i j k -- n )
> swap 2 lshift or swap 10 lshift or
> case
>\ 123 of 1 endof
> 231 of 1 endof
> 312 of 1 endof
> 132 of -1 endof
> 213 of -1 endof
> 321 of -1 endof
>
> 123 of 1 endof
> 0 swap
> endcase ;
>DECIMAL
>
>Note that the test harness in effect performs "1 2 3 lcsymbol2."
Hmm, sounds like perfectly predictable branches. Then it's probably
something else. You can check the number of mispredictions with
performance counters.
>FORTH> 1000000000 TOPTEST ( original)
>\ lcsymbol : 1.959 seconds elapsed.
>\ lcsymbol2 : 1.750 seconds elapsed.
>
>FORTH> 1000000000 TOPTEST ( swapped test cases)
>\ lcsymbol : 1.957 seconds elapsed.
>\ lcsymbol2 : 4.478 seconds elapsed.
>
>This makes the jumptable approach a clear favorite.
Can you post the machine code for the second LCSYMBOL2?
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2012: http://www.euroforth.org/ef12/
[toc] | [prev] | [next] | [standalone]
| From | mhx@iae.nl (Marcel Hendrix) |
|---|---|
| Date | 2012-07-10 21:37 +0200 |
| Message-ID | <62661493968435@frunobulax.edu> |
| In reply to | #13765 |
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: Permutation Tensor
> m.a.m.hendrix@tue.nl writes:
[..]
>>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
[..]
> Hmm, sounds like perfectly predictable branches. Then it's probably
> something else. You can check the number of mispredictions with
> performance counters.
>>This makes the jumptable approach a clear favorite.
[ Krishna: should have been lookup table ]
> Can you post the machine code for the second LCSYMBOL2?
4 BASE !
: lcsymbl2a ( i j k -- n )
swap 2 lshift or swap 10 lshift or
case
\ 123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof
123 of 1 endof
0 swap
endcase ;
DECIMAL
1 VALUE a
2 VALUE b
3 VALUE c
: TEST2a ( n -- )
0 SWAP 0 DO a b c lcsymbl2a +
a b c lcsymbl2a +
a b c lcsymbl2a +
a b c lcsymbl2a +
LOOP drop ;
FORTH> ' test2a idis
$0124B540 : [trashed]
$0124B54A pop rbx
$0124B54B push 0 b#
$0124B54D push rbx
$0124B54E xor rbx, rbx
$0124B551 pop rcx
$0124B552 call (DO) offset NEAR
$0124B55C lea rax, [rax 0 +] qword
$0124B560 mov rdi, $012497A0 qword-offset
$0124B567 lea rdi, [rdi*4 0 +] qword
$0124B56F or rdi, $012497C0 qword-offset
$0124B576 mov rax, $01249780 qword-offset
$0124B57D shl rax, 4 b#
$0124B581 or rdi, rax
$0124B584 cmp rdi, #45 b#
$0124B588 push rbx
$0124B589 mov rbx, rcx
$0124B58C mov rcx, rdi
$0124B58F mov rbx, rcx
$0124B592 jne $0124B5A4 offset NEAR
$0124B598 mov rbx, 1 d#
$0124B59F jmp $0124B615 offset NEAR
$0124B5A4 cmp rbx, #54 b#
$0124B5A8 jne $0124B5BA offset NEAR
$0124B5AE mov rbx, 1 d#
$0124B5B5 jmp $0124B615 offset NEAR
$0124B5BA cmp rbx, #30 b#
$0124B5BE jne $0124B5D0 offset NEAR
$0124B5C4 mov rbx, -1 d#
$0124B5CB jmp $0124B615 offset NEAR
$0124B5D0 cmp rbx, #39 b#
$0124B5D4 jne $0124B5E6 offset NEAR
$0124B5DA mov rbx, -1 d#
$0124B5E1 jmp $0124B615 offset NEAR
$0124B5E6 cmp rbx, #57 b#
$0124B5EA jne $0124B5FC offset NEAR
$0124B5F0 mov rbx, -1 d#
$0124B5F7 jmp $0124B615 offset NEAR
$0124B5FC cmp rbx, #27 b#
$0124B600 jne $0124B612 offset NEAR
$0124B606 mov rbx, 1 d#
$0124B60D jmp $0124B615 offset NEAR
$0124B612 push 0 b#
$0124B614 pop rbx
$0124B615 pop rdi
$0124B616 mov rax, $012497A0 qword-offset
$0124B61D lea rax, [rax*4 0 +] qword
$0124B625 or rax, $012497C0 qword-offset
$0124B62C mov rdx, $01249780 qword-offset
$0124B633 shl rdx, 4 b#
$0124B637 or rax, rdx
$0124B63A lea rdx, [rdi rbx*1] qword
[ etc. ]
$0124B837 push 0 b#
$0124B839 pop rbx
$0124B83A pop rdi
$0124B83B lea rbx, [rdi rbx*1] qword
$0124B83F add [rbp 0 +] qword, 1 b#
$0124B844 add [rbp 8 +] qword, 1 b#
$0124B849 jno $0124B560 offset NEAR
$0124B84F add rbp, #24 b#
$0124B853 ;
-marcel
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2012-07-11 13:50 +0000 |
| Message-ID | <2012Jul11.155058@mips.complang.tuwien.ac.at> |
| In reply to | #13792 |
mhx@iae.nl (Marcel Hendrix) writes:
>anton@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: Permutation Tensor
>
>> m.a.m.hendrix@tue.nl writes:
>[..]
>>>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
>[..]
>> Hmm, sounds like perfectly predictable branches. Then it's probably
>> something else. You can check the number of mispredictions with
>> performance counters.
[...]
>$0124B592 jne $0124B5A4 offset NEAR
>$0124B598 mov rbx, 1 d#
>$0124B59F jmp $0124B615 offset NEAR
>$0124B5A4 cmp rbx, #54 b#
>$0124B5A8 jne $0124B5BA offset NEAR
>$0124B5AE mov rbx, 1 d#
>$0124B5B5 jmp $0124B615 offset NEAR
>$0124B5BA cmp rbx, #30 b#
>$0124B5BE jne $0124B5D0 offset NEAR
>$0124B5C4 mov rbx, -1 d#
>$0124B5CB jmp $0124B615 offset NEAR
>$0124B5D0 cmp rbx, #39 b#
>$0124B5D4 jne $0124B5E6 offset NEAR
>$0124B5DA mov rbx, -1 d#
>$0124B5E1 jmp $0124B615 offset NEAR
>$0124B5E6 cmp rbx, #57 b#
>$0124B5EA jne $0124B5FC offset NEAR
>$0124B5F0 mov rbx, -1 d#
>$0124B5F7 jmp $0124B615 offset NEAR
>$0124B5FC cmp rbx, #27 b#
>$0124B600 jne $0124B612 offset NEAR
>$0124B606 mov rbx, 1 d#
>$0124B60D jmp $0124B615 offset NEAR
[...]
Hmm, I remember something about the number of branch prediction table
entries per cache line or something being limited. Maybe that's the
reason for the low performance.
If I wanted to verify this, I would first use performance counters to
check if there is a high number of branch mispredictions, and if so,
insert some padding after the "jmp"s to see if it helps.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2012: http://www.euroforth.org/ef12/
[toc] | [prev] | [next] | [standalone]
| From | Krishna Myneni <krishna.myneni@ccreweb.org> |
|---|---|
| Date | 2012-07-10 05:10 -0700 |
| Message-ID | <dc2a99c8-57b7-4153-bb72-06017be2a3f6@r3g2000yqh.googlegroups.com> |
| In reply to | #13754 |
On Jul 10, 2:28 am, m.a.m.hend...@tue.nl wrote: > On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote: > > m...@iae.nl (Marcel Hendrix) writes: > >> The difference between branching and table lookup is unexpectedly large > >> - I thought that memory was supposed to be slow. Apparently branching is > >> becoming very expensive. > > > Cache hits are relatively fast (3-4 cycles for a L1 cache hit). > > Mispredicted branches are much slower (10-20 cycles). > > It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2: > ... > This makes the jumptable approach a clear favorite. > > -marcel Lookup table or jump table? Krishna
[toc] | [prev] | [next] | [standalone]
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-07-06 20:46 -0700 |
| Message-ID | <7xzk7c6xo1.fsf@ruckus.brouhaha.com> |
| In reply to | #13595 |
"A. K." <akk@nospam.org> writes:
> then CASEs
> n = 123 or 312 or 231 --> eps = 1
> n = 132 or 321 or 213 --> eps = -1
> otherwise --> eps = 0
With locals:
: eps { i j k -- n }
i j - j k - * k i - * 2/ ;
[toc] | [prev] | [next] | [standalone]
| From | Krishna Myneni <krishna.myneni@ccreweb.org> |
|---|---|
| Date | 2012-07-07 05:36 -0700 |
| Message-ID | <5728b7da-2bef-4cad-bfd4-664ba077aafc@b20g2000yqg.googlegroups.com> |
| In reply to | #13656 |
On Jul 6, 10:46 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "A. K." <a...@nospam.org> writes:
> > then CASEs
> > n = 123 or 312 or 231 --> eps = 1
> > n = 132 or 321 or 213 --> eps = -1
> > otherwise --> eps = 0
>
> With locals:
>
> : eps { i j k -- n }
> i j - j k - * k i - * 2/ ;
Nice.
KM
[toc] | [prev] | [next] | [standalone]
| From | Krishna Myneni <krishna.myneni@ccreweb.org> |
|---|---|
| Date | 2012-07-06 18:18 -0700 |
| Message-ID | <1e0e4259-7a48-4ca9-8ec0-84c532adda6a@e37g2000yqn.googlegroups.com> |
| In reply to | #13587 |
On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
> What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> Thanks,
> Arnold
I would use a precomputed binary lookup table -- there are only 27
elements x 2 bits/element = 54 bits, so the entire tensor should fit
into a single 64-bit word. Now, you need a mapping function from {i,
j, k} to the start bit position, but that's easy.
Krishna
[toc] | [prev] | [next] | [standalone]
| From | Krishna Myneni <krishna.myneni@ccreweb.org> |
|---|---|
| Date | 2012-07-06 18:33 -0700 |
| Message-ID | <706a4816-489e-44eb-acd0-5bc352bb298d@b20g2000yqg.googlegroups.com> |
| In reply to | #13652 |
On Jul 6, 8:18 pm, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
> On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
>
> > What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> > Thanks,
> > Arnold
>
> I would use a precomputed binary lookup table -- there are only 27
> elements x 2 bits/element = 54 bits, so the entire tensor should fit
> into a single 64-bit word. Now, you need a mapping function from {i,
> j, k} to the start bit position, but that's easy.
>
> Krishna
Spoke too soon. A.K.'s method is easier.
Krishna
[toc] | [prev] | [next] | [standalone]
| From | Krishna Myneni <krishna.myneni@ccreweb.org> |
|---|---|
| Date | 2012-07-07 07:28 -0700 |
| Message-ID | <3d6d29ee-9f0d-4012-9cee-9be429d600ac@a34g2000yqi.googlegroups.com> |
| In reply to | #13652 |
On Jul 6, 8:18 pm, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
> On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
>
> > What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> > Thanks,
> > Arnold
>
> I would use a precomputed binary lookup table -- there are only 27
> elements x 2 bits/element = 54 bits, so the entire tensor should fit
> into a single 64-bit word. Now, you need a mapping function from {i,
> j, k} to the start bit position, but that's easy.
>
> Krishna
Just for fun, I went ahead and implemented a lookup table version
also.
\ WARNING: this version assumes a minimum 32 bit cell size and
\ little endian ordering! Also, i,j,k are zero-based!
\ We will use 3 cells to hold the tensor. Each cell will
\ represent one plane of the 3D tensor. k will index the cell.
\ i and j will index the bits in the cell.
create lctable 1639701 , 1316133 , 1381905 ,
\ i, j, k are zero based!
: lcsym4 ( i j k -- n )
cells lctable + @ \ return the k^th tensor plane
>r
3 lshift over 2* + nip \ compute the bit offset
r> swap rshift 3 and 1-
;
[toc] | [prev] | [next] | [standalone]
| From | Krishna Myneni <krishna.myneni@ccreweb.org> |
|---|---|
| Date | 2012-07-07 08:02 -0700 |
| Message-ID | <f2156282-b4ba-483d-8a7b-08e300cfcbc0@h10g2000yqn.googlegroups.com> |
| In reply to | #13668 |
On Jul 7, 9:28 am, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
> On Jul 6, 8:18 pm, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
>
> > On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
>
> > > What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> > > Thanks,
> > > Arnold
>
> > I would use a precomputed binary lookup table -- there are only 27
> > elements x 2 bits/element = 54 bits, so the entire tensor should fit
> > into a single 64-bit word. Now, you need a mapping function from {i,
> > j, k} to the start bit position, but that's easy.
>
> > Krishna
>
> Just for fun, I went ahead and implemented a lookup table version
> also.
>
> \ WARNING: this version assumes a minimum 32 bit cell size and
> \ little endian ordering! Also, i,j,k are zero-based!
>
The warning is incorrect. The code should work fine on a big endian
system as well, provided the cell size is at least 32 bits.
KM
[toc] | [prev] | [next] | [standalone]
| From | Albert van der Horst <albert@spenarnc.xs4all.nl> |
|---|---|
| Date | 2012-07-07 19:58 +0000 |
| Message-ID | <m6t3i7.6ey@spenarnc.xs4all.nl> |
| In reply to | #13587 |
In article <jt3erh$cd2$1@dont-email.me>,
Arnold Doray <invalid@invalid.com> wrote:
>What's the best way to implement the 3D permutation tensor E(I,J,K)?
Something along
: eps
3DUP + + 3 MOD IF 3DROP 0 EXIT THEN
/ At this point we know that all indices are different.
BEGIN DUP 3 <> WHILE ROT REPEAT
DROP < 2* 1-
;
Come on boys, cases!
>
>Thanks,
>Arnold
--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
[toc] | [prev] | [next] | [standalone]
| From | Krishna Myneni <krishna.myneni@ccreweb.org> |
|---|---|
| Date | 2012-07-08 05:45 -0700 |
| Message-ID | <6774ceb8-dbe5-47db-890f-2f49eddcfe3b@d24g2000yqh.googlegroups.com> |
| In reply to | #13678 |
On Jul 7, 2:58 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
wrote:
> In article <jt3erh$cd...@dont-email.me>,
> Arnold Doray <inva...@invalid.com> wrote:
>
> >What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> Something along
> : eps
> 3DUP + + 3 MOD IF 3DROP 0 EXIT THEN
> / At this point we know that all indices are different.
> BEGIN DUP 3 <> WHILE ROT REPEAT
> DROP < 2* 1-
> ;
>
The above seems to go into an infinite loop for me.
: epsh ( i j k -- n )
( 3DUP) 2dup 2>r 2 pick 2r>
+ + 3 MOD IF ( 3DROP) 2drop drop 0 EXIT THEN
\ At this point we know that all indices are different.
BEGIN DUP 3 <> WHILE ROT REPEAT
DROP < 2* 1-
;
1 1 1 epsh
KM
[toc] | [prev] | [next] | [standalone]
| From | Albert van der Horst <albert@spenarnc.xs4all.nl> |
|---|---|
| Date | 2012-07-08 20:05 +0000 |
| Message-ID | <m6uyhe.n7x@spenarnc.xs4all.nl> |
| In reply to | #13706 |
In article <6774ceb8-dbe5-47db-890f-2f49eddcfe3b@d24g2000yqh.googlegroups.com>, Krishna Myneni <krishna.myneni@ccreweb.org> wrote: >On Jul 7, 2:58=A0pm, Albert van der Horst <alb...@spenarnc.xs4all.nl> >wrote: >> In article <jt3erh$cd...@dont-email.me>, >> Arnold Doray =A0<inva...@invalid.com> wrote: >> >> >What's the best way to implement the 3D permutation tensor E(I,J,K)? >> >> Something along >> : eps >> =A0 =A0 =A0 =A0 3DUP + + 3 MOD IF 3DROP 0 EXIT THEN >> =A0 =A0 =A0 =A0 / At this point we know that all indices are different. >> =A0 =A0 =A0 =A0 BEGIN DUP 3 <> WHILE ROT REPEAT >> =A0 =A0 =A0 =A0 DROP < 2* 1- >> ; >> > >The above seems to go into an infinite loop for me. > >: epsh ( i j k -- n ) > ( 3DUP) 2dup 2>r 2 pick 2r> > + + 3 MOD IF ( 3DROP) 2drop drop 0 EXIT THEN > \ At this point we know that all indices are different. > BEGIN DUP 3 <> WHILE ROT REPEAT > DROP < 2* 1- >; > >1 1 1 epsh Too hasty! I overlooked the possibility of 3 equals and didn't test. > > >KM > Groetjes Albert -- -- Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- being exponential -- ultimately falters. albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
[toc] | [prev] | [next] | [standalone]
Page 2 of 3 — ← Prev page 1 [2] 3 Next page →
Back to top | Article view | comp.lang.forth
csiph-web