Groups > comp.lang.forth > #13587 > unrolled thread

Permutation Tensor

Started by	Arnold Doray <invalid@invalid.com>
First post	2012-07-05 07:13 +0000
Last post	2012-07-25 14:08 +0000
Articles	20 on this page of 53 — 10 participants

Back to article view | Back to comp.lang.forth

  Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-05 07:13 +0000
    Re: Permutation Tensor "A. K." <akk@nospam.org> - 2012-07-05 09:25 +0200
      Re: Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-05 08:42 +0000
        Re: Permutation Tensor "A. K." <akk@nospam.org> - 2012-07-05 11:25 +0200
          Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-06 18:49 -0700
            Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 08:32 +0200
              Re: Permutation Tensor "A. K." <akk@nospam.org> - 2012-07-07 09:49 +0200
              Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 05:35 -0700
                 Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 15:45 +0200
                  Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 07:32 -0700
                    Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 17:53 +0200
                      Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 09:10 -0700
                        Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 18:41 +0200
                          Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 12:24 -0700
                            Re: Permutation Tensor Gerry Jackson <gerry@jackson9000.fsnet.co.uk> - 2012-07-08 07:48 +0100
                              Re: Permutation Tensor Gerry Jackson <gerry@jackson9000.fsnet.co.uk> - 2012-07-08 08:09 +0100
                              Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-08 09:23 +0200
                              Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-08 05:55 -0700
                                Re: Permutation Tensor Gerry Jackson <gerry@jackson9000.fsnet.co.uk> - 2012-07-08 20:14 +0100
                                  Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-08 13:49 -0700
                          Re: Permutation Tensor Paul Rubin <no.email@nospam.invalid> - 2012-07-07 13:03 -0700
                            Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-07 23:54 +0200
                              Re: Permutation Tensor Paul Rubin <no.email@nospam.invalid> - 2012-07-07 16:43 -0700
                                Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-08 09:07 +0200
                                Re: Permutation Tensor Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-07-08 13:03 +0000
                      Re: Permutation Tensor anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-09 14:50 +0000
                        Re: Permutation Tensor m.a.m.hendrix@tue.nl - 2012-07-10 00:28 -0700
                          Re: Permutation Tensor anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-10 10:11 +0000
                            Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-10 21:37 +0200
                              Re: Permutation Tensor anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2012-07-11 13:50 +0000
                          Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-10 05:10 -0700
          Re: Permutation Tensor Paul Rubin <no.email@nospam.invalid> - 2012-07-06 20:46 -0700
            Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 05:36 -0700
    Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-06 18:18 -0700
      Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-06 18:33 -0700
      Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 07:28 -0700
        Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-07 08:02 -0700
    Re: Permutation Tensor Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-07-07 19:58 +0000
      Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-08 05:45 -0700
        Re: Permutation Tensor Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-07-08 20:05 +0000
    Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-08 14:24 +0200
      Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-11 07:09 +0200
        Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-11 18:42 +0200
          Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-11 20:18 +0200
            Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-12 12:43 +0200
              Re: Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-25 14:07 +0000
                Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-27 13:42 +0200
                  Re: Permutation Tensor mhx@iae.nl (Marcel Hendrix) - 2012-07-27 15:39 +0200
                    Re: Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-30 04:52 +0000
    Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-08 23:44 +0200
      Re: Permutation Tensor Krishna Myneni <krishna.myneni@ccreweb.org> - 2012-07-10 17:06 -0700
        Re: Permutation Tensor awegel@arcor.de (Alex Wegel) - 2012-07-11 07:09 +0200
    Re: Permutation Tensor Arnold Doray <invalid@invalid.com> - 2012-07-25 14:08 +0000

Page 2 of 3 — ← Prev page 1 [2] 3 Next page →

#13680

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-07 13:03 -0700
Message-ID	<7xzk7bxrsy.fsf@ruckus.brouhaha.com>
In reply to	#13674

mhx@iae.nl (Marcel Hendrix) writes:
> Nice register based code (when using the iForth PARAMS|), but the 
> multiplications apparently slow it down (eps). With Standard LOCALS| the 
> code is prepostorously slow (eps2).

That's kind of interesting about the multiplications, on a modern
machine.  Can you change them to additions (so it will give the wrong
answer) and see if it's faster?  Thanks.

[toc] | [prev] | [next] | [standalone]

#13681

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-07-07 23:54 +0200
Message-ID	<60811296968435@frunobulax.edu>
In reply to	#13680

Paul Rubin <no.email@nospam.invalid> writes Re: Permutation Tensor

> mhx@iae.nl (Marcel Hendrix) writes:
> > Nice register based code (when using the iForth PARAMS|), but the 
> > multiplications apparently slow it down (eps). With Standard LOCALS| the 
> > code is prepostorously slow (eps2).

> That's kind of interesting about the multiplications, on a modern
> machine.  Can you change them to additions (so it will give the wrong
> answer) and see if it's faster?  Thanks.

My hypothesis was wrong, the multiplication isn't important.

I completely rewrote the test because the loop overhead was too high 
with respect to the work done by the lcsymbol code. With the new test, 
all non-local implementations have approximately the same speed, and the
locals add a sizable overhead (because iForth doesn't inline words that 
use locals.)

Also interesting: lcsymbol4 is very fast, but only for 64 bit code where 
more free registers are available to the code generator.

-marcel

-- ----------
ANEW -lcsymbol

\ For i,j,k  in {1,2,3}, returns n = -1, 0, or 1
: lcsymbol ( i j k -- n )
    >r >r 10 * r> + 10 * r> +
    case
      123 of  1 endof
      231 of  1 endof
      312 of  1 endof
      132 of -1 endof
      213 of -1 endof
      321 of -1 endof
      0 swap
    endcase
;

4 BASE !
: lcsymbol2 ( i j k -- n )
    swap 2 lshift or swap 10 lshift or  	
    case 
       123 of  1 endof
       231 of  1 endof
       312 of  1 endof
       132 of -1 endof
       213 of -1 endof
       321 of -1 endof
      0 swap
    endcase ;
DECIMAL

1 VALUE a
2 VALUE b
3 VALUE c

CREATE tab #64 CHARS ALLOT tab #64 CONST-DATA

MARKER -tab
: cases ( ix -- val )
    case [ 4 base ! ]
       012 of  2 endof
       120 of  2 endof
       201 of  2 endof
       021 of  0 endof
       102 of  0 endof
       210 of  0 endof
       [ decimal ]
      1 swap
    endcase ;

: lcsymbolX ( i j k -- n )
         1- 
    swap 1- 2 lshift or 
    swap 1- 4 lshift or cases 1- ;

:NONAME	tab  #64 0 DO  I cases OVER C! CHAR+ LOOP drop ; EXECUTE -tab

\ For i,j,k  in {1,2,3}, returns n = -1, 0, or 1
: lcsymbol3 ( i j k -- n ) 1- swap 1- 2 lshift or  swap 1- 4 lshift or tab + C@ 1- ;

create lctable  1639701 , 1316133 , 1381905 , lctable 3 cells CONST-DATA

\ i, j, k are zero based!

: lcsymbol4 ( i j k -- n )
    cells lctable + @  \ return the k^th tensor plane
    >r
    3 lshift over 2* + nip  \ compute the bit offset
    r> swap rshift 3 and 1-
;

: eps   params| i j k | ( i j k -- n )  i j - j k - * k i - * 2/ ; 
: eps2  LOCALS| k j i | ( i j k -- n )  i j - j k - * k i - * 2/ ; 
: weps  params| i j k | ( i j k -- n )  i j - j k - + k i - + 2/ ; 
: weps2 LOCALS| k j i | ( i j k -- n )  i j - j k - + k i - + 2/ ; 

: TEST    ( n1 -- n2 ) 0 SWAP 0 DO  a b c lcsymbol  + a b c lcsymbol  + a b c lcsymbol  + a b c lcsymbol  +  LOOP drop ;
: TEST2   ( n1 -- n2 ) 0 SWAP 0 DO  a b c lcsymbol2 + a b c lcsymbol2 + a b c lcsymbol2 + a b c lcsymbol2 +  LOOP drop ;
: TEST3   ( n1 -- n2 ) 0 SWAP 0 DO  a b c lcsymbol3 + a b c lcsymbol3 + a b c lcsymbol3 + a b c lcsymbol3 +  LOOP drop ;
: TEST4   ( n1 -- n2 ) 0 SWAP 0 DO  a b c lcsymbol4 + a b c lcsymbol4 + a b c lcsymbol4 + a b c lcsymbol4 +  LOOP drop ;
: TESTe   ( n1 -- n2 ) 0 SWAP 0 DO  a b c eps       + a b c eps       + a b c eps       + a b c eps       +  LOOP drop ;
: TESTe2  ( n1 -- n2 ) 0 SWAP 0 DO  a b c eps2      + a b c eps2      + a b c eps2      + a b c eps2      +  LOOP drop ;
: TESTwe  ( n1 -- n2 ) 0 SWAP 0 DO  a b c weps      + a b c weps      + a b c weps      + a b c weps      +  LOOP drop ;
: TESTwe2 ( n1 -- n2 ) 0 SWAP 0 DO  a b c weps2     + a b c weps2     + a b c weps2     + a b c weps2     +  LOOP drop ;

: TOPTEST ( u -- )
	2/ 2/ 1 UMAX LOCAL #times
	CR ." \ lcsymbol  : " timer-reset #times TEST    .elapsed
	CR ." \ lcsymbol2 : " timer-reset #times TEST2   .elapsed
	CR ." \ lcsymbol3 : " timer-reset #times TEST3   .elapsed
	CR ." \ lcsymbol4 : " timer-reset #times TEST4   .elapsed 
	CR ." \ eps       : " timer-reset #times TESTe   .elapsed 
	CR ." \ eps2      : " timer-reset #times TESTe2  .elapsed
	CR ." \ weps      : " timer-reset #times TESTwe  .elapsed 
	CR ." \ weps2     : " timer-reset #times TESTwe2 .elapsed ;

\ 1,000,000,000 toptest ( 64bit code)
\ lcsymbol  : 2.495 seconds elapsed.
\ lcsymbol2 : 2.217 seconds elapsed.
\ lcsymbol3 : 2.075 seconds elapsed.
\ lcsymbol4 : 1.778 seconds elapsed.
\ eps       : 2.281 seconds elapsed.
\ eps2      : 7.390 seconds elapsed.
\ weps      : 2.307 seconds elapsed.
\ weps2     : 7.264 seconds elapsed. ok

\ 1,000,000,000 toptest ( 32bit code)
\ lcsymbol  : 2.400 seconds elapsed.
\ lcsymbol2 : 2.093 seconds elapsed.
\ lcsymbol3 : 2.069 seconds elapsed.
\ lcsymbol4 : 4.320 seconds elapsed.
\ eps       : 2.265 seconds elapsed.
\ eps2      : 7.352 seconds elapsed.
\ weps      : 2.298 seconds elapsed.
\ weps2     : 7.234 seconds elapsed. ok

[toc] | [prev] | [next] | [standalone]

#13685

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-07 16:43 -0700
Message-ID	<7xpq87b0i2.fsf@ruckus.brouhaha.com>
In reply to	#13681

mhx@iae.nl (Marcel Hendrix) writes:
> all non-local implementations have approximately the same speed, and the
> locals add a sizable overhead (because iForth doesn't inline words that 
> use locals.)

Thanks, how about this?

  : eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;

[toc] | [prev] | [next] | [standalone]

#13695

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-07-08 09:07 +0200
Message-ID	<17969495968435@frunobulax.edu>
In reply to	#13685

Paul Rubin <no.email@nospam.invalid> writes Re: Permutation Tensor

> mhx@iae.nl (Marcel Hendrix) writes:
>> all non-local implementations have approximately the same speed, and the
>> locals add a sizable overhead (because iForth doesn't inline words that 
>> use locals.)

> Thanks, how about this?

>  : eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ; 

FORTH> 1000000000 toptest
\ lcsymbol  : 2.409 seconds elapsed.
\ lcsymbol2 : 2.136 seconds elapsed.
\ lcsymbol3 : 2.016 seconds elapsed.
\ lcsymbol4 : 1.711 seconds elapsed.
\ eps       : 2.221 seconds elapsed.
\ eps2      : 7.174 seconds elapsed.
\ eps3      : 2.221 seconds elapsed.
\ weps      : 2.225 seconds elapsed.
\ weps2     : 7.008 seconds elapsed. ok

As you can see, eps and eps3 differ insignificantly.

Thanks to your questions, I looked into it a little further.
Instead of doing only a b c eps3, I tried  a b c eps3
1 b c eps3a, 1 2 c eps3ab, and 1 2 3 eps3abc (i.e. some
arguments are constants instead of variables). 

The 1 2 3 eps3abc will calculate lcsymbol at compile time, 
therefore the loop overhead time comes out at 0.5 seconds.

The other three implementations cause the word to fetch 
up to three cells from memory:

\ eps3      : 2.221 seconds elapsed. ( fetch 3 cells )
\ eps3a     : 1.579 seconds elapsed. ( fetch 2 cells )
\ eps3ab    : 1.399 seconds elapsed. ( fetch 1 cell )
\ eps3abc   : 0.513 seconds elapsed. ( compute at compile-time)

The run-time difference for these implementations
appears to be caused by their number of memory accesses.

-marcel

[toc] | [prev] | [next] | [standalone]

#13702

From	Albert van der Horst <albert@spenarnc.xs4all.nl>
Date	2012-07-08 13:03 +0000
Message-ID	<m6uexj.fjr@spenarnc.xs4all.nl>
In reply to	#13685

In article <7xpq87b0i2.fsf@ruckus.brouhaha.com>,
Paul Rubin  <no.email@nospam.invalid> wrote:
>mhx@iae.nl (Marcel Hendrix) writes:
>> all non-local implementations have approximately the same speed, and the
>> locals add a sizable overhead (because iForth doesn't inline words that
>> use locals.)
>
>Thanks, how about this?
>
>  : eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;
>

I come up with this:
(now tested)

The idea is once you established that the modulo 3 sum,
you need only inspect two to find out what the eps is.

"
REQUIRE CASE-INSENSITIVE
CASE-INSENSITIVE
REQUIRE 2>R

: -ROT POSTPONE ROT POSTPONE ROT ; IMMEDIATE

: eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;


\ Ensure proper mathematical mod.
: MOD'  >R S>D R> FM/MOD DROP ;

: eps2  >R 2DUP + R> + 3 MOD IF  2DROP 0  ELSE  SWAP - 1+ 3 MOD' 1- THEN ;

\ Less elegant works with floored mod too.
: eps2'  >R 2DUP + R> + 3 MOD IF  2DROP 0  ELSE  SWAP - 4 + 3 MOD 1- THEN ;

: test 4 1 DO I         4 1 DO  4 1 DO
    DUP . J . I .
    DUP  J I EPS3 .
    DUP  J I EPS2 .
    DUP  J I EPS2' .
    CR
LOOP LOOP DROP LOOP ;

"

In the FORTRAN era I hated the trick to use multiplication
to combine cases. But I'm sure your code will win hands down
on a Pentium for lack of branching.

The mod could be a table lookup, as the range is only -2..6

Groetjes Albert

--
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

[toc] | [prev] | [next] | [standalone]

#13732

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2012-07-09 14:50 +0000
Message-ID	<2012Jul9.165012@mips.complang.tuwien.ac.at>
In reply to	#13672

mhx@iae.nl (Marcel Hendrix) writes:
>The difference between branching and table lookup is unexpectedly large 
>- I thought that memory was supposed to be slow. Apparently branching is 
>becoming very expensive.

Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
Mispredicted branches are much slower (10-20 cycles).

Of course, if you miss all caches (several hundred cycles) and compare
that with a correctly predicted branch (~1 cycle) things look
differently.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2012: http://www.euroforth.org/ef12/

[toc] | [prev] | [next] | [standalone]

#13754

From	m.a.m.hendrix@tue.nl
Date	2012-07-10 00:28 -0700
Message-ID	<63d9a45c-0796-4b49-ae16-c5420efdae71@googlegroups.com>
In reply to	#13732

On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote:
> mhx@iae.nl (Marcel Hendrix) writes:
>> The difference between branching and table lookup is unexpectedly large 
>> - I thought that memory was supposed to be slow. Apparently branching is 
>> becoming very expensive.
> 
> Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
> Mispredicted branches are much slower (10-20 cycles).

It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:

4 BASE !
: lcsymbol2 ( i j k -- n )
    swap 2 lshift or swap 10 lshift or  	
    case 
\      123 of  1   endof
       231 of  1   endof
       312 of  1   endof
       132 of -1   endof
       213 of -1   endof
       321 of -1   endof

       123 of  1   endof
      0 swap
    endcase ;
DECIMAL

Note that the test harness in effect performs "1 2 3 lcsymbol2."

FORTH> 1000000000 TOPTEST ( original)
\ lcsymbol   : 1.959 seconds elapsed.
\ lcsymbol2  : 1.750 seconds elapsed.

FORTH> 1000000000 TOPTEST ( swapped test cases)
\ lcsymbol   : 1.957 seconds elapsed.
\ lcsymbol2  : 4.478 seconds elapsed.

This makes the jumptable approach a clear favorite.

-marcel

[toc] | [prev] | [next] | [standalone]

#13765

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2012-07-10 10:11 +0000
Message-ID	<2012Jul10.121108@mips.complang.tuwien.ac.at>
In reply to	#13754

m.a.m.hendrix@tue.nl writes:
>On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote:
>> mhx@iae.nl (Marcel Hendrix) writes:
>>> The difference between branching and table lookup is unexpectedly large 
>>> - I thought that memory was supposed to be slow. Apparently branching is 
>>> becoming very expensive.
>> 
>> Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
>> Mispredicted branches are much slower (10-20 cycles).
>
>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
>
>4 BASE !
>: lcsymbol2 ( i j k -- n )
>    swap 2 lshift or swap 10 lshift or  	
>    case 
>\      123 of  1   endof
>       231 of  1   endof
>       312 of  1   endof
>       132 of -1   endof
>       213 of -1   endof
>       321 of -1   endof
>
>       123 of  1   endof
>      0 swap
>    endcase ;
>DECIMAL
>
>Note that the test harness in effect performs "1 2 3 lcsymbol2."

Hmm, sounds like perfectly predictable branches.  Then it's probably
something else.  You can check the number of mispredictions with
performance counters.

>FORTH> 1000000000 TOPTEST ( original)
>\ lcsymbol   : 1.959 seconds elapsed.
>\ lcsymbol2  : 1.750 seconds elapsed.
>
>FORTH> 1000000000 TOPTEST ( swapped test cases)
>\ lcsymbol   : 1.957 seconds elapsed.
>\ lcsymbol2  : 4.478 seconds elapsed.
>
>This makes the jumptable approach a clear favorite.

Can you post the machine code for the second LCSYMBOL2?

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2012: http://www.euroforth.org/ef12/

[toc] | [prev] | [next] | [standalone]

#13792

From	mhx@iae.nl (Marcel Hendrix)
Date	2012-07-10 21:37 +0200
Message-ID	<62661493968435@frunobulax.edu>
In reply to	#13765

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: Permutation Tensor

> m.a.m.hendrix@tue.nl writes:
[..]
>>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
[..]
> Hmm, sounds like perfectly predictable branches.  Then it's probably
> something else.  You can check the number of mispredictions with
> performance counters.

>>This makes the jumptable approach a clear favorite.

[ Krishna: should have been lookup table ]

> Can you post the machine code for the second LCSYMBOL2?


4 BASE !
: lcsymbl2a ( i j k -- n )
    swap 2 lshift or swap 10 lshift or  	
    case 
\      123 of  1 endof
       231 of  1 endof
       312 of  1 endof
       132 of -1 endof
       213 of -1 endof
       321 of -1 endof
       123 of  1 endof
      0 swap
    endcase ;
DECIMAL

1 VALUE a 
2 VALUE b 
3 VALUE c 

: TEST2a  ( n -- ) 
	0 SWAP 0 DO  	a b c lcsymbl2a + 
			a b c lcsymbl2a + 
			a b c lcsymbl2a + 
			a b c lcsymbl2a +  
	       LOOP  	drop ;

FORTH> ' test2a idis
$0124B540  : [trashed]
$0124B54A  pop           rbx
$0124B54B  push          0 b#
$0124B54D  push          rbx
$0124B54E  xor           rbx, rbx
$0124B551  pop           rcx
$0124B552  call          (DO) offset NEAR
$0124B55C  lea           rax, [rax 0 +] qword

$0124B560  mov           rdi, $012497A0 qword-offset
$0124B567  lea           rdi, [rdi*4 0 +] qword
$0124B56F  or            rdi, $012497C0 qword-offset
$0124B576  mov           rax, $01249780 qword-offset
$0124B57D  shl           rax, 4 b#
$0124B581  or            rdi, rax
$0124B584  cmp           rdi, #45 b#
$0124B588  push          rbx
$0124B589  mov           rbx, rcx
$0124B58C  mov           rcx, rdi
$0124B58F  mov           rbx, rcx
$0124B592  jne           $0124B5A4 offset NEAR
$0124B598  mov           rbx, 1 d#
$0124B59F  jmp           $0124B615 offset NEAR
$0124B5A4  cmp           rbx, #54 b#
$0124B5A8  jne           $0124B5BA offset NEAR
$0124B5AE  mov           rbx, 1 d#
$0124B5B5  jmp           $0124B615 offset NEAR
$0124B5BA  cmp           rbx, #30 b#
$0124B5BE  jne           $0124B5D0 offset NEAR
$0124B5C4  mov           rbx, -1 d#
$0124B5CB  jmp           $0124B615 offset NEAR
$0124B5D0  cmp           rbx, #39 b#
$0124B5D4  jne           $0124B5E6 offset NEAR
$0124B5DA  mov           rbx, -1 d#
$0124B5E1  jmp           $0124B615 offset NEAR
$0124B5E6  cmp           rbx, #57 b#
$0124B5EA  jne           $0124B5FC offset NEAR
$0124B5F0  mov           rbx, -1 d#
$0124B5F7  jmp           $0124B615 offset NEAR
$0124B5FC  cmp           rbx, #27 b#
$0124B600  jne           $0124B612 offset NEAR
$0124B606  mov           rbx, 1 d#
$0124B60D  jmp           $0124B615 offset NEAR
$0124B612  push          0 b#
$0124B614  pop           rbx
$0124B615  pop           rdi
$0124B616  mov           rax, $012497A0 qword-offset
$0124B61D  lea           rax, [rax*4 0 +] qword
$0124B625  or            rax, $012497C0 qword-offset
$0124B62C  mov           rdx, $01249780 qword-offset
$0124B633  shl           rdx, 4 b#
$0124B637  or            rax, rdx
$0124B63A  lea           rdx, [rdi rbx*1] qword

[ etc. ]

$0124B837  push          0 b#
$0124B839  pop           rbx
$0124B83A  pop           rdi
$0124B83B  lea           rbx, [rdi rbx*1] qword
$0124B83F  add           [rbp 0 +] qword, 1 b#
$0124B844  add           [rbp 8 +] qword, 1 b#
$0124B849  jno           $0124B560 offset NEAR
$0124B84F  add           rbp, #24 b#
$0124B853  ;

-marcel

[toc] | [prev] | [next] | [standalone]

#13838

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2012-07-11 13:50 +0000
Message-ID	<2012Jul11.155058@mips.complang.tuwien.ac.at>
In reply to	#13792

mhx@iae.nl (Marcel Hendrix) writes:
>anton@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: Permutation Tensor
>
>> m.a.m.hendrix@tue.nl writes:
>[..]
>>>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
>[..]
>> Hmm, sounds like perfectly predictable branches.  Then it's probably
>> something else.  You can check the number of mispredictions with
>> performance counters.
[...]
>$0124B592  jne           $0124B5A4 offset NEAR
>$0124B598  mov           rbx, 1 d#
>$0124B59F  jmp           $0124B615 offset NEAR
>$0124B5A4  cmp           rbx, #54 b#
>$0124B5A8  jne           $0124B5BA offset NEAR
>$0124B5AE  mov           rbx, 1 d#
>$0124B5B5  jmp           $0124B615 offset NEAR
>$0124B5BA  cmp           rbx, #30 b#
>$0124B5BE  jne           $0124B5D0 offset NEAR
>$0124B5C4  mov           rbx, -1 d#
>$0124B5CB  jmp           $0124B615 offset NEAR
>$0124B5D0  cmp           rbx, #39 b#
>$0124B5D4  jne           $0124B5E6 offset NEAR
>$0124B5DA  mov           rbx, -1 d#
>$0124B5E1  jmp           $0124B615 offset NEAR
>$0124B5E6  cmp           rbx, #57 b#
>$0124B5EA  jne           $0124B5FC offset NEAR
>$0124B5F0  mov           rbx, -1 d#
>$0124B5F7  jmp           $0124B615 offset NEAR
>$0124B5FC  cmp           rbx, #27 b#
>$0124B600  jne           $0124B612 offset NEAR
>$0124B606  mov           rbx, 1 d#
>$0124B60D  jmp           $0124B615 offset NEAR
[...]

Hmm, I remember something about the number of branch prediction table
entries per cache line or something being limited.  Maybe that's the
reason for the low performance.

If I wanted to verify this, I would first use performance counters to
check if there is a high number of branch mispredictions, and if so,
insert some padding after the "jmp"s to see if it helps.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2012: http://www.euroforth.org/ef12/

[toc] | [prev] | [next] | [standalone]

#13771

From	Krishna Myneni <krishna.myneni@ccreweb.org>
Date	2012-07-10 05:10 -0700
Message-ID	<dc2a99c8-57b7-4153-bb72-06017be2a3f6@r3g2000yqh.googlegroups.com>
In reply to	#13754

On Jul 10, 2:28 am, m.a.m.hend...@tue.nl wrote:
> On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote:
> > m...@iae.nl (Marcel Hendrix) writes:
> >> The difference between branching and table lookup is unexpectedly large
> >> - I thought that memory was supposed to be slow. Apparently branching is
> >> becoming very expensive.
>
> > Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
> > Mispredicted branches are much slower (10-20 cycles).
>
> It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
>
...
> This makes the jumptable approach a clear favorite.
>
> -marcel


Lookup table or jump table?

Krishna

[toc] | [prev] | [next] | [standalone]

#13656

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-06 20:46 -0700
Message-ID	<7xzk7c6xo1.fsf@ruckus.brouhaha.com>
In reply to	#13595

"A. K." <akk@nospam.org> writes:
> then CASEs
> n = 123 or 312 or 231 --> eps = 1
> n = 132 or 321 or 213 --> eps = -1
> otherwise --> eps = 0

With locals:

: eps { i j k -- n }
   i j - j k - * k i - * 2/ ;

[toc] | [prev] | [next] | [standalone]

#13665

From	Krishna Myneni <krishna.myneni@ccreweb.org>
Date	2012-07-07 05:36 -0700
Message-ID	<5728b7da-2bef-4cad-bfd4-664ba077aafc@b20g2000yqg.googlegroups.com>
In reply to	#13656

On Jul 6, 10:46 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "A. K." <a...@nospam.org> writes:
> > then CASEs
> > n = 123 or 312 or 231 --> eps = 1
> > n = 132 or 321 or 213 --> eps = -1
> > otherwise --> eps = 0
>
> With locals:
>
> : eps { i j k -- n }
>    i j - j k - * k i - * 2/ ;

Nice.

KM

[toc] | [prev] | [next] | [standalone]

#13652

From	Krishna Myneni <krishna.myneni@ccreweb.org>
Date	2012-07-06 18:18 -0700
Message-ID	<1e0e4259-7a48-4ca9-8ec0-84c532adda6a@e37g2000yqn.googlegroups.com>
In reply to	#13587

On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
> What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> Thanks,
> Arnold

I would use a precomputed binary lookup table -- there are only 27
elements x 2 bits/element = 54 bits, so the entire tensor should fit
into a single 64-bit word. Now, you need a mapping function from {i,
j, k} to the start bit position, but that's easy.

Krishna

[toc] | [prev] | [next] | [standalone]

#13653

From	Krishna Myneni <krishna.myneni@ccreweb.org>
Date	2012-07-06 18:33 -0700
Message-ID	<706a4816-489e-44eb-acd0-5bc352bb298d@b20g2000yqg.googlegroups.com>
In reply to	#13652

On Jul 6, 8:18 pm, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
> On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
>
> > What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> > Thanks,
> > Arnold
>
> I would use a precomputed binary lookup table -- there are only 27
> elements x 2 bits/element = 54 bits, so the entire tensor should fit
> into a single 64-bit word. Now, you need a mapping function from {i,
> j, k} to the start bit position, but that's easy.
>
> Krishna

Spoke too soon. A.K.'s method is easier.

Krishna

[toc] | [prev] | [next] | [standalone]

#13668

From	Krishna Myneni <krishna.myneni@ccreweb.org>
Date	2012-07-07 07:28 -0700
Message-ID	<3d6d29ee-9f0d-4012-9cee-9be429d600ac@a34g2000yqi.googlegroups.com>
In reply to	#13652

On Jul 6, 8:18 pm, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
> On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
>
> > What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> > Thanks,
> > Arnold
>
> I would use a precomputed binary lookup table -- there are only 27
> elements x 2 bits/element = 54 bits, so the entire tensor should fit
> into a single 64-bit word. Now, you need a mapping function from {i,
> j, k} to the start bit position, but that's easy.
>
> Krishna

Just for fun, I went ahead and implemented a lookup table version
also.

\ WARNING: this version assumes a minimum 32 bit cell size and
\ little endian ordering! Also, i,j,k are zero-based!

\ We will use 3 cells to hold the tensor. Each cell will
\ represent one plane of the 3D tensor. k will index the cell.
\ i and j will index the bits in the cell.

create lctable  1639701 , 1316133 , 1381905 ,

\ i, j, k are zero based!

: lcsym4 ( i j k -- n )
    cells lctable + @  \ return the k^th tensor plane
    >r
    3 lshift over 2* + nip  \ compute the bit offset
    r> swap rshift 3 and 1-
;

[toc] | [prev] | [next] | [standalone]

#13670

From	Krishna Myneni <krishna.myneni@ccreweb.org>
Date	2012-07-07 08:02 -0700
Message-ID	<f2156282-b4ba-483d-8a7b-08e300cfcbc0@h10g2000yqn.googlegroups.com>
In reply to	#13668

On Jul 7, 9:28 am, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
> On Jul 6, 8:18 pm, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
>
> > On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
>
> > > What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> > > Thanks,
> > > Arnold
>
> > I would use a precomputed binary lookup table -- there are only 27
> > elements x 2 bits/element = 54 bits, so the entire tensor should fit
> > into a single 64-bit word. Now, you need a mapping function from {i,
> > j, k} to the start bit position, but that's easy.
>
> > Krishna
>
> Just for fun, I went ahead and implemented a lookup table version
> also.
>
> \ WARNING: this version assumes a minimum 32 bit cell size and
> \ little endian ordering! Also, i,j,k are zero-based!
>

The warning is incorrect. The code should work fine on a big endian
system as well, provided the cell size is at least 32 bits.

KM

[toc] | [prev] | [next] | [standalone]

#13678

From	Albert van der Horst <albert@spenarnc.xs4all.nl>
Date	2012-07-07 19:58 +0000
Message-ID	<m6t3i7.6ey@spenarnc.xs4all.nl>
In reply to	#13587

In article <jt3erh$cd2$1@dont-email.me>,
Arnold Doray  <invalid@invalid.com> wrote:
>What's the best way to implement the 3D permutation tensor E(I,J,K)?

Something along
: eps
        3DUP + + 3 MOD IF 3DROP 0 EXIT THEN
        / At this point we know that all indices are different.
        BEGIN DUP 3 <> WHILE ROT REPEAT
        DROP < 2* 1-
;


Come on boys, cases!

>
>Thanks,
>Arnold


--
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

[toc] | [prev] | [next] | [standalone]

#13706

From	Krishna Myneni <krishna.myneni@ccreweb.org>
Date	2012-07-08 05:45 -0700
Message-ID	<6774ceb8-dbe5-47db-890f-2f49eddcfe3b@d24g2000yqh.googlegroups.com>
In reply to	#13678

On Jul 7, 2:58 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
wrote:
> In article <jt3erh$cd...@dont-email.me>,
> Arnold Doray  <inva...@invalid.com> wrote:
>
> >What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> Something along
> : eps
>         3DUP + + 3 MOD IF 3DROP 0 EXIT THEN
>         / At this point we know that all indices are different.
>         BEGIN DUP 3 <> WHILE ROT REPEAT
>         DROP < 2* 1-
> ;
>

The above seems to go into an infinite loop for me.

: epsh ( i j k -- n )
        ( 3DUP) 2dup 2>r 2 pick 2r>
        + + 3 MOD IF ( 3DROP) 2drop drop 0 EXIT THEN
        \ At this point we know that all indices are different.
        BEGIN DUP 3 <> WHILE ROT REPEAT
        DROP < 2* 1-
;

1 1 1 epsh


KM

[toc] | [prev] | [next] | [standalone]

#13713

From	Albert van der Horst <albert@spenarnc.xs4all.nl>
Date	2012-07-08 20:05 +0000
Message-ID	<m6uyhe.n7x@spenarnc.xs4all.nl>
In reply to	#13706

In article <6774ceb8-dbe5-47db-890f-2f49eddcfe3b@d24g2000yqh.googlegroups.com>,
Krishna Myneni  <krishna.myneni@ccreweb.org> wrote:
>On Jul 7, 2:58=A0pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
>wrote:
>> In article <jt3erh$cd...@dont-email.me>,
>> Arnold Doray =A0<inva...@invalid.com> wrote:
>>
>> >What's the best way to implement the 3D permutation tensor E(I,J,K)?
>>
>> Something along
>> : eps
>> =A0 =A0 =A0 =A0 3DUP + + 3 MOD IF 3DROP 0 EXIT THEN
>> =A0 =A0 =A0 =A0 / At this point we know that all indices are different.
>> =A0 =A0 =A0 =A0 BEGIN DUP 3 <> WHILE ROT REPEAT
>> =A0 =A0 =A0 =A0 DROP < 2* 1-
>> ;
>>
>
>The above seems to go into an infinite loop for me.
>
>: epsh ( i j k -- n )
>        ( 3DUP) 2dup 2>r 2 pick 2r>
>        + + 3 MOD IF ( 3DROP) 2drop drop 0 EXIT THEN
>        \ At this point we know that all indices are different.
>        BEGIN DUP 3 <> WHILE ROT REPEAT
>        DROP < 2* 1-
>;
>
>1 1 1 epsh

Too hasty! I overlooked the possibility of 3 equals and didn't
test.

>
>
>KM
>

Groetjes Albert

--
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

[toc] | [prev] | [next] | [standalone]

Page 2 of 3 — ← Prev page 1 [2] 3 Next page →

csiph-web

Permutation Tensor

Contents

#13680

#13681

#13685

#13695

#13702

#13732

#13754

#13765

#13792

#13838

#13771

#13656

#13665

#13652

#13653

#13668

#13670

#13678

#13706

#13713