Re: 3dup again

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups	comp.lang.forth
Subject	Re: 3dup again
Date	2025-10-04 08:04 +0000
Organization	Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID	<2025Oct4.100409@mips.complang.tuwien.ac.at> (permalink)
References	(1 earlier) <mk5b0uFubm8U1@mid.individual.net> <68de4aaa@news.ausics.net> <2025Oct2.224440@mips.complang.tuwien.ac.at> <nnd$36c0a5f5$2a9d9179@1179128ec025831e> <mk9i5hFld65U1@mid.individual.net>

Show all headers | View raw

minforth <minforth@gmx.net> writes:
>Am 03.10.2025 um 11:02 schrieb albert@spenarnc.xs4all.nl:
>> The problem with 3DUP is that it is actually used in context.
>> What is the data that is going to 3DUP ped? In my view this
>> amounts to double use of data that is in registers (32 in the riscv)
>> anyway, after an optimiser does his thing.
>> 
>
>Code inlining will mend it.

Inlining is important for Forth, but it does not make what has been
called an "analytical optimizer" unnecessary; on the contraray,
inlining increases the benefit we get from the analytical optimizer.
E.g., let's consider

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

: foo.1 3dup.1 + ! ;
: foo.2 3dup.2 + ! ;
: foo.3 3dup.3 + ! ;
: foo.4 3dup.4 + ! ;

The result produced by VFX64 is:

foo.1              foo.2              foo.3              foo.4
PUSH EBX           MOV  EDX, EBX      CALL 3DUP.3        MOV  EDX, EBX     
MOV  EBX, [ESP]    ADD  EBX, [EBP]    ADD  EBX, [EBP]    ADD  EBX, [EBP]   
POP  EDX           MOV  ECX, [EBP+04] MOV  EDX, [EBP+04] MOV  ECX, [EBP+04]
ADD  EDX, [EBP]    MOV  0 [EBX], ECX  MOV  0 [EBX], EDX  MOV  0 [EBX], ECX 
MOV  ECX, [EBP+04] MOV  EBX, EDX      MOV  EBX, [EBP+08] MOV  EBX, EDX     
MOV  0 [EDX], ECX  NEXT,              LEA  EBP, [EBP+0C] NEXT,             
NEXT,                                 NEXT,             

VFX is only analytical about the data stack, and as a consequence, the
implementations of 3dup that only use the data stack work best.  When
the return stack is used, as in 3dup.1/foo.1, VFX produces
instructions (PUSH for >R, MOV ..., [ESP] for R@ and POP for R>) for
the return-stack operations.  When locals are used, VFX actually
disables inlining and just calls 3DUP.3.

Other Forth systems make too little use of inlining, and I have to
resort to macros to simulate it.  We cannot use proper macros for
3dup.3 (the locals-using variant), so I used EVALUATE-based macros;
this is just for experimental use, not for production, don't do this
at home:-)

Let's see what VFX64 produces for FOO.3 with this:

FOO.3 
( 080C0C50    8BD4 )                  MOV       EDX, ESP
( 080C0C52    FF7504 )                PUSH      [EBP+04]
( 080C0C55    FF7500 )                PUSH      [EBP]
( 080C0C58    53 )                    PUSH      EBX
( 080C0C59    52 )                    PUSH      EDX
( 080C0C5A    57 )                    PUSH      EDI
( 080C0C5B    8BFC )                  MOV       EDI, ESP
( 080C0C5D    81EC00000000 )          SUB       ESP, 00000000
( 080C0C63    8B5D08 )                MOV       EBX, [EBP+08]
( 080C0C66    8D6D0C )                LEA       EBP, [EBP+0C]
( 080C0C69    8B5708 )                MOV       EDX, [EDI+08]
( 080C0C6C    03570C )                ADD       EDX, [EDI+0C]
( 080C0C6F    8B4F08 )                MOV       ECX, [EDI+08]
( 080C0C72    8B470C )                MOV       EAX, [EDI+0C]
( 080C0C75    8D6DEC )                LEA       EBP, [EBP+-14]
( 080C0C78    894D04 )                MOV       [EBP+04], ECX
( 080C0C7B    894508 )                MOV       [EBP+08], EAX
( 080C0C7E    8B4F10 )                MOV       ECX, [EDI+10]
( 080C0C81    894D0C )                MOV       [EBP+0C], ECX
( 080C0C84    895D10 )                MOV       [EBP+10], EBX
( 080C0C87    8BDA )                  MOV       EBX, EDX
( 080C0C89    8B5710 )                MOV       EDX, [EDI+10]
( 080C0C8C    895500 )                MOV       [EBP], EDX
( 080C0C8F    8B5500 )                MOV       EDX, [EBP]
( 080C0C92    8913 )                  MOV       0 [EBX], EDX
( 080C0C94    8B5D04 )                MOV       EBX, [EBP+04]
( 080C0C97    8D6D08 )                LEA       EBP, [EBP+08]
( 080C0C9A    8B6704 )                MOV       ESP, [EDI+04]
( 080C0C9D    8B3F )                  MOV       EDI, 0 [EDI]
( 080C0C9F    C3 )                    NEXT,

So inlining did not mend that.

Here's what lxf produces:

foo.1              foo.2              foo.3              foo.4
mov eax , ebx      mov eax , ebx      mov eax , ebx      mov eax , ebx     
add eax , [ebp]    add eax , [ebp]    add eax , [ebp]    add eax , [ebp]   
mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h]
mov [eax] , ecx    mov [eax] , ecx    mov [eax] , ecx    mov [eax] , ecx   
ret near           ret near           ret near           ret near          

So, because lxf is analytical about the return stack (and, through
that, about locals), inlining produces the same very good code in all
these cases.

You may notice that lxf produces a register-register move less than
VFX does for FOO.2/FOO.4.  That's because VFX decided to modify the
TOS register (and has to restore it later), whereas lxf decided to
modify a copy of that register.  One would have to make additional
observations to determine if lxf was just lucky here or if it
consistently makes the right decision in such cases.

And here's the code that gforth-fast (which does not have an
analytical optimizer) produces:

foo.1              foo.2              foo.3                foo.4
>r    1->0         third    1->1      >l >l 1->1           dup    1->1       
  mov -8[r14],r13    mov [r10],r13    >l    1->1             mov [r10],r13   
  sub r14,$08        sub r10,$08        mov -$08[rbp],r13    sub r10,$08     
2dup    0->2         mov r13,$18[r10]   mov rdx,$08[r10]   2over    1->3     
  mov r13,$10[r10] third    1->2        mov rax,rbp          mov r15,$18[r10]
  mov r15,$08[r10]   mov r15,$10[r10]   add r10,$10          mov r9,$10[r10] 
i    2->3          third    2->3        lea rbp,-$10[rbp]  rot    3->3       
  mov r9,[r14]       mov r9,$08[r10]    mov -$10[rax],rdx    mov rax,r13     
-rot    3->2       +    3->2            mov r13,[r10]        mov r13,r15     
  mov [r10],r9       add r15,r9       >l @local0 1->1        mov r15,r9      
  sub r10,$08      !    2->0          @local0    1->1        mov r9,rax      
r>    2->3           mov [r15],r13      mov rax,rbp        +    3->2         
  mov r9,[r14]     ;s    0->1           lea rbp,-$08[rbp]    add r15,r9      
  add r14,$08        mov r13,$08[r10]   mov -$08[rax],r13  !    2->0         
+    3->2            add r10,$08      @local1    1->2        mov [r15],r13   
  add r15,r9         mov rbx,[r14]      mov r15,$08[rbp]   ;s    0->1        
!    2->0            add r14,$08      @local2    2->3        mov r13,$08[r10]
  mov [r15],r13      mov rax,[rbx]      mov r9,$10[rbp]      add r10,$08     
;s    0->1           jmp eax          @local0    3->1        mov rbx,[r14]   
  mov r13,$08[r10]                      mov -$10[r10],r9     add r14,$08     
  add r10,$08                           sub r10,$18          mov rax,[rbx]   
  mov rbx,[r14]                         mov $10[r10],r15     jmp eax         
  add r14,$08                           mov $18[r10],r13 
  mov rax,[rbx]                         mov r13,$00[rbp] 
  jmp eax                             @local1    1->2    
                                        mov r15,$08[rbp] 
                                      @local2    2->3    
                                        mov r9,$10[rbp]  
                                      +    3->2          
                                        add r15,r9       
                                      !    2->0          
                                        mov [r15],r13    
                                      lit    0->1        
                                      #24                
                                        mov r13,$60[rbx] 
                                      lp+!    1->1       
                                        add r10,$08      
                                        add rbp,r13      
                                        mov r13,[r10]    
                                      ;s    1->1         
                                        mov rbx,[r14]    
                                        add r14,$08      
                                        mov rax,[rbx]    
                                        jmp eax          

Here inlining helps a little, but the disadvantages of the approach
are still obvious.  With less optimization (e.g., no stack caching),
inlining would have helped even less.

And while we are at it, here's SwiftForth:


foo.1             foo.2                 foo.3              foo.4
RBX PUSH          -8 [RBP] RBP LEA      -8 [RBP] RBP LEA   -8 [RBP] RBP LEA
0 [RBP] RBX MOV   RBX 0 [RBP] MOV       RBX 0 [RBP] MOV    RBX 0 [RBP] MOV
8 [RBP] RBP LEA   10 [RBP] RBX MOV      18 # EBX MOV       -10 [RBP] RBP LEA
-10 [RBP] RBP LEA -8 [RBP] RBP LEA      LSPACE CALL        RBX 8 [RBP] MOV
RBX 8 [RBP] MOV   RBX 0 [RBP] MOV       RBX 10 [R13] MOV   20 [RBP] RAX MOV
10 [RBP] RAX MOV  10 [RBP] RBX MOV      0 [RBP] RBX MOV    RAX 0 [RBP] MOV
RAX 0 [RBP] MOV   -8 [RBP] RBP LEA      8 [RBP] RBP LEA    18 [RBP] RBX MOV
-8 [RBP] RBP LEA  RBX 0 [RBP] MOV       RBX 8 [R13] MOV    RBX RCX MOV
RBX 0 [RBP] MOV   10 [RBP] RBX MOV      0 [RBP] RBX MOV    8 [RBP] RBX MOV
0 [RSP] RBX MOV   0 [RBP] RAX MOV       8 [RBP] RBP LEA    0 [RBP] RAX MOV
RBX RCX MOV       8 [RBP] RCX MOV       RBX 0 [R13] MOV    RAX 8 [RBP] MOV
8 [RBP] RAX MOV   RCX 0 [RBX] [RAX] MOV 0 [RBP] RBX MOV    RCX 0 [RBP] MOV
0 [RBP] RBX MOV   10 [RBP] RBX MOV      8 [RBP] RBP LEA    0 [RBP] RAX MOV
RAX 0 [RBP] MOV   18 [RBP] RBP LEA      -8 [RBP] RBP LEA   8 [RBP] RCX MOV
RCX 8 [RBP] MOV   RET                   RBX 0 [RBP] MOV    RCX 0 [RBX] [RAX] MOV
RAX POP                                 0 [R13] RBX MOV    10 [RBP] RBX MOV
RAX RBX ADD                             -8 [RBP] RBP LEA   18 [RBP] RBP LEA
0 [RBP] RAX MOV                         RBX 0 [RBP] MOV    RET
RAX 0 [RBX] MOV                         8 [R13] RBX MOV
8 [RBP] RBX MOV                         -8 [RBP] RBP LEA
10 [RBP] RBP LEA                        RBX 0 [RBP] MOV
RET                                     10 [R13] RBX MOV
                                        -8 [RBP] RBP LEA
                                        RBX 0 [RBP] MOV
                                        0 [R13] RBX MOV
                                        -8 [RBP] RBP LEA
                                        RBX 0 [RBP] MOV
                                        8 [R13] RBX MOV
                                        -8 [RBP] RBP LEA
                                        RBX 0 [RBP] MOV
                                        10 [R13] RBX MOV
                                        0 [RBP] RAX MOV
                                        8 [RBP] RCX MOV
                                        RCX 0 [RBX] [RAX] MOV
                                        10 [RBP] RBX MOV
                                        18 [RBP] RBP LEA
                                        RET


And finally, iForth:

foo.1/foo.4              foo.2                      foo.3
pop  rbx                 mov rbx, [rsp #16 +] qword pop  rbx
pop  rdi                 mov  rcx, rbx              lea rsi, [rsi #-16 +] qword
pop  rax                 mov rbx, [rsp 8 +] qword   mov [esi] dword, rbx
mov [edi ebx*1]dword,rax push rcx                   pop  rbx
push rax                 mov  rcx, rbx              lea rsi, [rsi #-16 +] qword
push rdi                 mov rbx, [rsp 8 +] qword   mov [esi] dword, rbx
push rbx                 pop  rdi                   pop  rbx
;                        mov [ecx ebx*1] dword, rdi lea rsi, [rsi #-16 +] qword
                         ;                          mov [esi] dword, rbx
                                                    mov rbx, [rsi #16 +] qword
                                                    add rbx, [rsi #32 +] qword
                                                    mov rdi, [rsi] qword
                                                    mov rax, [rsi #16 +] qword
                                                    mov rdx, [rsi #32 +] qword
                                                    mov [ebx] dword, rdi
                                                    push rdi
                                                    push rax
                                                    push rdx
                                                    add  rsi, #48 b#
                                                    ;

It's interesting that Gforth produced the same code for FOO.1 and
FOO.4, but different code for FOO.2.  Both variants are suboptimal
IMO, because the contain unnecessary pushes.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
 New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/

Back to comp.lang.forth | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

Generating a random sequence of Forth words anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-09-30 16:33 +0000
  Re: Generating a random sequence of Forth words minforth <minforth@gmx.net> - 2025-10-01 11:20 +0200
    Re: Generating a random sequence of Forth words anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-01 17:10 +0000
  Re: Generating a random sequence of Forth words Hans Bezemer <the.beez.speaks@gmail.com> - 2025-10-01 17:11 +0200
    Re: Generating a random sequence of Forth words minforth <minforth@gmx.net> - 2025-10-01 20:42 +0200
      Re: Generating a random sequence of Forth words dxf <dxforth@gmail.com> - 2025-10-02 19:49 +1000
        Re: Generating a random sequence of Forth words albert@spenarnc.xs4all.nl - 2025-10-02 13:07 +0200
          Re: Generating a random sequence of Forth words dxf <dxforth@gmail.com> - 2025-10-03 18:22 +1000
        3dup again (was: Generating a random sequence of Forth words) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-02 20:44 +0000
          Re: 3dup again (was: Generating a random sequence of Forth words) albert@spenarnc.xs4all.nl - 2025-10-03 11:02 +0200
            Re: 3dup again minforth <minforth@gmx.net> - 2025-10-03 11:09 +0200
              Re: 3dup again anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-04 08:04 +0000
          Re: 3dup again Hans Bezemer <the.beez.speaks@gmail.com> - 2025-10-05 11:29 +0200
  Re: Generating a random sequence of Forth words antispam@fricas.org (Waldek Hebisch) - 2025-10-15 19:19 +0000
    Re: Generating a random sequence of Forth words anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-24 15:55 +0000

csiph-web