C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups comp.lang.forth
Subject C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Date 2026-01-24 11:28 +0000
Organization Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID <2026Jan24.122830@mips.complang.tuwien.ac.at>
References <69688c01$1@news.ausics.net> <2026Jan15.130413@mips.complang.tuwien.ac.at> <nnd$3a148ef5$137ee4b5@b1e8191b89e23503> <2026Jan16.183803@mips.complang.tuwien.ac.at> <nnd$7cecfc2e$135c60e6@11ec9b68cac8aeb0>


Hans Bezemer <the.beez.speaks@gmail.com> writes:
>I've done my thing, compiled 4tH with optimizations -O3 till -O0.
>I thought, let's make this simple and execute ALL benchmarks I got. Some 
>of them have become useless, though for the simple reason hardware has 
>become that much better.
>
>But still, here it is. Overall, the performance consistently 
>deteriorates, aka -O3 gives the best performance.

Which compiler and which hardware?

For a random program, I would expect higher optimization levels to
produce faster code.  For a Forth system and these recent gccs, the
auto-vectorization of adjacent memory accesses may lead to similar
problems as in the C bubble-sort benchmark.  In Gforth, this actually
happens unless we disable vectorization (which we normally do), and,
moreover, with the vectorized code, gcc introduces additional
inefficiencies (see below).

Here's the output of ./gforth-fast onebench.fs compiled from the
current development version with gcc-12.2 and running on a Ryzen 5800X
(numbers are times, lower is better):

 sieve bubble matrix   fib   fft gcc options
 0.025  0.023  0.013 0.033 0.016 -O2
 0.025  0.023  0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
 0.404  0.418  0.377 0.472 0.244 -O3 (with auto vectorization)
 0.145  0.122  0.124 0.122 0.073 gforth default, using --no-dynamic
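
To put the table in proportion, here is a small C sketch that turns
the times above into slowdown factors.  The numbers are copied from
the table; the array and function names are made up for this
illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Times copied from the table above (lower is better). */
enum { NBENCH = 5 };
static const char *const bench_name[NBENCH] =
    {"sieve", "bubble", "matrix", "fib", "fft"};
static const double t_default[NBENCH]   = {0.025, 0.023, 0.013, 0.037, 0.016};
static const double t_autovec[NBENCH]   = {0.404, 0.418, 0.377, 0.472, 0.244};
static const double t_nodynamic[NBENCH] = {0.145, 0.122, 0.124, 0.122, 0.073};

/* How many times slower `slow` is than `fast` on benchmark i. */
double slowdown(const double *slow, const double *fast, size_t i)
{
    return slow[i] / fast[i];
}

/* Print one line of slowdown factors per benchmark. */
void print_slowdowns(void)
{
    for (size_t i = 0; i < NBENCH; i++)
        printf("%-6s auto-vec/default %5.1f  no-dynamic/default %5.1f  "
               "auto-vec/no-dynamic %4.1f\n",
               bench_name[i],
               slowdown(t_autovec, t_default, i),
               slowdown(t_nodynamic, t_default, i),
               slowdown(t_autovec, t_nodynamic, i));
}
```

Calling print_slowdowns() from a main() shows that the auto-vectorized
build is more than ten times slower than the default on every
benchmark in the table.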

So how is the code different?  Here's the code for ROT:


-O3 (auto-vectorized)       -O3 -fno-tree-vec... -O2
add        $0x8,%rbx        add $0x8,%rbx        add $0x8,%rbx      
movq       0x8(%r10),%xmm1  mov 0x8(%r10),%rdx   mov 0x8(%r10),%rdx 
mov        0x10(%r10),%rcx  mov 0x10(%r10),%rax  mov 0x10(%r10),%rax
punpcklqdq %xmm1,%xmm1      mov %r13,0x8(%r10)   mov %r13,0x8(%r10) 
punpckhqdq %xmm1,%xmm0      mov %rdx,0x10(%r10)  mov %rdx,0x10(%r10)
movups     %xmm0,0x8(%r10)  mov %rax,%r13        mov %rax,%r13      
mov        (%rbx),%rax      mov (%rbx),%rax      mov (%rbx),%rax    
mov        %r14,0x8(%rsp)   jmp *%rax            jmp *%rax          
mov        %rax,%r11
mov        %r15,%r9
mov        %rcx,0x10(%rsp)
jmp        0x55bff2a58a99

So in this case -O3 without auto-vectorization generates the same code
as -O2.  Auto-vectorization, OTOH, replaces

mov 0x8(%r10),%rdx
mov 0x10(%r10),%rax

with

movq       0x8(%r10),%xmm1

and then performs the rotation with the punpck instructions, finally
storing two cells into memory with movups.  For some reason it also
separately loads 0x10(%r10) into %rcx (instead of extracting it from
%xmm1), and eventually stores it to 0x10(%rsp), which seems to be one
of the locations of the TOS.

I expect that gcc's auto-vectorization will do similar things to
primitives like ROT, 2!, and 2SWAP (all of which are affected in
Gforth) in other Forth systems with a C substrate, because they all
tend to access two (or more) adjacent cells.
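
To make the "adjacent cells" pattern concrete, here is a minimal C
sketch of how such primitives might look in a C-based engine.  This is
not taken from Gforth or 4tH; the Vm struct, the prim_* names, and the
layout (TOS cached in a variable, sp[0] holding the second item) are
assumptions for illustration only.

```c
#include <stdint.h>

typedef intptr_t Cell;

/* Hypothetical VM state: the top of stack (TOS) is cached in `tos`;
   sp[0] is the second item, sp[1] the third, and so on. */
typedef struct {
    Cell *sp;
    Cell tos;
} Vm;

/* ROT ( a b c -- b c a ): loads and stores the adjacent cells
   sp[0] and sp[1]. */
void prim_rot(Vm *vm)
{
    Cell b = vm->sp[0];
    Cell a = vm->sp[1];
    vm->sp[0] = vm->tos;   /* c becomes the second item */
    vm->sp[1] = b;         /* b becomes the third item */
    vm->tos = a;           /* a becomes the TOS */
}

/* 2SWAP ( a b c d -- c d a b ): touches sp[0], sp[1], sp[2]. */
void prim_2swap(Vm *vm)
{
    Cell d = vm->tos, c = vm->sp[0], b = vm->sp[1], a = vm->sp[2];
    vm->sp[2] = c;
    vm->sp[1] = d;
    vm->sp[0] = a;
    vm->tos = b;
}

/* 2! ( x1 x2 addr -- ): stores x2 at addr and x1 in the next cell.
   Refilling the TOS cache reads sp[2], so this sketch assumes the
   stack has at least one more cell of storage below the three
   arguments. */
void prim_2store(Vm *vm)
{
    Cell *addr = (Cell *)vm->tos;
    addr[0] = vm->sp[0];   /* x2 */
    addr[1] = vm->sp[1];   /* x1 */
    vm->tos = vm->sp[2];
    vm->sp += 3;
}
```

Each of these functions reads or writes two or more neighbouring
cells, which is the access pattern an auto-vectorizer may try to
combine into SSE loads and stores.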

But the big hit with the auto-vectorized code is not these changes,
but what happens at the end of the primitive: without
auto-vectorization there is the indirect jump of the threaded-code
dispatch, but with auto-vectorization it jumps to 0x55bff2a58a99:

   0x000055bff2a58a99 <gforth_engine2+153>:     movq   0x8(%rsp),%xmm0
   0x000055bff2a58a9f <gforth_engine2+159>:     movq   %r9,%xmm1
   0x000055bff2a58aa4 <gforth_engine2+164>:     movhps 0x8(%rsp),%xmm1
   0x000055bff2a58aa9 <gforth_engine2+169>:     movhps 0x10(%rsp),%xmm0
   0x000055bff2a58aae <gforth_engine2+174>:     movhlps %xmm0,%xmm5
   0x000055bff2a58ab1 <gforth_engine2+177>:     movq   %xmm0,%r14
   0x000055bff2a58ab6 <gforth_engine2+182>:     movq   %xmm1,%r15
   0x000055bff2a58abb <gforth_engine2+187>:     movhps %xmm1,0x18(%rsp)
   0x000055bff2a58ac0 <gforth_engine2+192>:     movq   %xmm5,%r8
   0x000055bff2a58ac5 <gforth_engine2+197>:     mov    %r15,%rdi
   0x000055bff2a58ac8 <gforth_engine2+200>:     mov    %r14,%rsi
   0x000055bff2a58acb <gforth_engine2+203>:     mov    %r8,%rcx
   0x000055bff2a58ace <gforth_engine2+206>:     jmp    *%r11

We can see here that, among other things, 0x10(%rsp) (the TOS) is
loaded into the upper half of %xmm0 and then moved through %xmm5 into
%r8 and then %rcx, so at the end the TOS resides in all those places
(the value that goes through %r14 into %rsi is the lower half of
%xmm0, i.e., 0x8(%rsp)).  And I see that other primitives expect the
TOS in some of those places, e.g. 1+:

-O3 (auto-vectorized)       -O3 -fno-tree-vec...
add $0x8,%rbx               add $0x8,%rbx  
lea 0x1(%r8),%rcx           add $0x1,%r13  
mov (%rbx),%rax             mov (%rbx),%rax
mov %r14,0x8(%rsp)          jmp *%rax      
mov %rax,%r11
mov %r15,%r9
mov %rcx,0x10(%rsp)
jmp 0x55bff2a58a99

Jumping to 0x55bff2a58a99 instead of performing an indirect jump
disables dynamic native code generation in Gforth and all the
optimizations that are based on it.  You can see in the --no-dynamic
line how much that costs.  The remaining factor of 3 is probably due
to the large number of additional instructions that are performed in
the auto-vectorized engine.
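
For readers who have not seen threaded-code dispatch in C, here is a
toy sketch of the "indirect jump at the end of each primitive" style.
It is not Gforth's engine: it is a minimal direct-threaded interpreter
using gcc's labels-as-values extension, and run_demo, NEXT, and the
do_* labels are made-up names.  Whether the compiler actually keeps a
separate indirect jump in every primitive, or merges them into a
shared block as in the auto-vectorized listing above, is up to the
optimizer.

```c
#include <stdint.h>

typedef intptr_t Cell;

/* NEXT: fetch the next threaded-code cell and jump to it.  Written
   out at the end of every primitive, so each one is meant to end in
   its own indirect jump. */
#define NEXT goto *(*ip++)

/* Runs a fixed threaded-code program computing (2 + 3) + 1. */
Cell run_demo(void)
{
    Cell stack[8];
    Cell *sp = stack + 8;      /* empty stack, grows downwards */
    Cell tos = 0;              /* cached top of stack */
    void *program[] = {
        &&do_lit, (void *)(Cell)2,
        &&do_lit, (void *)(Cell)3,
        &&do_add,
        &&do_one_plus,
        &&do_halt
    };
    void **ip = program;       /* threaded-code instruction pointer */

    NEXT;

do_lit:                        /* LIT ( -- n ): push inline literal */
    *--sp = tos;
    tos = (Cell)*ip++;
    NEXT;
do_add:                        /* + ( a b -- a+b ) */
    tos += *sp++;
    NEXT;
do_one_plus:                   /* 1+ ( n -- n+1 ) */
    tos += 1;
    NEXT;
do_halt:
    return tos;
}
```

run_demo() returns 6.  Comparing the assembly that gcc emits for this
function at different optimization levels is one way to check whether
the NEXT at the end of each primitive is still an indirect jump.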

What is the 4tH code for ROT with -O2 and -O3?

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
