
EuroForth 2025 preliminary proceedings

Started by	dxf <dxforth@gmail.com>
First post	2026-01-15 17:41 +1100
Last post	2026-01-20 22:17 +0000
Articles	20 on this page of 30 — 7 participants


  EuroForth 2025 preliminary proceedings dxf <dxforth@gmail.com> - 2026-01-15 17:41 +1100
    Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-15 12:04 +0000
      Re: EuroForth 2025 preliminary proceedings Hans Bezemer <the.beez.speaks@gmail.com> - 2026-01-16 15:25 +0100
        Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-16 17:38 +0000
          Re: EuroForth 2025 preliminary proceedings Hans Bezemer <the.beez.speaks@gmail.com> - 2026-01-22 16:51 +0100
            C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-24 11:28 +0000
              Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-24 16:47 +0000
                Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) peter <peter.noreply@tin.it> - 2026-01-25 23:31 +0100
                  Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-26 19:24 +0000
                    Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) peter <peter.noreply@tin.it> - 2026-01-27 15:44 +0100
                      Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-29 18:27 +0000
                        Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) albert@spenarnc.xs4all.nl - 2026-01-30 13:20 +0100
                          Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-30 18:00 +0000
        Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-16 23:10 -0800
          Re: EuroForth 2025 preliminary proceedings Hans Bezemer <the.beez.speaks@gmail.com> - 2026-01-17 16:58 +0100
            Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-17 20:21 -0800
              Re: EuroForth 2025 preliminary proceedings Hans Bezemer <the.beez.speaks@gmail.com> - 2026-01-18 15:26 +0100
            Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-18 22:17 +0000
          Re: EuroForth 2025 preliminary proceedings albert@spenarnc.xs4all.nl - 2026-01-18 16:34 +0100
            Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-20 00:35 -0800
              Re: EuroForth 2025 preliminary proceedings albert@spenarnc.xs4all.nl - 2026-01-20 12:12 +0100
              Coroutines in Forth Gerry Jackson <do-not-use@swldwa.uk> - 2026-04-02 20:59 +0100
                Re: Coroutines in Forth Paul Rubin <no.email@nospam.invalid> - 2026-04-04 18:02 -0700
                  Re: Coroutines in Forth Paul Rubin <no.email@nospam.invalid> - 2026-04-04 21:21 -0700
          Re: EuroForth 2025 preliminary proceedings peter <peter.noreply@tin.it> - 2026-01-19 23:26 +0100
            Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-19 15:22 -0800
              Re: EuroForth 2025 preliminary proceedings peter <peter.noreply@tin.it> - 2026-01-20 10:44 +0100
              Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-20 22:36 +0000
            Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-20 00:33 -0800
            Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-20 22:17 +0000


#134509 — EuroForth 2025 preliminary proceedings

From	dxf <dxforth@gmail.com>
Date	2026-01-15 17:41 +1100
Subject	EuroForth 2025 preliminary proceedings
Message-ID	<69688c01$1@news.ausics.net>

As I had trouble finding it, perhaps others too.  Here's the link:

http://www.euroforth.org/ef25/papers/

There is no link from the main page.

Someone had referenced Nick Nelson's 'Forth 2025' paper and I was curious
to read it.


#134510

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2026-01-15 12:04 +0000
Message-ID	<2026Jan15.130413@mips.complang.tuwien.ac.at>
In reply to	#134509

dxf <dxforth@gmail.com> writes:
>As I had trouble finding it, perhaps others too.  Here's the link:
>
>http://www.euroforth.org/ef25/papers/
>
>There is no link from the main page.

Thank you.  As it happens, yesterday I created the post-conference
proceedings that includes a late paper and the slides that were
provided by their authors (not that many; apparently many authors are
content with the prospect of their presentation being preserved on
video).  I have now updated various links for the post-conference
state (link from www.euroforth.org to proceedings, and from the
proceedings to the euro.theforth.net page).

Unfortunately, the videos are not yet available.  Gerald Wodni has not
yet had the time to process them.  He mentioned something like "after
January" or somesuch.

I think that submitting slides has not just the advantage that they
are published earlier in this case (or at all in the 2023 case, where
the audio was so problematic that most videos were not published), but
also that one can read them faster than watch a video; the videos have
the audio track and interactive demos in addition to the text and
graphics of the slides, though.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/


#134511

From	Hans Bezemer <the.beez.speaks@gmail.com>
Date	2026-01-16 15:25 +0100
Message-ID	<nnd$3a148ef5$137ee4b5@b1e8191b89e23503>
In reply to	#134510

On 15-01-2026 13:04, Anton Ertl wrote:

A few observations concerning the IMHO most interesting paper, 
"Code-Copying Compilation in Production":

1. Code copying indeed makes a big difference, overall I estimate about 
twice as fast;
2. The performance of VFX Forth continues to impress me, keeping up 
nicely, even with C compiled code;
3. Commercial compilers (partly) using conventional compilers (see TF, 
fig. 4.7) - that was new to me;
4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me 
too. I might experiment with that one;
5. I added GCC extension support to 4tH in version 3.62.0. At the time, 
it improved performance by about 25%. By accident I found out that this 
is no longer true: the switch()-based engine is now faster. I didn't know 
there had been changes in that regard to GCC.
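
For readers who have not seen the two dispatch styles from item 5 side by side, here is a minimal sketch. It is not 4tH's engine: the three opcodes, the program encoding and the function names are assumptions made up for illustration. The second function uses the GNU C labels-as-values extension, which clang also accepts.

```c
#include <stddef.h>

/* Hypothetical opcodes for illustration; not 4tH's instruction set. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* switch()-based dispatch: one central branch decodes every opcode. */
long run_switch(const unsigned char *ip)
{
    long stack[16];
    size_t sp = 0;              /* index of the next free stack slot */
    for (;;) {
        switch (*ip++) {
        case OP_PUSH1: stack[sp++] = 1; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT:  return stack[sp - 1];
        }
    }
}

/* Labels-as-values dispatch (GNU C extension): every primitive ends
   in its own indirect jump to the next primitive. */
long run_goto(const unsigned char *ip)
{
    static void *const labels[] = { &&push1, &&add, &&halt };
    long stack[16];
    size_t sp = 0;
#define NEXT goto *labels[*ip++]
    NEXT;
push1: stack[sp++] = 1; NEXT;
add:   sp--; stack[sp - 1] += stack[sp]; NEXT;
halt:  return stack[sp - 1];
#undef NEXT
}
```

Which of the two is faster depends on the compiler version and the hardware, so it is worth measuring both rather than assuming.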

Hans Bezemer



#134512

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2026-01-16 17:38 +0000
Message-ID	<2026Jan16.183803@mips.complang.tuwien.ac.at>
In reply to	#134511

Hans Bezemer <the.beez.speaks@gmail.com> writes:
>On 15-01-2026 13:04, Anton Ertl wrote:
>
>A few observations concerning the IMHO most interesting paper, 
>"Code-Copying Compilation in Production":
...
>3. Commercial compilers (partly) using conventional compilers (see TF, 
>fig. 4.7) - that was new to me;

All Forth compilers I know work at the text interpretation level as
the "Forth compiler" of Thinking Forth, Figure 4.7.

>4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me 
>too. I might experiment with that one;

I have analyzed it for bubblesort.  There the problem is that gcc -O3
auto-vectorizes the pair of loads and the pair of stores (when the two
elements are swapped).  As a result, if a pair is stored in one
iteration, the next iteration loads a pair that overlaps the
previously stored pair.  This means that the hardware cannot use its
fast path in store-to-load forwarding, and leads to a huge slowdown.
For a benchmark that has been around for over 40 years.

In addition, the code generated by gcc -O3 also executes several
additional instructions per iteration, so I doubt that it would be
faster even if the store-to-load forwarding problem did not exist.
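
To make the access pattern concrete, here is a textbook bubble sort in C. It is a sketch under the assumption that the benchmark's inner loop has this general shape; it is not the benchmark source itself, and the function name is made up.

```c
#include <stddef.h>

/* Classic bubble sort. In the inner loop, a swap stores a[j] and a[j+1];
   the next iteration then loads a[j+1] and a[j+2], which overlaps the
   pair that was just stored by one element. */
void bubble_sort(int *a, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        for (size_t j = 0; j + 1 < n - i; j++) {
            if (a[j] > a[j + 1]) {
                int t = a[j];        /* the pair of loads ...      */
                a[j] = a[j + 1];     /* ... and the pair of stores */
                a[j + 1] = t;
            }
        }
    }
}
```

If the compiler turns the two stores into one wide store and the two loads into one wide load, each wide load straddles the previous wide store, which is the overlap described above.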

For fib, I have also looked at the generated code, but have not
understood it well enough to see why the code generated by gcc -O3 is
slower.

- anton


#134531

From	Hans Bezemer <the.beez.speaks@gmail.com>
Date	2026-01-22 16:51 +0100
Message-ID	<nnd$7cecfc2e$135c60e6@11ec9b68cac8aeb0>
In reply to	#134512

On 16-01-2026 18:38, Anton Ertl wrote:

On 17-01-2026 16:58, Hans Bezemer wrote:

I've done my thing and compiled 4tH at every optimization level from -O3 
down to -O0. I thought, let's make this simple and execute ALL the 
benchmarks I've got. Some of them have become useless, though, for the 
simple reason that hardware has become that much faster.

But still, here it is. Overall, performance deteriorates consistently as 
the optimization level drops, i.e. -O3 gives the best performance. There 
are a few minor glitches, some due to random benchmark data.

For those curious, this is a European CSV with all the data. BTW, you 
can find all benchmarks here: 
https://sourceforge.net/p/forth-4th/code/HEAD/tree/trunk/4th.src/bench/

Hans Bezemer

---8<---
Benchmark;-O3;-O2;-O1;-O0
bench.4th;6.79;6.36;6.68;6.33
benchm.4th;1.21;1.66;1.86;2.8
benchxls.4th;0.06;0.08;0.08;0.12
bubble.4th;0.69;0.95;0.96;1.72
bytesiev.4th;0.01;0.01;0.01;0.02
countbit.4th;3.52;4.76;5.02;8.01
cowell.4th;15.15;20.2;18.91;31.29
fib.4th;0.79;1.02;1.02;1.72
isortest.4th;0.23;0.33;0.31;0.56
matrix.4th;0.22;0.31;0.3;0.51
misty.4th;0.58;0.84;1.01;1.59
pforth.4th;10.47;13.55;14.42;22.68
prims.4th;5.96;8;8.59;14.28
simple.4th;0.5;0.7;0.82;1.21
sortest.4th;140.96;163.68;150.17;270.87
thread.4th;0.35;0.41;0.49;0.7
---8<---



#134532 — C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2026-01-24 11:28 +0000
Subject	C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Message-ID	<2026Jan24.122830@mips.complang.tuwien.ac.at>
In reply to	#134531

Hans Bezemer <the.beez.speaks@gmail.com> writes:
>I've done my thing, compiled 4tH with optimizations -O3 till -O0.
>I thought, let's make this simple and execute ALL benchmarks I got. Some 
>of them have become useless, though for the simple reason hardware has 
>become that much better.
>
>But still, here it is. Overall, the performance consistently 
>deteriorates, aka -O3 gives the best performance.

Which compiler and which hardware?

For a random program, I would expect higher optimization levels to
produce faster code.  For a Forth system and these recent gccs, the
auto-vectorization of adjacent memory accesses may lead to similar
problems as in the C bubble-sort benchmark.  In Gforth, this actually
happens unless we disable vectorization (which we normally do), and,
moreover, with the vectorized code, gcc introduces additional
inefficiencies (see below).

Here's the output of ./gforth-fast onebench.fs compiled from the
current development version with gcc-12.2 and running on a Ryzen 5800X
(numbers are times, lower is better):

 sieve bubble matrix   fib   fft gcc options
 0.025  0.023  0.013 0.033 0.016 -O2
 0.025  0.023  0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
 0.404  0.418  0.377 0.472 0.244 -O3 (with auto vectorization)
 0.145  0.122  0.124 0.122 0.073 gforth default, using --no-dynamic

So how is the code different?  Here's the code for ROT:

-O3 (auto-vectorized)       -O3 -fno-tree-vec... -O2
add        $0x8,%rbx        add $0x8,%rbx        add $0x8,%rbx      
movq       0x8(%r10),%xmm1  mov 0x8(%r10),%rdx   mov 0x8(%r10),%rdx 
mov        0x10(%r10),%rcx  mov 0x10(%r10),%rax  mov 0x10(%r10),%rax
punpcklqdq %xmm1,%xmm1      mov %r13,0x8(%r10)   mov %r13,0x8(%r10) 
punpckhqdq %xmm1,%xmm0      mov %rdx,0x10(%r10)  mov %rdx,0x10(%r10)
movups     %xmm0,0x8(%r10)  mov %rax,%r13        mov %rax,%r13      
mov        (%rbx),%rax      mov (%rbx),%rax      mov (%rbx),%rax    
mov        %r14,0x8(%rsp)   jmp *%rax            jmp *%rax          
mov        %rax,%r11
mov        %r15,%r9
mov        %rcx,0x10(%rsp)
jmp        0x55bff2a58a99

So in this case -O3 without auto-vectorization generates the same code
as -O2.  Auto-vectorization, OTOH, replaces

mov 0x8(%r10),%rdx
mov 0x10(%r10),%rax

with

movq       0x8(%r10),%xmm1

and then performs the rotation with the punpck instructions, finally
storing two cells into memory with movups.  For some reason it also
separately loads 0x10(%r10) into %rcx (instead of extracting it from
%xmm1), and eventually stores it to 0x10(%rsp), which seems to be one
of the locations of the TOS.

I expect that gcc's auto-vectorization will do similar things to
primitives like ROT 2! 2SWAP (all of which are hit in gforth) in other
Forth systems with a C substrate, because they all tend to access two
(or more) adjacent cells.
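
As an illustration of why such primitives invite vectorization, here is a sketch of ROT and 2SWAP in plain C. The Stack struct and the function names are hypothetical; only the layout (top of stack cached in a variable, deeper items at sp[1], sp[2], ...) follows the assembly shown above.

```c
typedef long Cell;

/* Hypothetical VM state: sp points at the slot reserved for the top of
   stack, whose value is cached in tos; deeper items are at sp[1], sp[2]. */
typedef struct { Cell *sp; Cell tos; } Stack;

/* ROT ( a b c -- b c a ): loads two adjacent cells, stores two. */
void prim_rot(Stack *s)
{
    Cell b = s->sp[1];
    Cell a = s->sp[2];
    s->sp[1] = s->tos;   /* c */
    s->sp[2] = b;
    s->tos = a;
}

/* 2SWAP ( a b c d -- c d a b ): loads three adjacent cells, stores three. */
void prim_2swap(Stack *s)
{
    Cell c = s->sp[1], b = s->sp[2], a = s->sp[3];
    Cell d = s->tos;
    s->sp[3] = c;
    s->sp[2] = d;
    s->sp[1] = a;
    s->tos = b;
}
```

In both cases the loads and stores touch neighbouring cells, which is the pattern the vectorizer looks for.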

But the big hit with the auto-vectorized code is not these changes,
but what happens at the end of the primitive: without
auto-vectorization there is the indirect jump of the threaded-code
dispatch, but with auto-vectorization it jumps to 0x55bff2a58a99:

   0x000055bff2a58a99 <gforth_engine2+153>:     movq   0x8(%rsp),%xmm0
   0x000055bff2a58a9f <gforth_engine2+159>:     movq   %r9,%xmm1
   0x000055bff2a58aa4 <gforth_engine2+164>:     movhps 0x8(%rsp),%xmm1
   0x000055bff2a58aa9 <gforth_engine2+169>:     movhps 0x10(%rsp),%xmm0
   0x000055bff2a58aae <gforth_engine2+174>:     movhlps %xmm0,%xmm5
   0x000055bff2a58ab1 <gforth_engine2+177>:     movq   %xmm0,%r14
   0x000055bff2a58ab6 <gforth_engine2+182>:     movq   %xmm1,%r15
   0x000055bff2a58abb <gforth_engine2+187>:     movhps %xmm1,0x18(%rsp)
   0x000055bff2a58ac0 <gforth_engine2+192>:     movq   %xmm5,%r8
   0x000055bff2a58ac5 <gforth_engine2+197>:     mov    %r15,%rdi
   0x000055bff2a58ac8 <gforth_engine2+200>:     mov    %r14,%rsi
   0x000055bff2a58acb <gforth_engine2+203>:     mov    %r8,%rcx
   0x000055bff2a58ace <gforth_engine2+206>:     jmp    *%r11

We can see here that, among other things, 0x10(%rsp) (the TOS) is
loaded into %xmm0 and then moved through %xmm5 into %r8 and then %rcx,
as well as through %r14 into %rsi, so at the end TOS resides in all
those places.  And I see that other primitives expect the TOS in some
of those places, e.g. 1+:

-O3 (auto-vectorized)       -O3 -fno-tree-vec...
add $0x8,%rbx               add $0x8,%rbx  
lea 0x1(%r8),%rcx           add $0x1,%r13  
mov (%rbx),%rax             mov (%rbx),%rax
mov %r14,0x8(%rsp)          jmp *%rax      
mov %rax,%r11
mov %r15,%r9
mov %rcx,0x10(%rsp)
jmp 0x55bff2a58a99

Jumping to 0x55bff2a58a99 instead of performing an indirect jump
disables dynamic native code generation in Gforth and all the
optimizations that are based on it.  You can see in the --no-dynamic
line how much that costs.  The remaining factor of 3 is probably due
to the large number of additional instructions that are performed in
the auto-vectorized engine.
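
The two factors can be checked against the gcc-12.2 table above. The sketch below only copies those timings into arrays (column order sieve, bubble, matrix, fib, fft); the array and function names are made up for illustration.

```c
/* Timings copied from the gcc-12.2 table above (lower is better). */
static const double o3_vec[5]     = { 0.404, 0.418, 0.377, 0.472, 0.244 };
static const double no_dynamic[5] = { 0.145, 0.122, 0.124, 0.122, 0.073 };
static const double dflt[5]       = { 0.025, 0.023, 0.013, 0.037, 0.016 };

/* Slowdown from losing dynamic native code generation alone. */
double cost_no_dynamic(int i) { return no_dynamic[i] / dflt[i]; }

/* Remaining slowdown of the auto-vectorized engine beyond that. */
double remaining_factor(int i) { return o3_vec[i] / no_dynamic[i]; }
```

For sieve, for example, this gives 0.145/0.025 = 5.8 for --no-dynamic and 0.404/0.145, about 2.8, for the rest; the other columns give remaining factors between roughly 3.0 and 3.9.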

What is the 4th code for ROT with -O2 and -O3?

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/


#134533 — Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2026-01-24 16:47 +0000
Subject	Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Message-ID	<2026Jan24.174716@mips.complang.tuwien.ac.at>
In reply to	#134532

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Hans Bezemer <the.beez.speaks@gmail.com> writes:
>>I've done my thing, compiled 4tH with optimizations -O3 till -O0.
>>I thought, let's make this simple and execute ALL benchmarks I got. Some 
>>of them have become useless, though for the simple reason hardware has 
>>become that much better.
>>
>>But still, here it is. Overall, the performance consistently 
>>deteriorates, aka -O3 gives the best performance.
>
>Which compiler and which hardware?
>
>For a random program, I would expect higher optimization levels to
>produce faster code.  For a Forth system and these recent gccs, the
>auto-vectorization of adjacent memory accesses may lead to similar
>problems as in the C bubble-sort benchmark.  In Gforth, this actually
>happens unless we disable vectorization (which we normally do), and,
>moreover, with the vectorized code, gcc introduces additional
>inefficiencies (see below).
>
>Here's the output of ./gforth-fast onebench.fs compiled from the
>current development version with gcc-12.2 and running on a Ryzen 5800X
>(numbers are times, lower is better):
>
> sieve bubble matrix   fib   fft gcc options
> 0.025  0.023  0.013 0.033 0.016 -O2
> 0.025  0.023  0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
> 0.404  0.418  0.377 0.472 0.244 -O3 (with auto vectorization)
> 0.145  0.122  0.124 0.122 0.073 gforth default, using --no-dynamic

I have now also tried it with gcc-14.2, and that produces better code.
Results from a Xeon E-2388G (Rocket Lake):

 sieve bubble matrix   fib   fft gcc options
 0.032  0.032  0.015 0.037 0.014 -O2 
 0.035  0.032  0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
 0.033  0.034  0.016 0.032 0.014 -O3 (with auto vectorization)

The code for ROT and 2SWAP does not use auto-vectorization, and the
code for 2! uses auto-vectorization in a way that reduces the
instruction count:

-O3 (auto-vectorized)     -O3 -fno-tree-vectorize
add    $0x8,%rbx          add $0x8,%rbx      
movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax
add    $0x18,%r13         mov 0x8(%r13),%rdx 
movhps -0x8(%r13),%xmm0   add $0x18,%r13     
movups %xmm0,(%r8)        mov %rdx,(%r8)     
mov    0x0(%r13),%r8      mov %rax,0x8(%r8)  
mov    (%rbx),%rax        mov 0x0(%r13),%r8  
jmp    *%rax              mov (%rbx),%rax    
                          jmp *%rax          

And the common tail with all these move instructions is gone.

- anton


#134534 — Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From	peter <peter.noreply@tin.it>
Date	2026-01-25 23:31 +0100
Subject	Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Message-ID	<20260125233110.000034b4@tin.it>
In reply to	#134533

On Sat, 24 Jan 2026 16:47:16 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> >Hans Bezemer <the.beez.speaks@gmail.com> writes:
> >>I've done my thing, compiled 4tH with optimizations -O3 till -O0.
> >>I thought, let's make this simple and execute ALL benchmarks I got. Some 
> >>of them have become useless, though for the simple reason hardware has 
> >>become that much better.
> >>
> >>But still, here it is. Overall, the performance consistently 
> >>deteriorates, aka -O3 gives the best performance.
> >
> >Which compiler and which hardware?
> >
> >For a random program, I would expect higher optimization levels to
> >produce faster code.  For a Forth system and these recent gccs, the
> >auto-vectorization of adjacent memory accesses may lead to similar
> >problems as in the C bubble-sort benchmark.  In Gforth, this actually
> >happens unless we disable vectorization (which we normally do), and,
> >moreover, with the vectorized code, gcc introduces additional
> >inefficiencies (see below).
> >
> >Here's the output of ./gforth-fast onebench.fs compiled from the
> >current development version with gcc-12.2 and running on a Ryzen 5800X
> >(numbers are times, lower is better):
> >
> > sieve bubble matrix   fib   fft gcc options
> > 0.025  0.023  0.013 0.033 0.016 -O2
> > 0.025  0.023  0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
> > 0.404  0.418  0.377 0.472 0.244 -O3 (with auto vectorization)
> > 0.145  0.122  0.124 0.122 0.073 gforth default, using --no-dynamic
> 
> I have now also tried it with gcc-14.2, and that produces better code.
> Results from a Xeon E-2388G (Rocket Lake):
> 
>  sieve bubble matrix   fib   fft gcc options
>  0.032  0.032  0.015 0.037 0.014 -O2 
>  0.035  0.032  0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
>  0.033  0.034  0.016 0.032 0.014 -O3 (with auto vectorization)
> 
> The code for ROT and 2SWAP does not use auto-vectorization, and the
> code for 2! uses auto-vectorization in a way that reduces the
> instruction count:
> 
> -O3 (auto-vectorized)     -O3 -fno-tree-vectorize
> add    $0x8,%rbx          add $0x8,%rbx      
> movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax
> add    $0x18,%r13         mov 0x8(%r13),%rdx 
> movhps -0x8(%r13),%xmm0   add $0x18,%r13     
> movups %xmm0,(%r8)        mov %rdx,(%r8)     
> mov    0x0(%r13),%r8      mov %rax,0x8(%r8)  
> mov    (%rbx),%rax        mov 0x0(%r13),%r8  
> jmp    *%rax              mov (%rbx),%rax    
>                           jmp *%rax          
> 
> And the common tail with all these move instructions is gone.
> 
> - anton

What does your C code look like? I could not get clang or gcc to auto-vectorize
my existing code:

  	UNS64 *tmp64 = (UNS64*)TOP; 
        tmp64[0] = sp[0]; 
        tmp64[1] = sp[1]; 
        TOP = sp[2]; 
        sp += 3;


In the end I changed my code to tell the compiler that it is a vector with

typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));

and
        *(v2u64*)TOP = *(v2u64*)sp;
        TOP=sp[2];
        sp=sp+3; 

this will produce

	vmovups	xmm0, xmmword ptr [rdx]
	vmovups	xmmword ptr [r8], xmm0
	mov	r8, qword ptr [rdx + 16]
	add	rdx, 24

	movzx	r9d, byte ptr [rcx]	// nesting code
	inc	rcx
	jmp	qword ptr [rax + 8*r9]

But also using memcpy((UNS64*)TOP, (UNS64*)sp,16); gives the same code!
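
For completeness, the memcpy variant can be written as a self-contained function. This is a sketch, not my engine's source: the function name and the way the stack pointer and TOP are passed are made up for illustration, and it assumes the destination does not overlap the stack cells being copied, since memcpy requires that.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t UNS64;

/* 2! with the target address in *top and the two cells at sp[0], sp[1].
   Returns the new stack pointer; the new top of stack is left in *top. */
UNS64 *two_store_memcpy(UNS64 *sp, UNS64 *top)
{
    /* Copy both cells at once; compilers usually expand this
       fixed-size memcpy inline. */
    memcpy((void *)(uintptr_t)*top, sp, 2 * sizeof(UNS64));
    *top = sp[2];
    return sp + 3;
}
```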

It looks like this also works on ARM64.
BR
Peter


#134536 — Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2026-01-26 19:24 +0000
Subject	Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Message-ID	<2026Jan26.202443@mips.complang.tuwien.ac.at>
In reply to	#134534

peter <peter.noreply@tin.it> writes:
>On Sat, 24 Jan 2026 16:47:16 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> I have now also tried it with gcc-14.2, and that produces better code.
>> Results from a Xeon E-2388G (Rocket Lake):
>> 
>>  sieve bubble matrix   fib   fft gcc options
>>  0.032  0.032  0.015 0.037 0.014 -O2 
>>  0.035  0.032  0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
>>  0.033  0.034  0.016 0.032 0.014 -O3 (with auto vectorization)
>> 
>> The code for ROT and 2SWAP does not use auto-vectorization, and the
>> code for 2! uses auto-vectorization in a way that reduces the
>> instruction count:
>> 
>> -O3 (auto-vectorized)     -O3 -fno-tree-vectorize
>> add    $0x8,%rbx          add $0x8,%rbx      
>> movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax
>> add    $0x18,%r13         mov 0x8(%r13),%rdx 
>> movhps -0x8(%r13),%xmm0   add $0x18,%r13     
>> movups %xmm0,(%r8)        mov %rdx,(%r8)     
>> mov    0x0(%r13),%r8      mov %rax,0x8(%r8)  
>> mov    (%rbx),%rax        mov 0x0(%r13),%r8  
>> jmp    *%rax              mov (%rbx),%rax    
>>                           jmp *%rax          
>> 
>> And the common tail with all these move instructions is gone.
>> 
>> - anton
>
>What does your C code look like? I could not get clang or gcc to auto-vectorize
>my existing code:
>
>  	UNS64 *tmp64 = (UNS64*)TOP; 
>        tmp64[0] = sp[0]; 
>        tmp64[1] = sp[1]; 
>        TOP = sp[2]; 
>        sp += 3;

Gforth's source code for 2! is:

2!	( w1 w2 a_addr -- )		core	two_store
""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell.""
a_addr[0] = w2;
a_addr[1] = w1;

A generator produces the following from that, which is passed to gcc:

LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1  */
/* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */
NAME("2!")
ip += 1;
LABEL1(two_store)
{
DEF_CA
MAYBE_UNUSED Cell w1;
MAYBE_UNUSED Cell w2;
MAYBE_UNUSED Cell * a_addr;
NEXT_P0;
vm_Cell2w(sp[2],w1);
vm_Cell2w(sp[1],w2);
vm_Cell2a_(spTOS,a_addr);
#ifdef VM_DEBUG
if (vm_debug) {
fputs(" w1=", vm_out); printarg_w(w1);
fputs(" w2=", vm_out); printarg_w(w2);
fputs(" a_addr=", vm_out); printarg_a_(a_addr);
}
#endif
sp += 3;
{
#line 1815 "prim"
a_addr[0] = w2;
a_addr[1] = w1;
#line 10136 "prim-fast.i"
}

#ifdef VM_DEBUG
if (vm_debug) {
fputs(" -- ", vm_out); fputc('\n', vm_out);
}
#endif
NEXT_P1;
spTOS = sp[0];
LABEL2(two_store)
NAME1("l2-two_store")
NEXT_P1_5;
LABEL3(two_store)
NAME1("l3-two_store")
DO_GOTO;
}

There are a lot of macros in this code, and I fear that expanding them
makes the code even less readable, but the essence for the
auto-vectorized part is something like:

w1 = sp[2];
w2 = sp[1];
a_addr = spTOS;
sp += 3;
a_addr[0] = w2;
a_addr[1] = w1;
spTOS = sp[0];

My guess is that in your code the compiler expected that sp[1] might
alias with tmp64[0], and therefore did not vectorize the loads and the
stores, whereas in the Gforth code, the loads both happen first, and
then the two stores, and gcc can vectorize that.  I doubt that there
is a big benefit from that, though.
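
The aliasing concern can be demonstrated with two small functions. This is a sketch with made-up names, not code from either system: when the destination overlaps the source by one cell, the two orderings give different results, so the compiler may not silently turn the first form into the second.

```c
#include <stdint.h>

/* Stores interleaved with loads: the second load must observe the
   first store if dst[0] and src[1] are the same location. */
void copy2_interleaved(uint64_t *dst, const uint64_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
}

/* Both loads first, then both stores: nothing the stores do can
   change the loaded values, so the pairs can be combined. */
void copy2_loads_first(uint64_t *dst, const uint64_t *src)
{
    uint64_t d0 = src[0];
    uint64_t d1 = src[1];
    dst[0] = d0;
    dst[1] = d1;
}
```

With an array {1, 2, 3} and dst pointing one cell after src, the interleaved version leaves {1, 1, 1} and the loads-first version leaves {1, 1, 2}.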

>typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));

I'll have to remember the aligned attribute for future games with gcc
explicit vectorization.

- anton


#134538 — Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From	peter <peter.noreply@tin.it>
Date	2026-01-27 15:44 +0100
Subject	Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Message-ID	<20260127154455.00000f73@tin.it>
In reply to	#134536

On Mon, 26 Jan 2026 19:24:43 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> peter <peter.noreply@tin.it> writes:
> >On Sat, 24 Jan 2026 16:47:16 GMT
> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >> I have now also tried it with gcc-14.2, and that produces better code.
> >> Results from a Xeon E-2388G (Rocket Lake):
> >> 
> >>  sieve bubble matrix   fib   fft gcc options
> >>  0.032  0.032  0.015 0.037 0.014 -O2 
> >>  0.035  0.032  0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
> >>  0.033  0.034  0.016 0.032 0.014 -O3 (with auto vectorization)
> >> 
> >> The code for ROT and 2SWAP does not use auto-vectorization, and the
> >> code for 2! uses auto-vectorization in a way that reduces the
> >> instruction count:
> >> 
> >> -O3 (auto-vectorized)     -O3 -fno-tree-vectorize
> >> add    $0x8,%rbx          add $0x8,%rbx      
> >> movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax
> >> add    $0x18,%r13         mov 0x8(%r13),%rdx 
> >> movhps -0x8(%r13),%xmm0   add $0x18,%r13     
> >> movups %xmm0,(%r8)        mov %rdx,(%r8)     
> >> mov    0x0(%r13),%r8      mov %rax,0x8(%r8)  
> >> mov    (%rbx),%rax        mov 0x0(%r13),%r8  
> >> jmp    *%rax              mov (%rbx),%rax    
> >>                           jmp *%rax          
> >> 
> >> And the common tail with all these move instructions is gone.
> >> 
> >> - anton
> >
> >What does your C code look like? I could not get clang or gcc to auto-vectorize
> >my existing code:
> >
> >  	UNS64 *tmp64 = (UNS64*)TOP; 
> >        tmp64[0] = sp[0]; 
> >        tmp64[1] = sp[1]; 
> >        TOP = sp[2]; 
> >        sp += 3;
> 
> Gforth's source code for 2! is:
> 
> 2!	( w1 w2 a_addr -- )		core	two_store
> ""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell.""
> a_addr[0] = w2;
> a_addr[1] = w1;
> 
> A generator produces the following from that, which is passed to gcc:
> 
> LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1  */
> /* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */
> NAME("2!")
> ip += 1;
> LABEL1(two_store)
> {
> DEF_CA
> MAYBE_UNUSED Cell w1;
> MAYBE_UNUSED Cell w2;
> MAYBE_UNUSED Cell * a_addr;
> NEXT_P0;
> vm_Cell2w(sp[2],w1);
> vm_Cell2w(sp[1],w2);
> vm_Cell2a_(spTOS,a_addr);
> #ifdef VM_DEBUG
> if (vm_debug) {
> fputs(" w1=", vm_out); printarg_w(w1);
> fputs(" w2=", vm_out); printarg_w(w2);
> fputs(" a_addr=", vm_out); printarg_a_(a_addr);
> }
> #endif
> sp += 3;
> {
> #line 1815 "prim"
> a_addr[0] = w2;
> a_addr[1] = w1;
> #line 10136 "prim-fast.i"
> }
> 
> #ifdef VM_DEBUG
> if (vm_debug) {
> fputs(" -- ", vm_out); fputc('\n', vm_out);
> }
> #endif
> NEXT_P1;
> spTOS = sp[0];
> LABEL2(two_store)
> NAME1("l2-two_store")
> NEXT_P1_5;
> LABEL3(two_store)
> NAME1("l3-two_store")
> DO_GOTO;
> }
> 
> There are a lot of macros in this code, and I fear that expanding them
> makes the code even less readable, but the essence for the
> auto-vectorized part is something like:
> 
> w1 = sp[2];
> w2 = sp[1];
> a_addr = spTOS;
> sp += 3;
> a_addr[0] = w2;
> a_addr[1] = w1;
> spTOS = sp[0];
> 
> My guess is that in your code the compiler expected that sp[1] might
> alias with tmp64[0], and therefore did not vectorize the loads and the
> stores, whereas in the Gforth code, the loads both happen first, and
> then the two stores, and gcc can vectorize that.  I doubt that there
> is a big benefit from that, though.

Yes, that was it. Changing to:

        UNS64 *tmp64 = (UNS64*)TOP;
        UNS64 d0 = sp[0];
        UNS64 d1 = sp[1];
        tmp64[0] = d0;
        tmp64[1] = d1;
        TOP = sp[2];
        sp += 3;

made the compiler (clang-21 in this case) generate the expected code.
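
A minimal sketch of the two orderings discussed above, for illustration
only. The names Cell, two_store_interleaved and two_store_loads_first are
hypothetical, and the stack pointer and cached top-of-stack are passed as
parameters here instead of living in VM registers. Whether a given gcc or
clang version actually merges the two stores into one wide store depends
on the version and flags.

```c
#include <stdint.h>

typedef uintptr_t Cell;   /* assumption: one cell holds a pointer */

/* Interleaved: the store to dst[0] sits between the two stack loads.
   If dst[0] could overlap sp[1], the compiler has to keep this order. */
Cell *two_store_interleaved(Cell *sp, Cell *top)
{
    Cell *dst = (Cell *)*top;
    dst[0] = sp[0];
    dst[1] = sp[1];
    *top = sp[2];          /* refill the cached top of stack */
    return sp + 3;         /* drop the three consumed cells */
}

/* Loads first: both stack cells are read into locals before any store,
   so the two stores are adjacent and independent of the loads. */
Cell *two_store_loads_first(Cell *sp, Cell *top)
{
    Cell *dst = (Cell *)*top;
    Cell d0 = sp[0];
    Cell d1 = sp[1];
    dst[0] = d0;
    dst[1] = d1;
    *top = sp[2];
    return sp + 3;
}
```

Both functions give the same result whenever the destination does not
overlap the stack cells being read; the second one only makes that
independence visible to the compiler.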


> 
> >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
> 
> I'll have to remember the aligned attribute for future games with gcc
> explicit vectorization.

Without that, it will generate the opcodes that need 16-byte alignment.

BR
Peter

> - anton

#134542 — Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2026-01-29 18:27 +0000
Subject	Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Message-ID	<2026Jan29.192712@mips.complang.tuwien.ac.at>
In reply to	#134538

peter <peter.noreply@tin.it> writes:
>On Mon, 26 Jan 2026 19:24:43 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> peter <peter.noreply@tin.it> writes:
>> >On Sat, 24 Jan 2026 16:47:16 GMT
>> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> >> The code for ROT and 2SWAP does not use auto-vectorization, and the
>> >> code for 2! uses auto-vectorization in a way that reduces the
>> >> instruction count:
>> >> 
>> >> -O3 (auto-vectorized)     -O3 -fno-tree-vectorize
>> >> add    $0x8,%rbx          add $0x8,%rbx      
>> >> movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax
>> >> add    $0x18,%r13         mov 0x8(%r13),%rdx 
>> >> movhps -0x8(%r13),%xmm0   add $0x18,%r13     
>> >> movups %xmm0,(%r8)        mov %rdx,(%r8)     
>> >> mov    0x0(%r13),%r8      mov %rax,0x8(%r8)  
>> >> mov    (%rbx),%rax        mov 0x0(%r13),%r8  
>> >> jmp    *%rax              mov (%rbx),%rax    
>> >>                           jmp *%rax          
...
>	UNS64 *tmp64 = (UNS64*)TOP;
>        UNS64 d0=sp[0];
>        UNS64 d1=sp[1];    
>        tmp64[0] = d0; 
>        tmp64[1] = d1; 
>        TOP = sp[2]; 
>        sp += 3;    
>
>made the compiler (clang-21 in this case) generate the expected code

The auto-vectorized implementation of 2! above should perform ok,
because it loads each stack item separately, and the wide movups is
only used for the stores.  If there is a wide load from the stack
involved, I expect a significant slowdown, because the stack items
usually have been stored recently, and narrow-store-to-wide-load
forwarding is a slow path on recent (and presumably also older) CPU
cores: https://www.complang.tuwien.ac.at/anton/stwlf/

>> >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
>> 
>> I'll have to remember the aligned attribute for future games with gcc
>> explicit vectorization.
>
>Without that it will generate the opcodes that needs 16 byte alignment

Yes.  Until now I worked around that by using memcpy to a vector
variable, but this approach is much more convenient.
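
For illustration, a sketch of both approaches with the GCC/Clang vector
extensions. The function names are hypothetical, uint64_t stands in for
UNS64, and the may_alias attribute is an addition of this sketch (a
precaution for the pointer casts), not part of the typedef quoted above.

```c
#include <stdint.h>
#include <string.h>

/* 16-byte vector of two 64-bit lanes with the default 16-byte alignment. */
typedef uint64_t v2u64_a16 __attribute__((vector_size(16)));

/* Same vector, but the type only promises 8-byte alignment, so the
   compiler cannot assume 16-byte alignment for accesses through it. */
typedef uint64_t v2u64 __attribute__((vector_size(16)))
                       __attribute__((aligned(8)))
                       __attribute__((may_alias));

/* Workaround: go through memcpy, which assumes no alignment at all. */
void copy2_memcpy(uint64_t *dst, const uint64_t *src)
{
    v2u64_a16 v;
    memcpy(&v, src, sizeof v);
    memcpy(dst, &v, sizeof v);
}

/* With the reduced-alignment typedef a plain dereference is enough. */
void copy2_aligned8(uint64_t *dst, const uint64_t *src)
{
    *(v2u64 *)dst = *(const v2u64 *)src;
}
```

Both variants read 16 bytes at once, so the forwarding caveat above
applies if the source cells were just written with narrow stores.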

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/

#134544 — Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From	albert@spenarnc.xs4all.nl
Date	2026-01-30 13:20 +0100
Subject	Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Message-ID	<nnd$1f48c6e8$48795598@3884e8505482cce2>
In reply to	#134542

In article <2026Jan29.192712@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>peter <peter.noreply@tin.it> writes:
>>On Mon, 26 Jan 2026 19:24:43 GMT
>>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>
>>> peter <peter.noreply@tin.it> writes:
>>> >On Sat, 24 Jan 2026 16:47:16 GMT
>>> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>> >> The code for ROT and 2SWAP does not use auto-vectorization, and the
>>> >> code for 2! uses auto-vectorization in a way that reduces the
>>> >> instruction count:
>>> >>
>>> >> -O3 (auto-vectorized)     -O3 -fno-tree-vectorize
>>> >> add    $0x8,%rbx          add $0x8,%rbx
>>> >> movq   0x8(%r13),%xmm0    mov 0x10(%r13),%rax
>>> >> add    $0x18,%r13         mov 0x8(%r13),%rdx
>>> >> movhps -0x8(%r13),%xmm0   add $0x18,%r13
>>> >> movups %xmm0,(%r8)        mov %rdx,(%r8)
>>> >> mov    0x0(%r13),%r8      mov %rax,0x8(%r8)
>>> >> mov    (%rbx),%rax        mov 0x0(%r13),%r8
>>> >> jmp    *%rax              mov (%rbx),%rax
>>> >>                           jmp *%rax
>...
>>      UNS64 *tmp64 = (UNS64*)TOP;
>>        UNS64 d0=sp[0];
>>        UNS64 d1=sp[1];
>>        tmp64[0] = d0;
>>        tmp64[1] = d1;
>>        TOP = sp[2];
>>        sp += 3;
>>
>>made the compiler (clang-21 in this case) generate the expected code
>
>The auto-vectorized implementation of 2! above should perform ok,
>because it loads each stack item separately, and the wide movups is
>only used for the stores.  If there is a wide load from the stack
>involved, I expect a significant slowdown, because the stack items
>usually have been stored recently, and narrow-store-to-wide-load
>forwarding is a slow path on recent (and presumably also older) CPU
>cores: https://www.complang.tuwien.ac.at/anton/stwlf/
>
>>> >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
>>>
>>> I'll have to remember the aligned attribute for future games with gcc
>>> explicit vectorization.
>>
>>Without that it will generate the opcodes that needs 16 byte alignment
>
>Yes.  Until now I worked around that by using memcpy to a vector
>variable, but this approach is much more convenient.

I always wonder, is this relevant to the industrial applications of
gforth or to gforth-based programs that are sold commercially?

>
>- anton
-- 
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.

#134548 — Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2026-01-30 18:00 +0000
Subject	Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...)
Message-ID	<2026Jan30.190031@mips.complang.tuwien.ac.at>
In reply to	#134544

albert@spenarnc.xs4all.nl writes:
>I always wonder, is this relevant to the industrial applications of
>gforth or gforth based programs that are sold commercially?

Are there any Gforth-based programs that are sold commercially?
Concerning industrial applications, the only ones I know about have to
do with Open Firmware, and I doubt that those care much about the
performance of Gforth.  But there are probably industrial applications
(maybe even commercial programs) that use Gforth that I do not know
about.  If one of the IBM users had not contacted us when he left the
group, I would not know about the application of Gforth within IBM.

- anton

#134513

From	Paul Rubin <no.email@nospam.invalid>
Date	2026-01-16 23:10 -0800
Message-ID	<87wm1gpvdr.fsf@nightsong.com>
In reply to	#134511

Hans Bezemer <the.beez.speaks@gmail.com> writes:
> 5. I added GCC extension support to 4tH in version 3.62.0. At the
> time, it improved performance by about 25%. By accident I found out
> that was no longer true. switch() based was faster. I didn't know
> there had been changes in that regard to GCC.

If you mean the goto *a feature, these days you might try using tail
calls instead.  GCC and LLVM both now support a musttail attribute that
ensures this optimization, or signals a compile-time error if it can't.

https://lwn.net/Articles/1033373/
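
A minimal sketch of what a tail-call dispatched VM could look like in C.
Everything here (Cell, Inst, Handler, the op_* handlers, MUSTTAIL, run,
run_two_plus_three) is hypothetical and not taken from 4tH or Gforth. The
MUSTTAIL macro expands to nothing on compilers that do not report the
attribute; in that case the tail calls are not guaranteed.

```c
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

typedef long Cell;
struct Inst;
/* Every primitive has the same signature, which musttail requires. */
typedef Cell (*Handler)(const struct Inst *ip, Cell *sp);
typedef struct Inst { Handler fn; Cell arg; } Inst;

/* Push the literal stored in the instruction, then dispatch the next one. */
static Cell op_lit(const Inst *ip, Cell *sp)
{
    *--sp = ip->arg;
    ip++;
    MUSTTAIL return ip->fn(ip, sp);
}

/* Add the two top cells, then dispatch the next instruction. */
static Cell op_add(const Inst *ip, Cell *sp)
{
    sp[1] += sp[0];
    sp++;
    ip++;
    MUSTTAIL return ip->fn(ip, sp);
}

/* Stop and return the top of stack. */
static Cell op_halt(const Inst *ip, Cell *sp)
{
    (void)ip;
    return sp[0];
}

/* Entry point: the data stack grows downwards from the end of the array. */
Cell run(const Inst *prog)
{
    Cell stack[16];
    return prog->fn(prog, stack + 16);
}

/* Example program: 2 3 + */
Cell run_two_plus_three(void)
{
    const Inst prog[] = {
        { op_lit, 2 }, { op_lit, 3 }, { op_add, 0 }, { op_halt, 0 }
    };
    return run(prog);
}
```

The VM state (ip and sp here) travels in the function arguments, so each
primitive ends with a jump to the next one instead of returning to a
central loop.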

#134514

From	Hans Bezemer <the.beez.speaks@gmail.com>
Date	2026-01-17 16:58 +0100
Message-ID	<nnd$4c8a9957$56c8fc09@4ad2500852ea2034>
In reply to	#134513

On 17-01-2026 08:10, Paul Rubin wrote:
> Hans Bezemer <the.beez.speaks@gmail.com> writes:
>> 5. I added GCC extension support to 4tH in version 3.62.0. At the
>> time, it improved performance by about 25%. By accident I found out
>> that was no longer true. switch() based was faster. I didn't know
>> there had been changes in that regard to GCC.
> 
> If you mean the goto *a feature, these days you might try using tail
> calls instead.  GCC and LLVM both now support a musttail attribute that
> ensures this optimization, or signals a compile-time error if it can't.
> 
> https://lwn.net/Articles/1033373/

Thanks for the article! But unlike with the Python interpreter, you
could (thanks to some preprocessor magic) select how 4tH's VM would be
compiled with NO changes to the source code whatsoever. That's why it
could be reversed so easily by accident.
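
As an illustration of that kind of preprocessor switch (this is not 4tH's
actual code; the VM_* macros, the opcodes and vm_run are hypothetical), a
sketch where the same VM body compiles either to a switch loop or to GCC's
labels-as-values dispatch, selected by defining VM_USE_SWITCH:

```c
enum { OP_LIT, OP_ADD, OP_HALT };

#if defined(__GNUC__) && !defined(VM_USE_SWITCH)
/* GCC "labels as values": each primitive jumps directly to the next. */
#  define VM_TABLE  static void *const labels[] = \
                        { &&L_OP_LIT, &&L_OP_ADD, &&L_OP_HALT };
#  define VM_BEGIN  goto *labels[*ip];
#  define VM_OP(op) L_##op:
#  define VM_NEXT   goto *labels[*ip]
#  define VM_END
#else
/* Portable fallback: one central switch inside a loop. */
#  define VM_TABLE
#  define VM_BEGIN  for (;;) switch (*ip) {
#  define VM_OP(op) case op:
#  define VM_NEXT   break
#  define VM_END    default: return -1; }
#endif

/* Runs a program of opcodes (OP_LIT is followed by its operand) and
   returns the top of stack at OP_HALT. */
long vm_run(const long *ip)
{
    long stack[16];
    long *sp = stack + 16;          /* stack grows downwards */
    VM_TABLE
    VM_BEGIN
    VM_OP(OP_LIT)  *--sp = ip[1]; ip += 2; VM_NEXT;
    VM_OP(OP_ADD)  sp[1] += sp[0]; sp++; ip += 1; VM_NEXT;
    VM_OP(OP_HALT) return sp[0];
    VM_END
}
```

The body of vm_run is identical in both builds; only the macro
definitions decide which dispatch method the compiler sees.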

The tail call method, however, requires an entirely different VM. That's
a lot of work for about a 10% performance improvement that may not even
survive a single GCC update. And it means maintaining two VMs.

So, I have to contemplate this carefully before putting work in it. But 
it's nice to know that I was not crazy noticing this ;-) And learning 
about a new GCC technique. :)

Hans Bezemer

#134515

From	Paul Rubin <no.email@nospam.invalid>
Date	2026-01-17 20:21 -0800
Message-ID	<87sec3pn3r.fsf@nightsong.com>
In reply to	#134514

Hans Bezemer <the.beez.speaks@gmail.com> writes:
> The tail call method however, requires an entirely different
> VM. That's a lot of work for about 10% performance improvement - that
> may not even last for a single GCC update. And requires two VM's to
> maintain..

You'd have to change the VM but on the other hand, it's a documented and
supported feature of both GCC and Clang, and other compilers might get
it too.  I wouldn't worry about it vanishing with the next GCC update.

#134516

From	Hans Bezemer <the.beez.speaks@gmail.com>
Date	2026-01-18 15:26 +0100
Message-ID	<nnd$7abbb4a6$4c11959f@4809eb2c0ea40e3a>
In reply to	#134515

On 18-01-2026 05:21, Paul Rubin wrote:
> Hans Bezemer <the.beez.speaks@gmail.com> writes:
>> The tail call method however, requires an entirely different
>> VM. That's a lot of work for about 10% performance improvement - that
>> may not even last for a single GCC update. And requires two VM's to
>> maintain..
> 
> You'd have to change the VM but on the other hand, it's a documented and
> supported feature of both GCC and Clang, and other compilers might get
> it too.  I wouldn't worry about it vanishing with the next GCC update.

Well, the "goto" feature hasn't disappeared either. It's just been
nullified. Rendered useless. That's what I mean. And again? 10%? Really?

Hans Bezemer

#134519

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2026-01-18 22:17 +0000
Message-ID	<2026Jan18.231745@mips.complang.tuwien.ac.at>
In reply to	#134514

Hans Bezemer <the.beez.speaks@gmail.com> writes:
>On 17-01-2026 08:10, Paul Rubin wrote:
>> Hans Bezemer <the.beez.speaks@gmail.com> writes:
>>> 5. I added GCC extension support to 4tH in version 3.62.0. At the
>>> time, it improved performance by about 25%. By accident I found out
>>> that was no longer true. switch() based was faster. I didn't know
>>> there had been changes in that regard to GCC.

You would have to look at the generated code.  Which gcc version did
you use?  Certainly in my results on
<http://www.complang.tuwien.ac.at/forth/threading/> switch usually is
slower than direct or indirect threaded code.

>The tail call method however, requires an entirely different VM.

It's also just a question of defining some macros appropriately.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/

#134517

From	albert@spenarnc.xs4all.nl
Date	2026-01-18 16:34 +0100
Message-ID	<nnd$613a150b$00989354@a56038f66d0e5c37>
In reply to	#134513

In article <87wm1gpvdr.fsf@nightsong.com>,
Paul Rubin  <no.email@nospam.invalid> wrote:
>Hans Bezemer <the.beez.speaks@gmail.com> writes:
>> 5. I added GCC extension support to 4tH in version 3.62.0. At the
>> time, it improved performance by about 25%. By accident I found out
>> that was no longer true. switch() based was faster. I didn't know
>> there had been changes in that regard to GCC.
>
>If you mean the goto *a feature, these days you might try using tail
>calls instead.  GCC and LLVM both now support a musttail attribute that
>ensures this optimization, or signals a compile-time error if it can't.
>
>https://lwn.net/Articles/1033373/

If you pass an address a as a tail call, is that approximately equivalent
to coroutines? For example:

: HEX:    R> BASE @ >R  >R HEX CO R> BASE ! ;

Used for example as

: .H HEX: . ;

In this case the tail call is `` R> BASE ! '' to restore the base?

Groetjes Albert

#134523

From	Paul Rubin <no.email@nospam.invalid>
Date	2026-01-20 00:35 -0800
Message-ID	<87bjioptpk.fsf@nightsong.com>
In reply to	#134517

albert@spenarnc.xs4all.nl writes:
> If you pass an address a as a tail call is it approximately equal
> to coroutines:

No, I don't think so.  The tail call is just a jump to that address
(changes the program counter).  A coroutine jump also has to change the
stack pointer.  See the section "Knuth's coroutines" here:

https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

Some Forths have a CO primitive that I think is similar.  There is
something like it on the Greenarrays processor.
