| Started by | dxf <dxforth@gmail.com> |
|---|---|
| First post | 2026-01-15 17:41 +1100 |
| Last post | 2026-01-20 22:17 +0000 |
| Articles | 20 on this page of 30 — 7 participants |
EuroForth 2025 preliminary proceedings dxf <dxforth@gmail.com> - 2026-01-15 17:41 +1100
Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-15 12:04 +0000
Re: EuroForth 2025 preliminary proceedings Hans Bezemer <the.beez.speaks@gmail.com> - 2026-01-16 15:25 +0100
Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-16 17:38 +0000
Re: EuroForth 2025 preliminary proceedings Hans Bezemer <the.beez.speaks@gmail.com> - 2026-01-22 16:51 +0100
C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-24 11:28 +0000
Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-24 16:47 +0000
Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) peter <peter.noreply@tin.it> - 2026-01-25 23:31 +0100
Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-26 19:24 +0000
Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) peter <peter.noreply@tin.it> - 2026-01-27 15:44 +0100
Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-29 18:27 +0000
Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) albert@spenarnc.xs4all.nl - 2026-01-30 13:20 +0100
Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-30 18:00 +0000
Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-16 23:10 -0800
Re: EuroForth 2025 preliminary proceedings Hans Bezemer <the.beez.speaks@gmail.com> - 2026-01-17 16:58 +0100
Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-17 20:21 -0800
Re: EuroForth 2025 preliminary proceedings Hans Bezemer <the.beez.speaks@gmail.com> - 2026-01-18 15:26 +0100
Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-18 22:17 +0000
Re: EuroForth 2025 preliminary proceedings albert@spenarnc.xs4all.nl - 2026-01-18 16:34 +0100
Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-20 00:35 -0800
Re: EuroForth 2025 preliminary proceedings albert@spenarnc.xs4all.nl - 2026-01-20 12:12 +0100
Coroutines in Forth Gerry Jackson <do-not-use@swldwa.uk> - 2026-04-02 20:59 +0100
Re: Coroutines in Forth Paul Rubin <no.email@nospam.invalid> - 2026-04-04 18:02 -0700
Re: Coroutines in Forth Paul Rubin <no.email@nospam.invalid> - 2026-04-04 21:21 -0700
Re: EuroForth 2025 preliminary proceedings peter <peter.noreply@tin.it> - 2026-01-19 23:26 +0100
Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-19 15:22 -0800
Re: EuroForth 2025 preliminary proceedings peter <peter.noreply@tin.it> - 2026-01-20 10:44 +0100
Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-20 22:36 +0000
Re: EuroForth 2025 preliminary proceedings Paul Rubin <no.email@nospam.invalid> - 2026-01-20 00:33 -0800
Re: EuroForth 2025 preliminary proceedings anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-20 22:17 +0000
| From | dxf <dxforth@gmail.com> |
|---|---|
| Date | 2026-01-15 17:41 +1100 |
| Subject | EuroForth 2025 preliminary proceedings |
| Message-ID | <69688c01$1@news.ausics.net> |
As I had trouble finding it, perhaps others too. Here's the link: http://www.euroforth.org/ef25/papers/ There is no link from the main page. Someone had referenced Nick Nelson's 'Forth 2025' paper and I was curious to read it.
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-01-15 12:04 +0000 |
| Message-ID | <2026Jan15.130413@mips.complang.tuwien.ac.at> |
| In reply to | #134509 |
dxf <dxforth@gmail.com> writes:
>As I had trouble finding it, perhaps others too. Here's the link:
>
>http://www.euroforth.org/ef25/papers/
>
>There is no link from the main page.
Thank you. As it happens, yesterday I created the post-conference
proceedings that includes a late paper and the slides that were
provided by their authors (not that many; apparently many authors are
content with the prospect of their presentation being preserved on
video). I have now updated various links for the post-conference
state (link from www.euroforth.org to proceedings, and from the
proceedings to the euro.theforth.net page).
Unfortunately, the videos are not yet available. Gerald Wodni has not
yet had the time to process them. He mentioned something like "after
January" or somesuch.
I think that submitting slides has not just the advantage that they
are published earlier in this case (or at all in the 2023 case, where
the audio was so problematic that most videos were not published), but
also that one can read them faster than watch a video; the videos have
the audio track and interactive demos in addition to the text and
graphics of the slides, though.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/
| From | Hans Bezemer <the.beez.speaks@gmail.com> |
|---|---|
| Date | 2026-01-16 15:25 +0100 |
| Message-ID | <nnd$3a148ef5$137ee4b5@b1e8191b89e23503> |
| In reply to | #134510 |
On 15-01-2026 13:04, Anton Ertl wrote:
A few observations concerning the IMHO most interesting paper,
"Code-Copying Compilation in Production":
1. Code copying indeed makes a big difference, overall I estimate about
twice as fast;
2. The performance of VFX Forth continues to impress me, keeping up
nicely, even with C compiled code;
3. Commercial compilers (partly) using conventional compilers (see TF,
fig. 4.7) - that was new to me;
4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me
too. I might experiment with that one;
5. I added GCC extension support to 4tH in version 3.62.0. At the time,
it improved performance by about 25%. By accident I found out that was
no longer true. switch() based was faster. I didn't know there had been
changes in that regard to GCC.
Hans Bezemer
> dxf <dxforth@gmail.com> writes:
>> As I had trouble finding it, perhaps others too. Here's the link:
>>
>> http://www.euroforth.org/ef25/papers/
>>
>> There is no link from the main page.
>
> Thank you. As it happens, yesterday I created the post-conference
> proceedings that includes a late paper and the slides that were
> provided by their authors (not that many; apparently many authors are
> content with the prospect of their presentation being preserved on
> video). I have now updated various links for the post-conference
> state (link from www.euroforth.org to proceedings, and from the
> proceedings to the euro.theforth.net page).
>
> Unfortunately, the videos are not yet available. Gerald Wodni has not
> yet had the time to process them. He mentioned something like "after
> January" or somesuch.
>
> I think that submitting slides has not just the advantage that they
> are published earlier in this case (or at all in the 2023 case, where
> the audio was so problematic that most videos were not published), but
> also that one can read them faster than watch a video; the videos have
> the audio track and interactive demos in addition to the text and
> graphics of the slides, though.
>
> - anton
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-01-16 17:38 +0000 |
| Message-ID | <2026Jan16.183803@mips.complang.tuwien.ac.at> |
| In reply to | #134511 |
Hans Bezemer <the.beez.speaks@gmail.com> writes:
>On 15-01-2026 13:04, Anton Ertl wrote:
>
>A few observations concerning the IMHO most interesting paper,
>"Code-Copying Compilation in Production":
...
>3. Commercial compilers (partly) using conventional compilers (see TF,
>fig. 4.7) - that was new to me;
All Forth compilers I know work at the text interpretation level as
the "Forth compiler" of Thinking Forth, Figure 4.7.
>4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me
>too. I might experiment with that one;
I have analyzed it for bubblesort. There the problem is that gcc -O3
auto-vectorizes the pair of loads and the pair of stores (when the two
elements are swapped). As a result, if a pair is stored in one
iteration, the next iteration loads a pair that overlaps the
previously stored pair. This means that the hardware cannot use its
fast path in store-to-load forwarding, and leads to a huge slowdown.
For a benchmark that has been around for over 40 years.
In addition, the code generated by gcc -O3 also executes several
additional instructions per iteration, so I doubt that it would be
faster even if the store-to-load forwarding problem did not exist.
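As an illustration of the access pattern described above, here is a minimal sketch of a bubble-sort inner loop in C. It is not the benchmark source discussed in the paper; the function name `bubble_sort` and the use of `long` elements are assumptions made for this example. The swap reads two adjacent elements and writes two adjacent elements, and the next iteration reads a pair that overlaps the pair just written, which is the situation in which a vectorized pair load can miss the fast store-to-load forwarding path.

```c
#include <stddef.h>

/* Sketch only: a plain bubble sort whose swap touches two adjacent
   elements. A compiler may combine the two loads and the two stores of
   the swap into one wide load and one wide store. */
void bubble_sort(long *a, size_t n)
{
    if (n < 2)
        return;
    for (size_t pass = 0; pass < n - 1; pass++) {
        for (size_t i = 0; i + 1 < n - pass; i++) {
            if (a[i] > a[i + 1]) {
                long lo = a[i + 1];   /* load pair: a[i], a[i+1] */
                long hi = a[i];
                a[i] = lo;            /* store pair: a[i], a[i+1] */
                a[i + 1] = hi;
                /* The next iteration loads a[i+1], a[i+2], which
                   overlaps the pair stored here. */
            }
        }
    }
}
```

Comparing the code generated for such a loop at -O2 and at -O3 (with and without -fno-tree-vectorize) is one way to check whether a given gcc version vectorizes the swap.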
For fib, I have also looked at the generated code, but have not
understood it well enough to see why the code generated by gcc -O3 is
slower.
- anton
| From | Hans Bezemer <the.beez.speaks@gmail.com> |
|---|---|
| Date | 2026-01-22 16:51 +0100 |
| Message-ID | <nnd$7cecfc2e$135c60e6@11ec9b68cac8aeb0> |
| In reply to | #134512 |
On 16-01-2026 18:38, Anton Ertl wrote:
On 17-01-2026 16:58, Hans Bezemer wrote:
I've done my thing, compiled 4tH with optimizations -O3 till -O0.
I thought, let's make this simple and execute ALL benchmarks I got. Some
of them have become useless, though for the simple reason hardware has
become that much better.
But still, here it is. Overall, the performance consistently
deteriorates, aka -O3 gives the best performance. There are a few minor
glitches, some due to random benchmark data.
For those curious, this is a European CSV with all the data. BTW, you
can find all benchmarks here:
https://sourceforge.net/p/forth-4th/code/HEAD/tree/trunk/4th.src/bench/
Hans Bezemer
---8<---
Benchmark;-O3;-O2;-O1;-O0
bench.4th;6.79;6.36;6.68;6.33
benchm.4th;1.21;1.66;1.86;2.8
benchxls.4th;0.06;0.08;0.08;0.12
bubble.4th;0.69;0.95;0.96;1.72
bytesiev.4th;0.01;0.01;0.01;0.02
countbit.4th;3.52;4.76;5.02;8.01
cowell.4th;15.15;20.2;18.91;31.29
fib.4th;0.79;1.02;1.02;1.72
isortest.4th;0.23;0.33;0.31;0.56
matrix.4th;0.22;0.31;0.3;0.51
misty.4th;0.58;0.84;1.01;1.59
pforth.4th;10.47;13.55;14.42;22.68
prims.4th;5.96;8;8.59;14.28
simple.4th;0.5;0.7;0.82;1.21
sortest.4th;140.96;163.68;150.17;270.87
thread.4th;0.35;0.41;0.49;0.7
---8<---
> Hans Bezemer <the.beez.speaks@gmail.com> writes:
>> On 15-01-2026 13:04, Anton Ertl wrote:
>>
>> A few observations concerning the IMHO most interesting paper,
>> "Code-Copying Compilation in Production":
> ...
>> 3. Commercial compilers (partly) using conventional compilers (see TF,
>> fig. 4.7) - that was new to me;
>
> All Forth compilers I know work at the text interpretation level as
> the "Forth compiler" of Thinking Forth, Figure 4.7.
>
>> 4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me
>> too. I might experiment with that one;
>
> I have analyzed it for bubblesort. There the problem is that gcc -O3
> auto-vectorizes the pair of loads and the pair of stores (when the two
> elements are swapped). As a result, if a pair is stored in one
> iteration, the next iteration loads a pair that overlaps the
> previously stored pair. This means that the hardware cannot use its
> fast path in store-to-load forwarding, and leads to a huge slowdown.
> For a benchmark that has been around for over 40 years.
>
> In addition, the code generated by gcc -O3 also executes several
> additional instructions per iteration, so I doubt that it would be
> faster even if the store-to-load forwarding problem did not exist.
>
> For fib, I have also looked at the generated code, but have not
> understood it well enough to see why the code generated by gcc -O3 is
> slower.
>
> - anton
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-01-24 11:28 +0000 |
| Subject | C compiler optimization and Forth engines (was: EuroForth 2025 ...) |
| Message-ID | <2026Jan24.122830@mips.complang.tuwien.ac.at> |
| In reply to | #134531 |
Hans Bezemer <the.beez.speaks@gmail.com> writes:
>I've done my thing, compiled 4tH with optimizations -O3 till -O0.
>I thought, let's make this simple and execute ALL benchmarks I got. Some
>of them have become useless, though for the simple reason hardware has
>become that much better.
>
>But still, here it is. Overall, the performance consistently
>deteriorates, aka -O3 gives the best performance.
Which compiler and which hardware?
For a random program, I would expect higher optimization levels to
produce faster code. For a Forth system and these recent gccs, the
auto-vectorization of adjacent memory accesses may lead to similar
problems as in the C bubble-sort benchmark. In Gforth, this actually
happens unless we disable vectorization (which we normally do), and,
moreover, with the vectorized code, gcc introduces additional
inefficiencies (see below).
Here's the output of ./gforth-fast onebench.fs compiled from the
current development version with gcc-12.2 and running on a Ryzen 5800X
(numbers are times, lower is better):
sieve  bubble  matrix  fib    fft    gcc options
0.025  0.023   0.013   0.033  0.016  -O2
0.025  0.023   0.013   0.037  0.016  -O3 -fno-tree-vectorize (gforth default)
0.404  0.418   0.377   0.472  0.244  -O3 (with auto vectorization)
0.145  0.122   0.124   0.122  0.073  gforth default, using --no-dynamic
So how is the code different? Here's the code for ROT:
-O3 (auto-vectorized)     -O3 -fno-tree-vec...    -O2
add $0x8,%rbx             add $0x8,%rbx           add $0x8,%rbx
movq 0x8(%r10),%xmm1      mov 0x8(%r10),%rdx      mov 0x8(%r10),%rdx
mov 0x10(%r10),%rcx       mov 0x10(%r10),%rax     mov 0x10(%r10),%rax
punpcklqdq %xmm1,%xmm1    mov %r13,0x8(%r10)      mov %r13,0x8(%r10)
punpckhqdq %xmm1,%xmm0    mov %rdx,0x10(%r10)     mov %rdx,0x10(%r10)
movups %xmm0,0x8(%r10)    mov %rax,%r13           mov %rax,%r13
mov (%rbx),%rax           mov (%rbx),%rax         mov (%rbx),%rax
mov %r14,0x8(%rsp)        jmp *%rax               jmp *%rax
mov %rax,%r11
mov %r15,%r9
mov %rcx,0x10(%rsp)
jmp 0x55bff2a58a99
So in this case -O3 without auto-vectorization generates the same code
as -O2. Auto-vectorization, OTOH, replaces
mov 0x8(%r10),%rdx
mov 0x10(%r10),%rax
with
movq 0x8(%r10),%xmm1
and then performs the rotation with the punpck instructions, finally
storing two cells into memory with movups. For some reason it also
separately loads 0x10(%r10) into %rcx (instead of extracting it from
%xmm1), and eventually stores it to 0x10(%rsp), which seems to be one
of the locations of the TOS.
I expect that gcc's auto-vectorization will do similar things to
primitives like ROT 2! 2SWAP (all of which are hit in gforth) in other
Forth systems with a C substrate, because they all tend to access two
(or more) adjacent cells.
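To make the "two adjacent cells" pattern concrete, here is a minimal C sketch of a ROT primitive for an engine that caches the top of stack in a variable. It is not Gforth's source; the `Stack` struct, the field names and the stack layout (second item at `sp[1]`, third item at `sp[2]`, inferred from the 0x8 and 0x10 offsets in the listing above) are assumptions made for this illustration.

```c
typedef long Cell;

/* Hypothetical engine state for this sketch: tos caches the top item,
   sp[1] holds the second item and sp[2] the third. */
typedef struct {
    Cell *sp;
    Cell  tos;
} Stack;

/* ROT ( a b c -- b c a ): two adjacent loads followed by two adjacent
   stores, the shape that an auto-vectorizer may turn into one wide
   load and one wide store. */
void rot(Stack *s)
{
    Cell second = s->sp[1];   /* b */
    Cell third  = s->sp[2];   /* a */
    s->sp[1] = s->tos;        /* c becomes the second item */
    s->sp[2] = second;        /* b becomes the third item */
    s->tos   = third;         /* a becomes the top of stack */
}
```

Primitives such as 2! and 2SWAP have the same shape, which is why they are candidates for the same transformation.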
But the big hit with the auto-vectorized code is not these changes,
but what happens at the end of the primitive: without
auto-vectorization there is the indirect jump of the threaded-code
dispatch, but with auto-vectorization it jumps to 0x55bff2a58a99:
0x000055bff2a58a99 <gforth_engine2+153>: movq 0x8(%rsp),%xmm0
0x000055bff2a58a9f <gforth_engine2+159>: movq %r9,%xmm1
0x000055bff2a58aa4 <gforth_engine2+164>: movhps 0x8(%rsp),%xmm1
0x000055bff2a58aa9 <gforth_engine2+169>: movhps 0x10(%rsp),%xmm0
0x000055bff2a58aae <gforth_engine2+174>: movhlps %xmm0,%xmm5
0x000055bff2a58ab1 <gforth_engine2+177>: movq %xmm0,%r14
0x000055bff2a58ab6 <gforth_engine2+182>: movq %xmm1,%r15
0x000055bff2a58abb <gforth_engine2+187>: movhps %xmm1,0x18(%rsp)
0x000055bff2a58ac0 <gforth_engine2+192>: movq %xmm5,%r8
0x000055bff2a58ac5 <gforth_engine2+197>: mov %r15,%rdi
0x000055bff2a58ac8 <gforth_engine2+200>: mov %r14,%rsi
0x000055bff2a58acb <gforth_engine2+203>: mov %r8,%rcx
0x000055bff2a58ace <gforth_engine2+206>: jmp *%r11
We can see here that, among other things, 0x10(%rsp) (the TOS) is
loaded into %xmm0 and then moved through %xmm5 into %r8 and then %rcx,
as well as through %r14 into %rsi, so at the end TOS resides in all
those places. And I see that other primitives expect the TOS in some
of those places, e.g. 1+:
-O3 (auto-vectorized)     -O3 -fno-tree-vec...
add $0x8,%rbx             add $0x8,%rbx
lea 0x1(%r8),%rcx         add $0x1,%r13
mov (%rbx),%rax           mov (%rbx),%rax
mov %r14,0x8(%rsp)        jmp *%rax
mov %rax,%r11
mov %r15,%r9
mov %rcx,0x10(%rsp)
jmp 0x55bff2a58a99
Jumping to 0x55bff2a58a99 instead of performing an indirect jump
disables dynamic native code generation in Gforth and all the
optimizations that are based on it. You can see in the --no-dynamic
line how much that costs. The remaining factor of 3 is probably due
to the large number of additional instructions that are performed in
the auto-vectorized engine.
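For readers unfamiliar with this style of dispatch, the following is a minimal sketch of an engine in which every primitive ends with its own indirect jump, written with the GNU C "labels as values" extension (`&&label` and `goto *`). It is not Gforth's engine: the opcode names, the token-indexed label table and the function `run` are hypothetical and exist only for this example. The point is that each primitive's code is self-contained up to its final `goto *`, whereas in the auto-vectorized gcc-12.2 build shown above the primitives instead jump to a shared tail.

```c
typedef long Cell;

/* Hypothetical opcodes for this sketch only. */
enum { OP_LIT, OP_ADD, OP_ONE_PLUS, OP_HALT };

/* Runs a token-coded program and returns the top of stack at OP_HALT.
   Requires GNU C (gcc or clang) for &&label and goto *. */
Cell run(const Cell *code)
{
    static void *labels[] = { &&lit, &&add, &&one_plus, &&halt };
    Cell stack[16];
    Cell *sp = stack + 16;    /* stack grows downward, starts empty */
    Cell tos = 0;             /* top of stack cached in a variable */
    const Cell *ip = code;

/* Each primitive ends with its own copy of this indirect jump. */
#define NEXT goto *labels[*ip++]

    NEXT;
lit:
    *--sp = tos;              /* spill the old top of stack */
    tos = *ip++;              /* inline literal follows the opcode */
    NEXT;
add:
    tos += *sp++;
    NEXT;
one_plus:
    tos += 1;
    NEXT;
halt:
    return tos;
#undef NEXT
}
```

A program such as `{ OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_ONE_PLUS, OP_HALT }` pushes two literals, adds them and increments the result.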
What is the 4th code for ROT with -O2 and -O3?
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-01-24 16:47 +0000 |
| Subject | Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) |
| Message-ID | <2026Jan24.174716@mips.complang.tuwien.ac.at> |
| In reply to | #134532 |
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Hans Bezemer <the.beez.speaks@gmail.com> writes:
>>I've done my thing, compiled 4tH with optimizations -O3 till -O0.
>>I thought, let's make this simple and execute ALL benchmarks I got. Some
>>of them have become useless, though for the simple reason hardware has
>>become that much better.
>>
>>But still, here it is. Overall, the performance consistently
>>deteriorates, aka -O3 gives the best performance.
>
>Which compiler and which hardware?
>
>For a random program, I would expect higher optimization levels to
>produce faster code. For a Forth system and these recent gccs, the
>auto-vectorization of adjacent memory accesses may lead to similar
>problems as in the C bubble-sort benchmark. In Gforth, this actually
>happens unless we disable vectorization (which we normally do), and,
>moreover, with the vectorized code, gcc introduces additional
>inefficiencies (see below).
>
>Here's the output of ./gforth-fast onebench.fs compiled from the
>current development version with gcc-12.2 and running on a Ryzen 5800X
>(numbers are times, lower is better):
>
> sieve bubble matrix fib fft gcc options
> 0.025 0.023 0.013 0.033 0.016 -O2
> 0.025 0.023 0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
> 0.404 0.418 0.377 0.472 0.244 -O3 (with auto vectorization)
> 0.145 0.122 0.124 0.122 0.073 gforth default, using --no-dynamic
I have now also tried it with gcc-14.2, and that produces better code.
Results from a Xeon E-2388G (Rocket Lake):
sieve  bubble  matrix  fib    fft    gcc options
0.032  0.032   0.015   0.037  0.014  -O2
0.035  0.032   0.015   0.037  0.014  -O3 -fno-tree-vectorize (gforth default)
0.033  0.034   0.016   0.032  0.014  -O3 (with auto vectorization)
The code for ROT and 2SWAP does not use auto-vectorization, and the
code for 2! uses auto-vectorization in a way that reduces the
instruction count:
-O3 (auto-vectorized)      -O3 -fno-tree-vectorize
add $0x8,%rbx              add $0x8,%rbx
movq 0x8(%r13),%xmm0       mov 0x10(%r13),%rax
add $0x18,%r13             mov 0x8(%r13),%rdx
movhps -0x8(%r13),%xmm0    add $0x18,%r13
movups %xmm0,(%r8)         mov %rdx,(%r8)
mov 0x0(%r13),%r8          mov %rax,0x8(%r8)
mov (%rbx),%rax            mov 0x0(%r13),%r8
jmp *%rax                  mov (%rbx),%rax
                           jmp *%rax
And the common tail with all these move instructions is gone.
- anton
| From | peter <peter.noreply@tin.it> |
|---|---|
| Date | 2026-01-25 23:31 +0100 |
| Subject | Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) |
| Message-ID | <20260125233110.000034b4@tin.it> |
| In reply to | #134533 |
On Sat, 24 Jan 2026 16:47:16 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> >Hans Bezemer <the.beez.speaks@gmail.com> writes:
> >>I've done my thing, compiled 4tH with optimizations -O3 till -O0.
> >>I thought, let's make this simple and execute ALL benchmarks I got. Some
> >>of them have become useless, though for the simple reason hardware has
> >>become that much better.
> >>
> >>But still, here it is. Overall, the performance consistently
> >>deteriorates, aka -O3 gives the best performance.
> >
> >Which compiler and which hardware?
> >
> >For a random program, I would expect higher optimization levels to
> >produce faster code. For a Forth system and these recent gccs, the
> >auto-vectorization of adjacent memory accesses may lead to similar
> >problems as in the C bubble-sort benchmark. In Gforth, this actually
> >happens unless we disable vectorization (which we normally do), and,
> >moreover, with the vectorized code, gcc introduces additional
> >inefficiencies (see below).
> >
> >Here's the output of ./gforth-fast onebench.fs compiled from the
> >current development version with gcc-12.2 and running on a Ryzen 5800X
> >(numbers are times, lower is better):
> >
> > sieve bubble matrix fib fft gcc options
> > 0.025 0.023 0.013 0.033 0.016 -O2
> > 0.025 0.023 0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
> > 0.404 0.418 0.377 0.472 0.244 -O3 (with auto vectorization)
> > 0.145 0.122 0.124 0.122 0.073 gforth default, using --no-dynamic
>
> I have now also tried it with gcc-14.2, and that produces better code.
> Results from a Xeon E-2388G (Rocket Lake):
>
> sieve bubble matrix fib fft gcc options
> 0.032 0.032 0.015 0.037 0.014 -O2
> 0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
> 0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)
>
> The code for ROT and 2SWAP does not use auto-vectorization, and the
> code for 2! uses auto-vectorization in a way that reduces the
> instruction count:
>
> -O3 (auto-vectorized) -O3 -fno-tree-vectorize
> add $0x8,%rbx add $0x8,%rbx
> movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
> add $0x18,%r13 mov 0x8(%r13),%rdx
> movhps -0x8(%r13),%xmm0 add $0x18,%r13
> movups %xmm0,(%r8) mov %rdx,(%r8)
> mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
> mov (%rbx),%rax mov 0x0(%r13),%r8
> jmp *%rax mov (%rbx),%rax
> jmp *%rax
>
> And the common tail with all these move instructions is gone.
>
> - anton
What does your C code look like? I could not get clang or gcc to auto-vectorize
with my existing code
UNS64 *tmp64 = (UNS64*)TOP;
tmp64[0] = sp[0];
tmp64[1] = sp[1];
TOP = sp[2];
sp += 3;
In the end I changed my code to tell the compiler that it is a vector with
typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
and
*(v2u64*)TOP = *(v2u64*)sp;
TOP=sp[2];
sp=sp+3;
this will produce
vmovups xmm0, xmmword ptr [rdx]
vmovups xmmword ptr [r8], xmm0
mov r8, qword ptr [rdx + 16]
add rdx, 24
movzx r9d, byte ptr [rcx] // nesting code
inc rcx
jmp qword ptr [rax + 8*r9]
But also using memcpy((UNS64*)TOP, (UNS64*)sp,16); gives the same code!
Looks like it is working also in ARM64
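The two approaches mentioned above can be put side by side in a small sketch. The function names `copy2_vector` and `copy2_memcpy` are made up for this example, and `uint64_t` stands in for `UNS64`. One deliberate difference from the typedef in the post: the sketch adds the `may_alias` attribute, on the assumption that the cells are otherwise accessed as plain `uint64_t`, so that the vector-typed access does not run afoul of strict-aliasing rules. The `aligned(8)` part lowers the alignment the compiler may assume for the 16-byte vector type to 8 bytes.

```c
#include <stdint.h>
#include <string.h>

/* 16-byte vector of two 64-bit lanes that may sit at any 8-byte-aligned
   address. may_alias is an addition made for this sketch. */
typedef uint64_t v2u64
    __attribute__((vector_size(16), aligned(8), may_alias));

/* Copy two adjacent cells with one explicit vector load and store. */
void copy2_vector(uint64_t *dst, const uint64_t *src)
{
    *(v2u64 *)dst = *(const v2u64 *)src;
}

/* Same effect with a fixed-size memcpy, which compilers typically
   expand inline; no vector type is needed. */
void copy2_memcpy(uint64_t *dst, const uint64_t *src)
{
    memcpy(dst, src, 2 * sizeof *dst);
}
```

Whether both functions compile to the same instructions depends on the compiler, version and target, so it is worth checking the generated code rather than assuming it.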
BR
Peter
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-01-26 19:24 +0000 |
| Subject | Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) |
| Message-ID | <2026Jan26.202443@mips.complang.tuwien.ac.at> |
| In reply to | #134534 |
peter <peter.noreply@tin.it> writes:
>On Sat, 24 Jan 2026 16:47:16 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> I have now also tried it with gcc-14.2, and that produces better code.
>> Results from a Xeon E-2388G (Rocket Lake):
>>
>> sieve bubble matrix fib fft gcc options
>> 0.032 0.032 0.015 0.037 0.014 -O2
>> 0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
>> 0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)
>>
>> The code for ROT and 2SWAP does not use auto-vectorization, and the
>> code for 2! uses auto-vectorization in a way that reduces the
>> instruction count:
>>
>> -O3 (auto-vectorized) -O3 -fno-tree-vectorize
>> add $0x8,%rbx add $0x8,%rbx
>> movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
>> add $0x18,%r13 mov 0x8(%r13),%rdx
>> movhps -0x8(%r13),%xmm0 add $0x18,%r13
>> movups %xmm0,(%r8) mov %rdx,(%r8)
>> mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
>> mov (%rbx),%rax mov 0x0(%r13),%r8
>> jmp *%rax mov (%rbx),%rax
>> jmp *%rax
>>
>> And the common tail with all these move instructions is gone.
>>
>> - anton
>
>What does your C code look like? I could not get clang or gcc to auto-vectorize
>with my existing code
>
> UNS64 *tmp64 = (UNS64*)TOP;
> tmp64[0] = sp[0];
> tmp64[1] = sp[1];
> TOP = sp[2];
> sp += 3;
Gforth's source code for 2! is:
2! ( w1 w2 a_addr -- ) core two_store
""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell.""
a_addr[0] = w2;
a_addr[1] = w1;
A generator produces the following from that, which is passed to gcc:
LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1 */
/* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */
NAME("2!")
ip += 1;
LABEL1(two_store)
{
DEF_CA
MAYBE_UNUSED Cell w1;
MAYBE_UNUSED Cell w2;
MAYBE_UNUSED Cell * a_addr;
NEXT_P0;
vm_Cell2w(sp[2],w1);
vm_Cell2w(sp[1],w2);
vm_Cell2a_(spTOS,a_addr);
#ifdef VM_DEBUG
if (vm_debug) {
fputs(" w1=", vm_out); printarg_w(w1);
fputs(" w2=", vm_out); printarg_w(w2);
fputs(" a_addr=", vm_out); printarg_a_(a_addr);
}
#endif
sp += 3;
{
#line 1815 "prim"
a_addr[0] = w2;
a_addr[1] = w1;
#line 10136 "prim-fast.i"
}
#ifdef VM_DEBUG
if (vm_debug) {
fputs(" -- ", vm_out); fputc('\n', vm_out);
}
#endif
NEXT_P1;
spTOS = sp[0];
LABEL2(two_store)
NAME1("l2-two_store")
NEXT_P1_5;
LABEL3(two_store)
NAME1("l3-two_store")
DO_GOTO;
}
There are a lot of macros in this code, and I fear that expanding them
makes the code even less readable, but the essence for the
auto-vectorized part is something like:
w1 = sp[2];
w2 = sp[1];
a_addr = spTOS;
sp += 3;
a_addr[0] = w2;
a_addr[1] = w1;
spTOS = sp[0];
My guess is that in your code the compiler expected that sp[1] might
alias with tmp64[0], and therefore did not vectorize the loads and the
stores, whereas in the Gforth code, the loads both happen first, and
then the two stores, and gcc can vectorize that. I doubt that there
is a big benefit from that, though.
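The aliasing guess can be checked with a small experiment. The sketch below uses hypothetical function names and `uint64_t` in place of `UNS64` or `Cell`; it shows the two orderings and why a compiler cannot freely treat them as equivalent: when the destination overlaps the source, the interleaved form and the loads-first form produce different results, so merging the interleaved accesses into one wide load and one wide store would change behaviour unless the compiler can prove that the pointers do not overlap.

```c
#include <stdint.h>

/* Interleaved form: the second load comes after the first store, so if
   dst[0] and src[1] are the same cell, the second load must observe the
   value just stored. */
void store2_interleaved(uint64_t *dst, const uint64_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
}

/* Loads-first form: both loads complete before either store, so the two
   loads and the two stores can each be combined without any aliasing
   check. */
void store2_loads_first(uint64_t *dst, const uint64_t *src)
{
    uint64_t d0 = src[0];
    uint64_t d1 = src[1];
    dst[0] = d0;
    dst[1] = d1;
}
```

With a three-element array {1, 2, 3} and `dst` pointing one element past `src`, the interleaved form leaves {1, 1, 1} and the loads-first form leaves {1, 1, 2}. For non-overlapping arguments the two forms give the same result.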
>typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
I'll have to remember the aligned attribute for future games with gcc
explicit vectorization.
- anton
| From | peter <peter.noreply@tin.it> |
|---|---|
| Date | 2026-01-27 15:44 +0100 |
| Subject | Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) |
| Message-ID | <20260127154455.00000f73@tin.it> |
| In reply to | #134536 |
On Mon, 26 Jan 2026 19:24:43 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> peter <peter.noreply@tin.it> writes:
> >On Sat, 24 Jan 2026 16:47:16 GMT
> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >> I have now also tried it with gcc-14.2, and that produces better code.
> >> Results from a Xeon E-2388G (Rocket Lake):
> >>
> >> sieve bubble matrix fib fft gcc options
> >> 0.032 0.032 0.015 0.037 0.014 -O2
> >> 0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
> >> 0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)
> >>
> >> The code for ROT and 2SWAP does not use auto-vectorization, and the
> >> code for 2! uses auto-vectorization in a way that reduces the
> >> instruction count:
> >>
> >> -O3 (auto-vectorized) -O3 -fno-tree-vectorize
> >> add $0x8,%rbx add $0x8,%rbx
> >> movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
> >> add $0x18,%r13 mov 0x8(%r13),%rdx
> >> movhps -0x8(%r13),%xmm0 add $0x18,%r13
> >> movups %xmm0,(%r8) mov %rdx,(%r8)
> >> mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
> >> mov (%rbx),%rax mov 0x0(%r13),%r8
> >> jmp *%rax mov (%rbx),%rax
> >> jmp *%rax
> >>
> >> And the common tail with all these move instructions is gone.
> >>
> >> - anton
> >
> >What does your C code look like? I could not get clang or gcc to auto-vectorize
> >with my existing code
> >
> > UNS64 *tmp64 = (UNS64*)TOP;
> > tmp64[0] = sp[0];
> > tmp64[1] = sp[1];
> > TOP = sp[2];
> > sp += 3;
>
> Gforth's source code for 2! is:
>
> 2! ( w1 w2 a_addr -- ) core two_store
> ""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell.""
> a_addr[0] = w2;
> a_addr[1] = w1;
>
> A generator produces the following from that, which is passed to gcc:
>
> LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1 */
> /* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */
> NAME("2!")
> ip += 1;
> LABEL1(two_store)
> {
> DEF_CA
> MAYBE_UNUSED Cell w1;
> MAYBE_UNUSED Cell w2;
> MAYBE_UNUSED Cell * a_addr;
> NEXT_P0;
> vm_Cell2w(sp[2],w1);
> vm_Cell2w(sp[1],w2);
> vm_Cell2a_(spTOS,a_addr);
> #ifdef VM_DEBUG
> if (vm_debug) {
> fputs(" w1=", vm_out); printarg_w(w1);
> fputs(" w2=", vm_out); printarg_w(w2);
> fputs(" a_addr=", vm_out); printarg_a_(a_addr);
> }
> #endif
> sp += 3;
> {
> #line 1815 "prim"
> a_addr[0] = w2;
> a_addr[1] = w1;
> #line 10136 "prim-fast.i"
> }
>
> #ifdef VM_DEBUG
> if (vm_debug) {
> fputs(" -- ", vm_out); fputc('\n', vm_out);
> }
> #endif
> NEXT_P1;
> spTOS = sp[0];
> LABEL2(two_store)
> NAME1("l2-two_store")
> NEXT_P1_5;
> LABEL3(two_store)
> NAME1("l3-two_store")
> DO_GOTO;
> }
>
> There are a lot of macros in this code, and I fear that expanding them
> makes the code even less readable, but the essence for the
> auto-vectorized part is something like:
>
> w1 = sp[2];
> w2 = sp[1];
> a_addr = spTOS;
> sp += 3;
> a_addr[0] = w2;
> a_addr[1] = w1;
> spTOS = sp[0];
>
> My guess is that in your code the compiler expected that sp[1] might
> alias with tmp64[0], and therefore did not vectorize the loads and the
> stores, whereas in the Gforth code, the loads both happen first, and
> then the two stores, and gcc can vectorize that. I doubt that there
> is a big benefit from that, though.
Yes, that was it. Changing to:
UNS64 *tmp64 = (UNS64*)TOP;
UNS64 d0=sp[0];
UNS64 d1=sp[1];
tmp64[0] = d0;
tmp64[1] = d1;
TOP = sp[2];
sp += 3;
made the compiler (clang-21 in this case) generate the expected code
>
> >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
>
> I'll have to remember the aligned attribute for future games with gcc
> explicit vectorization.
Without that it will generate opcodes that need 16-byte alignment
BR
Peter
> - anton
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-01-29 18:27 +0000 |
| Subject | Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) |
| Message-ID | <2026Jan29.192712@mips.complang.tuwien.ac.at> |
| In reply to | #134538 |
peter <peter.noreply@tin.it> writes:
>On Mon, 26 Jan 2026 19:24:43 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> peter <peter.noreply@tin.it> writes:
>> >On Sat, 24 Jan 2026 16:47:16 GMT
>> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> >> The code for ROT and 2SWAP does not use auto-vectorization, and the
>> >> code for 2! uses auto-vectorization in a way that reduces the
>> >> instruction count:
>> >>
>> >> -O3 (auto-vectorized)        -O3 -fno-tree-vectorize
>> >> add    $0x8,%rbx             add    $0x8,%rbx
>> >> movq   0x8(%r13),%xmm0       mov    0x10(%r13),%rax
>> >> add    $0x18,%r13            mov    0x8(%r13),%rdx
>> >> movhps -0x8(%r13),%xmm0      add    $0x18,%r13
>> >> movups %xmm0,(%r8)           mov    %rdx,(%r8)
>> >> mov    0x0(%r13),%r8         mov    %rax,0x8(%r8)
>> >> mov    (%rbx),%rax           mov    0x0(%r13),%r8
>> >> jmp    *%rax                 mov    (%rbx),%rax
>> >>                              jmp    *%rax
...
> UNS64 *tmp64 = (UNS64*)TOP;
> UNS64 d0=sp[0];
> UNS64 d1=sp[1];
> tmp64[0] = d0;
> tmp64[1] = d1;
> TOP = sp[2];
> sp += 3;
>
>made the compiler (clang-21 in this case) generate the expected code
The auto-vectorized implementation of 2! above should perform ok,
because it loads each stack item separately, and the wide movups is
only used for the stores. If there is a wide load from the stack
involved, I expect a significant slowdown, because the stack items
usually have been stored recently, and narrow-store-to-wide-load
forwarding is a slow path on recent (and presumably also older) CPU
cores: https://www.complang.tuwien.ac.at/anton/stwlf/
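The access pattern described above can be sketched as follows. This only illustrates the pattern; it measures nothing, and when everything is inlined an optimizing compiler may forward the values in registers itself. The helper names are hypothetical, and the vector type relies on the GCC/Clang `vector_size` extension.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t v2u64 __attribute__((vector_size(16)));

/* Two narrow (8-byte) stores, as a primitive pushing two cells would do. */
static void store_two_cells(uint64_t *sp, uint64_t lo, uint64_t hi)
{
    sp[0] = lo;
    sp[1] = hi;
}

/* One wide (16-byte) load covering both recently stored cells: the
 * narrow-store-to-wide-load case that is expected to be slow. */
static v2u64 load_wide(const uint64_t *sp)
{
    v2u64 v;
    memcpy(&v, sp, sizeof v);
    return v;
}

/* Two narrow loads, each matching one earlier store in size and address:
 * the pattern the auto-vectorized 2! keeps for its stack reads. */
static void load_narrow(const uint64_t *sp, uint64_t *lo, uint64_t *hi)
{
    *lo = sp[0];
    *hi = sp[1];
}
```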
>> >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
>>
>> I'll have to remember the aligned attribute for future games with gcc
>> explicit vectorization.
>
>Without that it will generate the opcodes that needs 16 byte alignment
Yes. Until now I worked around that by using memcpy to a vector
variable, but this approach is much more convenient.
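The two approaches mentioned here might look roughly like this with the GCC/Clang vector extension. The type name `v2u64_a8` and both function names are made up for the sketch, and the `may_alias` attribute is an addition of this sketch (it is not in the typedef quoted above) so that the pointer cast from `uint64_t *` is safe.

```c
#include <stdint.h>
#include <string.h>

/* Plain 16-byte vector type: dereferencing a pointer to it lets the
 * compiler assume 16-byte alignment. */
typedef uint64_t v2u64 __attribute__((vector_size(16)));

/* Same vector, but declared as only 8-byte aligned. */
typedef uint64_t v2u64_a8
    __attribute__((vector_size(16), aligned(8), may_alias));

/* Workaround: copy the bytes into a vector variable with memcpy. */
static v2u64 load_via_memcpy(const uint64_t *p)
{
    v2u64 v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Direct dereference through the reduced-alignment type. */
static v2u64 load_via_aligned8(const uint64_t *p)
{
    return *(const v2u64_a8 *)p;
}
```

Both functions are meant to accept an address that is 8-byte but not 16-byte aligned, such as an odd cell index on a 16-byte-aligned stack.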
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
| From | albert@spenarnc.xs4all.nl |
|---|---|
| Date | 2026-01-30 13:20 +0100 |
| Subject | Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) |
| Message-ID | <nnd$1f48c6e8$48795598@3884e8505482cce2> |
| In reply to | #134542 |
In article <2026Jan29.192712@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>peter <peter.noreply@tin.it> writes:
>>On Mon, 26 Jan 2026 19:24:43 GMT
>>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>
>>> peter <peter.noreply@tin.it> writes:
>>> >On Sat, 24 Jan 2026 16:47:16 GMT
>>> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>> >> The code for ROT and 2SWAP does not use auto-vectorization, and the
>>> >> code for 2! uses auto-vectorization in a way that reduces the
>>> >> instruction count:
>>> >>
>>> >> -O3 (auto-vectorized)        -O3 -fno-tree-vectorize
>>> >> add    $0x8,%rbx             add    $0x8,%rbx
>>> >> movq   0x8(%r13),%xmm0       mov    0x10(%r13),%rax
>>> >> add    $0x18,%r13            mov    0x8(%r13),%rdx
>>> >> movhps -0x8(%r13),%xmm0      add    $0x18,%r13
>>> >> movups %xmm0,(%r8)           mov    %rdx,(%r8)
>>> >> mov    0x0(%r13),%r8         mov    %rax,0x8(%r8)
>>> >> mov    (%rbx),%rax           mov    0x0(%r13),%r8
>>> >> jmp    *%rax                 mov    (%rbx),%rax
>>> >>                              jmp    *%rax
>...
>> UNS64 *tmp64 = (UNS64*)TOP;
>> UNS64 d0=sp[0];
>> UNS64 d1=sp[1];
>> tmp64[0] = d0;
>> tmp64[1] = d1;
>> TOP = sp[2];
>> sp += 3;
>>
>>made the compiler (clang-21 in this case) generate the expected code
>
>The auto-vectorized implementation of 2! above should perform ok,
>because it loads each stack item separately, and the wide movups is
>only used for the stores. If there is a wide load from the stack
>involved, I expect a significant slowdown, because the stack items
>usually have been stored recently, and narrow-store-to-wide-load
>forwarding is a slow path on recent (and presumably also older) CPU
>cores: https://www.complang.tuwien.ac.at/anton/stwlf/
>
>>> >typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
>>>
>>> I'll have to remember the aligned attribute for future games with gcc
>>> explicit vectorization.
>>
>>Without that it will generate the opcodes that needs 16 byte alignment
>
>Yes. Until now I worked around that by using memcpy to a vector
>variable, but this approach is much more convenient.

I always wonder, is this relevant to the industrial applications of
gforth or gforth based programs that are sold commercially?

>
>- anton
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-01-30 18:00 +0000 |
| Subject | Re: C compiler optimization and Forth engines (was: EuroForth 2025 ...) |
| Message-ID | <2026Jan30.190031@mips.complang.tuwien.ac.at> |
| In reply to | #134544 |
albert@spenarnc.xs4all.nl writes:
>I always wonder, is this relevant to the industrial applications of
>gforth or gforth based programs that are sold commercially?
Are there any Gforth-based programs that are sold commercially?
Concerning industrial applications, the only ones I know about have to
do with Open Firmware, and I doubt that those care much about the
performance of Gforth. But there are probably industrial applications
(maybe even commercial programs) that use Gforth that I do not know
about. If one of the IBM users had not contacted us when he left the
group, I would not know about the application of Gforth within IBM.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2026-01-16 23:10 -0800 |
| Message-ID | <87wm1gpvdr.fsf@nightsong.com> |
| In reply to | #134511 |
Hans Bezemer <the.beez.speaks@gmail.com> writes:
> 5. I added GCC extension support to 4tH in version 3.62.0. At the
> time, it improved performance by about 25%. By accident I found out
> that was no longer true. switch() based was faster. I didn't know
> there had been changes in that regard to GCC.

If you mean the goto *a feature, these days you might try using tail
calls instead. GCC and LLVM both now support a musttail attribute that
ensures this optimization, or signals a compile-time error if it can't.

https://lwn.net/Articles/1033373/
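As a rough sketch of the tail-call style of dispatch mentioned above: each primitive is an ordinary C function that ends by calling the handler of the next instruction. The instruction layout, the handler names, and the `MUSTTAIL` macro are all assumptions for illustration and are not taken from 4tH or Gforth. On compilers that do not report support for the attribute, the macro falls back to a plain call, which is then no longer guaranteed to be a tail call.

```c
/* Use the attribute only where the compiler reports support for it. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL /* fallback: ordinary call, tail call not guaranteed */
#endif

typedef struct Insn Insn;
typedef long (*Handler)(const Insn *ip, long *sp);
struct Insn { Handler fn; long arg; };

/* Push the literal operand, then jump to the next instruction's handler. */
static long op_lit(const Insn *ip, long *sp)
{
    *++sp = ip->arg;
    ip++;
    MUSTTAIL return ip->fn(ip, sp);
}

/* Add the two topmost stack items, then jump to the next handler. */
static long op_add(const Insn *ip, long *sp)
{
    sp[-1] += sp[0];
    sp--;
    ip++;
    MUSTTAIL return ip->fn(ip, sp);
}

/* Stop interpreting and return the top of stack. */
static long op_halt(const Insn *ip, long *sp)
{
    (void)ip;
    return sp[0];
}

/* Entry point: stack[0] is unused, sp always points at the top item. */
static long run_tail(const Insn *prog)
{
    long stack[16];
    return prog->fn(prog, stack);
}
```

All handlers share one signature, which keeps every dispatch a call between functions of the same type. A program such as `{op_lit, 2}, {op_lit, 3}, {op_add, 0}, {op_halt, 0}` would be run with `run_tail(prog)`.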
| From | Hans Bezemer <the.beez.speaks@gmail.com> |
|---|---|
| Date | 2026-01-17 16:58 +0100 |
| Message-ID | <nnd$4c8a9957$56c8fc09@4ad2500852ea2034> |
| In reply to | #134513 |
On 17-01-2026 08:10, Paul Rubin wrote:
> Hans Bezemer <the.beez.speaks@gmail.com> writes:
>> 5. I added GCC extension support to 4tH in version 3.62.0. At the
>> time, it improved performance by about 25%. By accident I found out
>> that was no longer true. switch() based was faster. I didn't know
>> there had been changes in that regard to GCC.
>
> If you mean the goto *a feature, these days you might try using tail
> calls instead. GCC and LLVM both now support a musttail attribute that
> ensures this optimization, or signals a compile-time error if it can't.
>
> https://lwn.net/Articles/1033373/

Thanks for the article! But contrary to the Python interpreter, you
could (thanks to some preprocessor magic) select how 4tH's VM would be
compiled with NO changes to the source code whatsoever. That's why it
could be reversed so easily by accident.

The tail call method however, requires an entirely different VM. That's
a lot of work for about 10% performance improvement - that may not even
last for a single GCC update. And requires two VM's to maintain..

So, I have to contemplate this carefully before putting work in it. But
it's nice to know that I was not crazy noticing this ;-) And learning
about a new GCC technique. :)

Hans Bezemer
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2026-01-17 20:21 -0800 |
| Message-ID | <87sec3pn3r.fsf@nightsong.com> |
| In reply to | #134514 |
Hans Bezemer <the.beez.speaks@gmail.com> writes:
> The tail call method however, requires an entirely different
> VM. That's a lot of work for about 10% performance improvement - that
> may not even last for a single GCC update. And requires two VM's to
> maintain..

You'd have to change the VM but on the other hand, it's a documented and
supported feature of both GCC and Clang, and other compilers might get
it too. I wouldn't worry about it vanishing with the next GCC update.
| From | Hans Bezemer <the.beez.speaks@gmail.com> |
|---|---|
| Date | 2026-01-18 15:26 +0100 |
| Message-ID | <nnd$7abbb4a6$4c11959f@4809eb2c0ea40e3a> |
| In reply to | #134515 |
On 18-01-2026 05:21, Paul Rubin wrote:
> Hans Bezemer <the.beez.speaks@gmail.com> writes:
>> The tail call method however, requires an entirely different
>> VM. That's a lot of work for about 10% performance improvement - that
>> may not even last for a single GCC update. And requires two VM's to
>> maintain..
>
> You'd have to change the VM but on the other hand, it's a documented and
> supported feature of both GCC and Clang, and other compilers might get
> it too. I wouldn't worry about it vanishing with the next GCC update.

Well, the "goto" feature hasn't disappeared as well. It's just been
nullified. Rendered useless. That's what I mean. And again? 10%? Really?

Hans Bezemer
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-01-18 22:17 +0000 |
| Message-ID | <2026Jan18.231745@mips.complang.tuwien.ac.at> |
| In reply to | #134514 |
Hans Bezemer <the.beez.speaks@gmail.com> writes:
>On 17-01-2026 08:10, Paul Rubin wrote:
>> Hans Bezemer <the.beez.speaks@gmail.com> writes:
>>> 5. I added GCC extension support to 4tH in version 3.62.0. At the
>>> time, it improved performance by about 25%. By accident I found out
>>> that was no longer true. switch() based was faster. I didn't know
>>> there had been changes in that regard to GCC.
You would have to look at the generated code. Which gcc version did
you use? Certainly in my results on
<http://www.complang.tuwien.ac.at/forth/threading/> switch usually is
slower than direct or indirect threaded code.
>The tail call method however, requires an entirely different VM.
It's also just a question of defining some macros appropriately.
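A minimal sketch of what selecting the dispatch method through macros can look like, here for switch versus GNU C labels-as-values only (a tail-call variant is not shown). The opcode set, the macro names `CASE` and `NEXT`, and the `FORCE_SWITCH` override are hypothetical and are not taken from 4tH or Gforth.

```c
enum { OP_LIT, OP_ADD, OP_HALT };

/* Pick the dispatch method at compile time; FORCE_SWITCH is a
 * hypothetical override for comparing the two builds. */
#if defined(__GNUC__) && !defined(FORCE_SWITCH)
#  define USE_LABELS 1
#endif

static long run_dispatch(const long *code)
{
    long stack[16];
    long *sp = stack;           /* stack[0] is unused; sp points at the top item */
    const long *ip = code;
#ifdef USE_LABELS
    /* Labels as values: the table order must match the opcode enum. */
    static void *const labels[] = { &&do_lit, &&do_add, &&do_halt };
#  define CASE(name, op) do_##name:
#  define NEXT()         goto *labels[*ip++]
    NEXT();
#else
#  define CASE(name, op) case op:
#  define NEXT()         continue
    for (;;) switch (*ip++) {
#endif
    /* The primitive bodies are written once and shared by both builds. */
    CASE(lit, OP_LIT)   *++sp = *ip++;          NEXT();
    CASE(add, OP_ADD)   sp[-1] += sp[0]; sp--;  NEXT();
    CASE(halt, OP_HALT) return sp[0];
#ifndef USE_LABELS
    }
#endif
}
```

Only the two macro definitions and the surrounding scaffolding change between the builds, so switching back and forth needs no edits to the primitives themselves.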
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/
| From | albert@spenarnc.xs4all.nl |
|---|---|
| Date | 2026-01-18 16:34 +0100 |
| Message-ID | <nnd$613a150b$00989354@a56038f66d0e5c37> |
| In reply to | #134513 |
In article <87wm1gpvdr.fsf@nightsong.com>,
Paul Rubin <no.email@nospam.invalid> wrote:
>Hans Bezemer <the.beez.speaks@gmail.com> writes:
>> 5. I added GCC extension support to 4tH in version 3.62.0. At the
>> time, it improved performance by about 25%. By accident I found out
>> that was no longer true. switch() based was faster. I didn't know
>> there had been changes in that regard to GCC.
>
>If you mean the goto *a feature, these days you might try using tail
>calls instead. GCC and LLVM both now support a musttail attribute that
>ensures this optimization, or signals a compile-time error if it can't.
>
>https://lwn.net/Articles/1033373/

If you pass an address a as a tail call is it approximately equal
to coroutines:

: HEX: R> BASE @ >R >R HEX CO R> BASE ! ;

Used for example as

: .H HEX: . ;

In this case the tail call is `` R> BASE ! '' to restore the base?

Groetjes Albert
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2026-01-20 00:35 -0800 |
| Message-ID | <87bjioptpk.fsf@nightsong.com> |
| In reply to | #134517 |
albert@spenarnc.xs4all.nl writes:
> If you pass an address a as a tail call is it approximately equal
> to coroutines:

No I don't think so. The tail call is just a jump to that address
(changes the program counter). A coroutine jump also has to change the
stack pointer. See the section "Knuth's coroutines" here:

https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

Some Forths have a CO primitive that I think is similar. There is
something like it on the Greenarrays processor.