Path: csiph.com!eternal-september.org!feeder.eternal-september.org!nntp.eternal-september.org!.POSTED.host-87-14-254-103.retail.telecomitalia.it!not-for-mail
From: peter <peter.noreply@tin.it>
Newsgroups: comp.lang.forth
Subject: Re: locals
Date: Mon, 27 Apr 2026 09:31:03 +0200
Organization: A noiseless patient Spider
Message-ID: <20260427093103.00002db9@tin.it>
References: <10qunhm$1nnbt$1@dont-email.me> <2026Apr25.064712@mips.complang.tuwien.ac.at> <87o6j74l1z.fsf@nightsong.com> <2026Apr25.084323@mips.complang.tuwien.ac.at> <nnd$22b8fb28$086616c8@3c6198f3bcb67197> <2026Apr25.122216@mips.complang.tuwien.ac.at> <20260425160747.00007f4a@tin.it> <2026Apr26.160303@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="host-87-14-254-103.retail.telecomitalia.it:87.14.254.103"; logging-data="2272362"; mail-complaints-to="abuse@eternal-september.org"
X-Newsreader: Claws Mail 4.3.0 (GTK 3.24.42; x86_64-w64-mingw32)
Xref: csiph.com comp.lang.forth:134997

On Sun, 26 Apr 2026 14:03:03 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> peter <peter.noreply@tin.it> writes:
> >I recently reviewed the string comparison for search-wordlist
> >and came up with the following
> >
> >The string stored in the word header is already uppercased.
> >So string comparison will be case insensitive
> >
> >: UC ( c -- c' )    \ uppercase char
> >    dup $61 $7B within  $20 and -  ;
> >
> >
> >: NCOMP4  ( addr n  addr' n'  - f)	\  0 is match 
> >    dup >r
> >    begin
> >        rot = while             \ str cstr
> >        r> dup 1- >r
> >            while               \ str cstr
> >            swap count uc       \ cstr str' s1
> >            rot count           \ str' s1 cstr' c1
> >    repeat   
> >        2drop r> drop 0 exit
> >    then
> >        2drop r> drop 1 ;
> >
> >First iteration in the loop it does not compare chars but the length!
> 
> Clever, but, at least without comment, too clever.
> 
> This code, and, more clearly, Hans Bezemers version demonstrate that
> STR= is easier than COMPARE, STRCMP, or STR<, because you can deal
> with the case of length difference right at the start, whereas the
> latter words have to check the characters up to the end of the shorter
> string first before dealing with the length.  This shows the greatest
> benefit in cases like
> 
> s" 0123456789abcdefg" s" 0123456789abcdefgh" strcmp
> 
> As for STRCMP, I have measured the five versions shown in my earlier
> posting (whole program posted below), with the bugs fixed, and the
> ?DUP IF replaced by DUP IF ... THEN DROP, because it produces better
> code.
> 
> I have also included the following versions:
> 
> : strcmp { addr1 u1 addr2 u2 -- n }
>  u1 u2 min 0
>  ?do
>    addr1 c@ addr2 c@ - ?dup
>    if
>      unloop exit
>    then
>    addr1 char+ TO addr1
>    addr2 char+ TO addr2
>  loop
>  u1 u2 - ;
> 
> This comes from the '94 paper and is the version that uses TO instead
> of defining new locals at every iteration.  Paul Rubin will love the
> code that current Gforth produces for "addr2 char+ TO addr2":
> 
> <strcmp+$E0>  @local2    1->2 
>    $7F337DA71BBA: mov    0x10(%rbp),%r15
> <strcmp+$E8>  char+    2->2 
>    $7F337DA71BBE: add    $0x1,%r15
> <strcmp+$F0>  !local2    2->1 
>    $7F337DA71BC2: mov    %r15,0x10(%rbp)
> 
> The TO <local> code was not that efficient in earlier Gforth versions.
> 
> The other version I added is:
> 
> : strcmp ( addr1 u1 addr2 u2 -- n )
>     rot 2dup 2>r min 0 ?do ( addr1 addr2 )
>         over c@ over c@ - dup if
>             nip nip 2rdrop unloop exit then
>         drop
>         char+ swap char+ swap
>     loop
>     2drop r> r> - ;
> 
> This is the STRCMP3 from <2024Apr9.175958@mips.complang.tuwien.ac.at>
> and may be the locals-less version I compared against in the '94
> paper.
> 
> I also included your version (without the UC call) and Hans Bezemer's
> version.
> 
> I benchmarked two Forth systems, gforth-fast and gforth-itc.
> gforth-itc uses indirect-threaded code and should perform similar to
> the "simple interpreters" that Paul Rubin had in mind.
> 
> I ran three different benchmarks on these words, which performed the
> following a number of times:
> 
> s" 0123456789abcdefg" 2dup strcmp drop \ bench1
> s" 0123456789abcdefg" s" 2123456789abcdefg" strcmp drop \ bench2
> s" 0123456789abcdefg" s" 0123456789abcdefgh" strcmp drop \bench3
> 
> In bench1 the strings are equal and everything has to be compared.  In
> bench2 the strings have the same length, but differ in the first char,
> so the loop can terminate after the first char.  In bench3 the strings
> have different length, but all chars that both strings have are the
> same.  In the latter case versionpeter and versionbezemer have an
> advantage from not performing the same functionality.
> 
> The cycles numbers are per invocation of STRCMP, including benchmark overhead.
> 
> The benchmarks are run on a Ryzen 8700G (Zen4)>
> 
> In addition to the cycles, I also show the bytes of the native code of
> the whole word in gforth-fast on AMD64 (without the final jmp (2
> Bytes)), and of the loop (including the code for the if...then).
> 
>    Bytes   | cycles gforth-fast  |  cycles gforth-itc   |
> strcmp loop|bench1 bench2 bench3 | bench1 bench2 bench3 |
>   262  127 | 109.5   16.6  109.4 | 1732.7  147.4 1724.5 | version0
>   303  151 | 164.2   17.2  164.4 | 1714.1  170.4 1613.5 | version1
>   257  122 | 105.3   17.4  105.1 | 1496.7  166.4 1493.0 | version2
>   280  113 |  98.6   19.2   99.0 | 1230.1  194.4 1116.2 | version3
>   267  118 |  91.2   17.9   91.2 | 1268.6  198.4 1269.0 | version4
>   273  108 |  89.9   17.0   90.0 | 1136.0  178.4 1138.9 | version5
>   261  128 | 121.1   14.6  118.5 | 1221.4  131.3 1213.3 | version6
>   210  142 | 137.5   15.4    9.5 | 1244.4  155.3   78.3 | versionpeter
>   260  119 | 107.8   16.4    9.8 | 1186.2  134.5   71.3 | versionbezemer
> 
> So the champion among the full-featured strcmps for bench1 and bench3
> is version5, for bench2 version6.  The str= variants are much faster
> for bench3 (of course), but slower than several other versions for
> bench1 and slower than version6 for bench2.  The native code size is
> smallest for version2 (among the full-featured strcmp
> implementations), so the locals-less versions do not win everything.
> 
> So locals-less (version5 and version6) is somewhat faster on both
> gforth-fast and gforth-itc.
> 
> lxf has a more efficient locals implementation.  Let's see how it
> fares.  It does not support the usage in version1, so I leave that
> away.
> 
>      cycles lxf 
> bench1 bench2 bench3
>   79.9   12.0   79.9 version0
>   99.6   12.0   99.6 version2
>   98.8   14.1   98.1 version3
>   86.0   13.2   86.0 version4
>   84.1   12.6   84.2 version5
>   88.7   10.0   92.8 version6
>   98.3   10.0    6.0 versionpeter
>   72.1    9.5    6.0 versionbezemer
> 
> On lxf version0 (with locals) is the fastest for bench1 and bench3,
> and version6 is the fastest for bench2.  Hans Bezemers version wins
> everything if we are only interested in str= functionality.

Anton, thanks for running all these tests.
I have now also run them on my Ryzen 9950X.
There is an error in version 6 that i corrected.
2rdrop needs to be after unloop. On lxf64 that uses registers for
loop parameters this is necessary!
version3 does not run as lxf64 does not support defining locals
several times. I will see if this can be changed.

I needed also to change the log-fd to 5 to get it to run.
The tests are run with Debian under WSL2.

Here are the results

lxf64
  59.1   10.0   57.6 version0
  48.1   10.0   48.4 version2
  43.0   10.7   42.5 version4
  42.2    9.1   42.2 version5
  55.1    9.0   55.0 version6
  65.7    8.0    6.0 versionpeter
  32.8    9.0    4.2 versionbezemer

lxf
  64.2    8.5   64.2 version0
 112.3   10.2   90.1 version2
  78.8   10.6   75.6 version4
  88.1    9.4   88.2 version5
 112.2    7.5  114.7 version6
  71.0    8.2    7.4 versionpeter
  50.9    8.3    4.3 versionbezemer

There is a significant impact in having loop parameters in registers!
version 2 and 6 are interesting for lxf. The full stat gives some more
info. Sorry for the long lines
version 2 compared to version 0

Peter@R9950WSL:/mnt/d/Dev/forth/lxf32v17$ perf stat ./lxf "create version2 100000000 constant iterations include strcmp.4th bench1 bye"


 Performance counter stats for './lxf create version2 100000000 constant iterations include strcmp.4th bench1 bye':

          1,955.50 msec task-clock:u                     #    0.998 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                64      page-faults:u                    #   32.728 /sec
    10,973,742,845      cycles:u                         #    5.612 GHz
       966,332,718      stalled-cycles-frontend:u        #    8.81% frontend cycles idle
    34,901,611,693      instructions:u                   #    3.18  insn per cycle
                                                  #    0.03  stalled cycles per insn
     3,900,350,964      branches:u                       #    1.995 G/sec
            36,727      branch-misses:u                  #    0.00% of all branches

       1.960183288 seconds time elapsed

       1.955783000 seconds user
       0.000000000 seconds sys


peter@R9950WSL:/mnt/d/Dev/forth/lxf32v17$ perf stat ./lxf "create version0 100000000 constant iterations include strcmp.4th bench1 bye"


 Performance counter stats for './lxf create version0 100000000 constant iterations include strcmp.4th bench1 bye':

          1,158.97 msec task-clock:u                     #    0.996 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                64      page-faults:u                    #   55.221 /sec
     6,415,119,211      cycles:u                         #    5.535 GHz
         4,510,117      stalled-cycles-frontend:u        #    0.07% frontend cycles idle
    38,301,605,801      instructions:u                   #    5.97  insn per cycle
                                                  #    0.00  stalled cycles per insn
     3,900,348,894      branches:u                       #    3.365 G/sec
            19,563      branch-misses:u                  #    0.00% of all branches

       1.163667408 seconds time elapsed

       1.151174000 seconds user
       0.007966000 seconds sys

BR
Peter

> 
> And here's the code (measurement scripts at the bottom):
> ----------------------------------------------------------
> [defined] version0 [if]
> : strcmp {: addr1 u1 addr2 u2 -- n :}
>  u1 u2 min 0
>  ?do
>    addr1 c@ addr2 c@ - dup
>    if
>      unloop exit
>    then
>    drop
>    addr1 char+ TO addr1
>    addr2 char+ TO addr2
>  loop
>  u1 u2 - ;
> [then]
> 
> [defined] version1 [if]
> : strcmp {: addr1 u1 addr2 u2 -- n :}
>  addr1 addr2
>  u1 u2 min 0 
>  ?do {: s1 s2 :}
>    s1 c@ s2 c@ - dup 
>    if
>      unloop exit
>    then
>    drop s1 char+ s2 char+
>  loop
>  2drop
>  u1 u2 - ;
> [then]
> 
> [defined] version2 [if]
> : strcmp {: addr1 u1 addr2 u2 -- n :}
>  u1 u2 min 0 
>  ?do
>    addr1 i + c@ addr2 i + c@ - dup 
>    if
>      unloop exit
>    then
>    drop
>  loop
>  u1 u2 - ;
> [then]
> 
> [defined] version3 [if]
> : strcmp {: addr1 u1 addr2 u2 -- n :}
>  addr2 addr1 - {: offset :}
>  u1 u2 min addr1 + addr1 ?do
>    i c@ i offset + c@ - dup 
>    if
>      unloop exit
>    then
>    drop
>  loop
>  u1 u2 - ;
> [then]
> 
> [defined] version4 [if]
> : strcmp {: addr1 u1 addr2 u2 -- n :}
>  addr2 addr1 - ( offset )
>  u1 u2 min addr1 + addr1 ?do ( offset )
>    dup i + c@ i c@ - dup 
>    if
>      nip negate unloop exit
>    then
>    drop
>  loop
>  drop u1 u2 - ;
> [then]
> 
> [defined] version5 [if]
> : strcmp ( addr1 u1 addr2 u2 -- n )
>  rot 2dup - >r ( addr1 addr2 u1 u2 R: n1 )
>  min -rot over - ( u12 addr1 offset R: n1 )
>  swap rot bounds ( offset limit start R: n1 )
>  ?do ( offset R: n1 loop-sys )
>    dup i + c@ i c@ - dup 
>    if
>      nip negate unloop r> drop exit
>    then
>    drop
>  loop
>  drop r> negate ;
> [then]
> 
> [defined] version6 [if]
> [undefined] 2rdrop [if]
> : 2rdrop postpone 2r> postpone 2drop ; immediate
> [then]
> 
> : strcmp ( addr1 u1 addr2 u2 -- n )
>     rot 2dup 2>r min 0 ?do ( addr1 addr2 )
>         over c@ over c@ - dup if
>             nip nip 2rdrop unloop exit then
>         drop
>         char+ swap char+ swap
>     loop
>     2drop r> r> - ;
> [then]
> 
> [defined] versionpeter [if]
>     \ from <20260425160747.00007f4a@tin.it>
>     \ renamed and deleted the call to UC
> : strcmp  ( addr n  addr' n'  - f)	\  0 is match 
>     dup >r
>     begin
>         rot = while             \ str cstr
>         r> dup 1- >r
>             while               \ str cstr
>             swap count          \ cstr str' s1
>             rot count           \ str' s1 cstr' c1
>     repeat   
>         2drop r> drop 0 exit
>     then
>         2drop r> drop 1 ;
> [then]
> 
> [defined] versionbezemer [if]
>     \ from <nnd$548d4f1b$1e104571@905dda44db1f54ae>
>     \ renamed
> : strcmp
>    rot over - if drop 2drop true exit then
>    0 ?do
>      over i chars + c@ over i chars + c@ -
>      if drop drop unloop true exit then
>    loop drop drop false
> ;
> [then]
> 
> [defined] t{ [if]
>     t{ s" abc" s" abc" strcmp -> 0 }t
>     t{ s" abc" s" abcd" strcmp -> -1 }t
>     t{ s" abc" s" abd" strcmp -> -1 }t
>     t{ s" abd" s" abc" strcmp -> 1 }t
>     t{ s" cbc" s" abc" strcmp -> 2 }t
>     t{ s" abc" s" adc" strcmp -> -2 }t
> [then]
> 
> \ Benchmarks
> 
> [undefined] iterations [if]
>     100000000 constant iterations
> [then]
> 
> : benchmark ( c-addr1 u1 c-addr2 u2 -- )
>     iterations 0 do
>         2over 2over strcmp drop
>     loop
>     2drop 2drop ;
> 
> : bench1
>     s" 0123456789abcdefg" 2dup benchmark ;
> 
> : bench2
>     s" 0123456789abcdefg" s" 2123456789abcdefg" benchmark ;
> 
> : bench3
>     s" 0123456789abcdefg" s" 0123456789abcdefgh" benchmark ;
>     
> 
> 0 [if]
> # bash script for producing the cycles
> IFS=":"
> for i in 0 1 2 3 4 5 6 peter bezemer; do
>     for forthit in gforth-fast:100000000 gforth-itc:10000000; do
>         fields=($forthit); forth="${fields[0]}"; iterations="${fields[1]}"
>         for bench in 1 2 3; do
>             perf stat --log-fd 3 -x, -e cycles:u $forth -e "create version$i $iterations constant iterations" ~/forth/strcmp.4th -e "bench$bench bye" 3>&1 >/dev/null|
>                  awk -F, '{printf "%6.1f ",$1/'$iterations'}'
>          done
>      done
>      echo version$i
> done
> IFS=":"
> for i in 0 2 3 4 5 6 peter bezemer; do
>     forth=lxf; iterations=100000000
>     for bench in 1 2 3; do
>         perf stat --log-fd 3 -x, -e cycles:u $forth "create version$i $iterations constant iterations include $HOME/forth/strcmp.4th bench$bench bye" 3>&1 >/dev/null|
>              awk -F, '{printf "%6.1f ",$1/'$iterations'}'
>      done
>      echo version$i
> done
> [then]
> --------------------------------------------------------------
> 
> - anton