Path: csiph.com!eternal-september.org!feeder.eternal-september.org!nntp.eternal-september.org!.POSTED.host-87-14-254-103.retail.telecomitalia.it!not-for-mail From: peter Newsgroups: comp.lang.forth Subject: Re: locals Date: Mon, 27 Apr 2026 09:31:03 +0200 Organization: A noiseless patient Spider Message-ID: <20260427093103.00002db9@tin.it> References: <10qunhm$1nnbt$1@dont-email.me> <2026Apr25.064712@mips.complang.tuwien.ac.at> <87o6j74l1z.fsf@nightsong.com> <2026Apr25.084323@mips.complang.tuwien.ac.at> <2026Apr25.122216@mips.complang.tuwien.ac.at> <20260425160747.00007f4a@tin.it> <2026Apr26.160303@mips.complang.tuwien.ac.at> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Injection-Info: dont-email.me; posting-host="host-87-14-254-103.retail.telecomitalia.it:87.14.254.103"; logging-data="2272362"; mail-complaints-to="abuse@eternal-september.org" X-Newsreader: Claws Mail 4.3.0 (GTK 3.24.42; x86_64-w64-mingw32) Xref: csiph.com comp.lang.forth:134997 On Sun, 26 Apr 2026 14:03:03 GMT anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote: > peter writes: > >I recently reviewed the string comparison for search-wordlist > >and came up with the following > > > >The string stored in the word header is already uppercased. > >So string comparison will be case insensitive > > > >: UC ( c -- c' ) \ uppercase char > > dup $61 $7B within $20 and - ; > > > > > >: NCOMP4 ( addr n addr' n' - f) \ 0 is match > > dup >r > > begin > > rot = while \ str cstr > > r> dup 1- >r > > while \ str cstr > > swap count uc \ cstr str' s1 > > rot count \ str' s1 cstr' c1 > > repeat > > 2drop r> drop 0 exit > > then > > 2drop r> drop 1 ; > > > >First iteration in the loop it does not compare chars but the length! > > Clever, but, at least without comment, too clever. > > This code, and, more clearly, Hans Bezemers version demonstrate that > STR= is easier than COMPARE, STRCMP, or STR<, because you can deal > with the case of length difference right at the start, whereas the > latter words have to check the characters up to the end of the shorter > string first before dealing with the length. This shows the greatest > benefit in cases like > > s" 0123456789abcdefg" s" 0123456789abcdefgh" strcmp > > As for STRCMP, I have measured the five versions shown in my earlier > posting (whole program posted below), with the bugs fixed, and the > ?DUP IF replaced by DUP IF ... THEN DROP, because it produces better > code. > > I have also included the following versions: > > : strcmp { addr1 u1 addr2 u2 -- n } > u1 u2 min 0 > ?do > addr1 c@ addr2 c@ - ?dup > if > unloop exit > then > addr1 char+ TO addr1 > addr2 char+ TO addr2 > loop > u1 u2 - ; > > This comes from the '94 paper and is the version that uses TO instead > of defining new locals at every iteration. Paul Rubin will love the > code that current Gforth produces for "addr2 char+ TO addr2": > > @local2 1->2 > $7F337DA71BBA: mov 0x10(%rbp),%r15 > char+ 2->2 > $7F337DA71BBE: add $0x1,%r15 > !local2 2->1 > $7F337DA71BC2: mov %r15,0x10(%rbp) > > The TO code was not that efficient in earlier Gforth versions. > > The other version I added is: > > : strcmp ( addr1 u1 addr2 u2 -- n ) > rot 2dup 2>r min 0 ?do ( addr1 addr2 ) > over c@ over c@ - dup if > nip nip 2rdrop unloop exit then > drop > char+ swap char+ swap > loop > 2drop r> r> - ; > > This is the STRCMP3 from <2024Apr9.175958@mips.complang.tuwien.ac.at> > and may be the locals-less version I compared against in the '94 > paper. > > I also included your version (without the UC call) and Hans Bezemer's > version. > > I benchmarked two Forth systems, gforth-fast and gforth-itc. > gforth-itc uses indirect-threaded code and should perform similar to > the "simple interpreters" that Paul Rubin had in mind. > > I ran three different benchmarks on these words, which performed the > following a number of times: > > s" 0123456789abcdefg" 2dup strcmp drop \ bench1 > s" 0123456789abcdefg" s" 2123456789abcdefg" strcmp drop \ bench2 > s" 0123456789abcdefg" s" 0123456789abcdefgh" strcmp drop \bench3 > > In bench1 the strings are equal and everything has to be compared. In > bench2 the strings have the same length, but differ in the first char, > so the loop can terminate after the first char. In bench3 the strings > have different length, but all chars that both strings have are the > same. In the latter case versionpeter and versionbezemer have an > advantage from not performing the same functionality. > > The cycles numbers are per invocation of STRCMP, including benchmark overhead. > > The benchmarks are run on a Ryzen 8700G (Zen4)> > > In addition to the cycles, I also show the bytes of the native code of > the whole word in gforth-fast on AMD64 (without the final jmp (2 > Bytes)), and of the loop (including the code for the if...then). > > Bytes | cycles gforth-fast | cycles gforth-itc | > strcmp loop|bench1 bench2 bench3 | bench1 bench2 bench3 | > 262 127 | 109.5 16.6 109.4 | 1732.7 147.4 1724.5 | version0 > 303 151 | 164.2 17.2 164.4 | 1714.1 170.4 1613.5 | version1 > 257 122 | 105.3 17.4 105.1 | 1496.7 166.4 1493.0 | version2 > 280 113 | 98.6 19.2 99.0 | 1230.1 194.4 1116.2 | version3 > 267 118 | 91.2 17.9 91.2 | 1268.6 198.4 1269.0 | version4 > 273 108 | 89.9 17.0 90.0 | 1136.0 178.4 1138.9 | version5 > 261 128 | 121.1 14.6 118.5 | 1221.4 131.3 1213.3 | version6 > 210 142 | 137.5 15.4 9.5 | 1244.4 155.3 78.3 | versionpeter > 260 119 | 107.8 16.4 9.8 | 1186.2 134.5 71.3 | versionbezemer > > So the champion among the full-featured strcmps for bench1 and bench3 > is version5, for bench2 version6. The str= variants are much faster > for bench3 (of course), but slower than several other versions for > bench1 and slower than version6 for bench2. The native code size is > smallest for version2 (among the full-featured strcmp > implementations), so the locals-less versions do not win everything. > > So locals-less (version5 and version6) is somewhat faster on both > gforth-fast and gforth-itc. > > lxf has a more efficient locals implementation. Let's see how it > fares. It does not support the usage in version1, so I leave that > away. > > cycles lxf > bench1 bench2 bench3 > 79.9 12.0 79.9 version0 > 99.6 12.0 99.6 version2 > 98.8 14.1 98.1 version3 > 86.0 13.2 86.0 version4 > 84.1 12.6 84.2 version5 > 88.7 10.0 92.8 version6 > 98.3 10.0 6.0 versionpeter > 72.1 9.5 6.0 versionbezemer > > On lxf version0 (with locals) is the fastest for bench1 and bench3, > and version6 is the fastest for bench2. Hans Bezemers version wins > everything if we are only interested in str= functionality. Anton, thanks for running all these tests. I have now also run them on my Ryzen 9950X. There is an error in version 6 that i corrected. 2rdrop needs to be after unloop. On lxf64 that uses registers for loop parameters this is necessary! version3 does not run as lxf64 does not support defining locals several times. I will see if this can be changed. I needed also to change the log-fd to 5 to get it to run. The tests are run with Debian under WSL2. Here are the results lxf64 59.1 10.0 57.6 version0 48.1 10.0 48.4 version2 43.0 10.7 42.5 version4 42.2 9.1 42.2 version5 55.1 9.0 55.0 version6 65.7 8.0 6.0 versionpeter 32.8 9.0 4.2 versionbezemer lxf 64.2 8.5 64.2 version0 112.3 10.2 90.1 version2 78.8 10.6 75.6 version4 88.1 9.4 88.2 version5 112.2 7.5 114.7 version6 71.0 8.2 7.4 versionpeter 50.9 8.3 4.3 versionbezemer There is a significant impact in having loop parameters in registers! version 2 and 6 are interesting for lxf. The full stat gives some more info. Sorry for the long lines version 2 compared to version 0 Peter@R9950WSL:/mnt/d/Dev/forth/lxf32v17$ perf stat ./lxf "create version2 100000000 constant iterations include strcmp.4th bench1 bye" Performance counter stats for './lxf create version2 100000000 constant iterations include strcmp.4th bench1 bye': 1,955.50 msec task-clock:u # 0.998 CPUs utilized 0 context-switches:u # 0.000 /sec 0 cpu-migrations:u # 0.000 /sec 64 page-faults:u # 32.728 /sec 10,973,742,845 cycles:u # 5.612 GHz 966,332,718 stalled-cycles-frontend:u # 8.81% frontend cycles idle 34,901,611,693 instructions:u # 3.18 insn per cycle # 0.03 stalled cycles per insn 3,900,350,964 branches:u # 1.995 G/sec 36,727 branch-misses:u # 0.00% of all branches 1.960183288 seconds time elapsed 1.955783000 seconds user 0.000000000 seconds sys peter@R9950WSL:/mnt/d/Dev/forth/lxf32v17$ perf stat ./lxf "create version0 100000000 constant iterations include strcmp.4th bench1 bye" Performance counter stats for './lxf create version0 100000000 constant iterations include strcmp.4th bench1 bye': 1,158.97 msec task-clock:u # 0.996 CPUs utilized 0 context-switches:u # 0.000 /sec 0 cpu-migrations:u # 0.000 /sec 64 page-faults:u # 55.221 /sec 6,415,119,211 cycles:u # 5.535 GHz 4,510,117 stalled-cycles-frontend:u # 0.07% frontend cycles idle 38,301,605,801 instructions:u # 5.97 insn per cycle # 0.00 stalled cycles per insn 3,900,348,894 branches:u # 3.365 G/sec 19,563 branch-misses:u # 0.00% of all branches 1.163667408 seconds time elapsed 1.151174000 seconds user 0.007966000 seconds sys BR Peter > > And here's the code (measurement scripts at the bottom): > ---------------------------------------------------------- > [defined] version0 [if] > : strcmp {: addr1 u1 addr2 u2 -- n :} > u1 u2 min 0 > ?do > addr1 c@ addr2 c@ - dup > if > unloop exit > then > drop > addr1 char+ TO addr1 > addr2 char+ TO addr2 > loop > u1 u2 - ; > [then] > > [defined] version1 [if] > : strcmp {: addr1 u1 addr2 u2 -- n :} > addr1 addr2 > u1 u2 min 0 > ?do {: s1 s2 :} > s1 c@ s2 c@ - dup > if > unloop exit > then > drop s1 char+ s2 char+ > loop > 2drop > u1 u2 - ; > [then] > > [defined] version2 [if] > : strcmp {: addr1 u1 addr2 u2 -- n :} > u1 u2 min 0 > ?do > addr1 i + c@ addr2 i + c@ - dup > if > unloop exit > then > drop > loop > u1 u2 - ; > [then] > > [defined] version3 [if] > : strcmp {: addr1 u1 addr2 u2 -- n :} > addr2 addr1 - {: offset :} > u1 u2 min addr1 + addr1 ?do > i c@ i offset + c@ - dup > if > unloop exit > then > drop > loop > u1 u2 - ; > [then] > > [defined] version4 [if] > : strcmp {: addr1 u1 addr2 u2 -- n :} > addr2 addr1 - ( offset ) > u1 u2 min addr1 + addr1 ?do ( offset ) > dup i + c@ i c@ - dup > if > nip negate unloop exit > then > drop > loop > drop u1 u2 - ; > [then] > > [defined] version5 [if] > : strcmp ( addr1 u1 addr2 u2 -- n ) > rot 2dup - >r ( addr1 addr2 u1 u2 R: n1 ) > min -rot over - ( u12 addr1 offset R: n1 ) > swap rot bounds ( offset limit start R: n1 ) > ?do ( offset R: n1 loop-sys ) > dup i + c@ i c@ - dup > if > nip negate unloop r> drop exit > then > drop > loop > drop r> negate ; > [then] > > [defined] version6 [if] > [undefined] 2rdrop [if] > : 2rdrop postpone 2r> postpone 2drop ; immediate > [then] > > : strcmp ( addr1 u1 addr2 u2 -- n ) > rot 2dup 2>r min 0 ?do ( addr1 addr2 ) > over c@ over c@ - dup if > nip nip 2rdrop unloop exit then > drop > char+ swap char+ swap > loop > 2drop r> r> - ; > [then] > > [defined] versionpeter [if] > \ from <20260425160747.00007f4a@tin.it> > \ renamed and deleted the call to UC > : strcmp ( addr n addr' n' - f) \ 0 is match > dup >r > begin > rot = while \ str cstr > r> dup 1- >r > while \ str cstr > swap count \ cstr str' s1 > rot count \ str' s1 cstr' c1 > repeat > 2drop r> drop 0 exit > then > 2drop r> drop 1 ; > [then] > > [defined] versionbezemer [if] > \ from > \ renamed > : strcmp > rot over - if drop 2drop true exit then > 0 ?do > over i chars + c@ over i chars + c@ - > if drop drop unloop true exit then > loop drop drop false > ; > [then] > > [defined] t{ [if] > t{ s" abc" s" abc" strcmp -> 0 }t > t{ s" abc" s" abcd" strcmp -> -1 }t > t{ s" abc" s" abd" strcmp -> -1 }t > t{ s" abd" s" abc" strcmp -> 1 }t > t{ s" cbc" s" abc" strcmp -> 2 }t > t{ s" abc" s" adc" strcmp -> -2 }t > [then] > > \ Benchmarks > > [undefined] iterations [if] > 100000000 constant iterations > [then] > > : benchmark ( c-addr1 u1 c-addr2 u2 -- ) > iterations 0 do > 2over 2over strcmp drop > loop > 2drop 2drop ; > > : bench1 > s" 0123456789abcdefg" 2dup benchmark ; > > : bench2 > s" 0123456789abcdefg" s" 2123456789abcdefg" benchmark ; > > : bench3 > s" 0123456789abcdefg" s" 0123456789abcdefgh" benchmark ; > > > 0 [if] > # bash script for producing the cycles > IFS=":" > for i in 0 1 2 3 4 5 6 peter bezemer; do > for forthit in gforth-fast:100000000 gforth-itc:10000000; do > fields=($forthit); forth="${fields[0]}"; iterations="${fields[1]}" > for bench in 1 2 3; do > perf stat --log-fd 3 -x, -e cycles:u $forth -e "create version$i $iterations constant iterations" ~/forth/strcmp.4th -e "bench$bench bye" 3>&1 >/dev/null| > awk -F, '{printf "%6.1f ",$1/'$iterations'}' > done > done > echo version$i > done > IFS=":" > for i in 0 2 3 4 5 6 peter bezemer; do > forth=lxf; iterations=100000000 > for bench in 1 2 3; do > perf stat --log-fd 3 -x, -e cycles:u $forth "create version$i $iterations constant iterations include $HOME/forth/strcmp.4th bench$bench bye" 3>&1 >/dev/null| > awk -F, '{printf "%6.1f ",$1/'$iterations'}' > done > echo version$i > done > [then] > -------------------------------------------------------------- > > - anton