Groups > comp.lang.forth > #134221 > unrolled thread

Generating a random sequence of Forth words

Started by	anton@mips.complang.tuwien.ac.at (Anton Ertl)
First post	2025-09-30 16:33 +0000
Last post	2025-10-24 15:55 +0000
Articles	15 — 6 participants

Back to article view | Back to comp.lang.forth

  Generating a random sequence of Forth words anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-09-30 16:33 +0000
    Re: Generating a random sequence of Forth words minforth <minforth@gmx.net> - 2025-10-01 11:20 +0200
      Re: Generating a random sequence of Forth words anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-01 17:10 +0000
    Re: Generating a random sequence of Forth words Hans Bezemer <the.beez.speaks@gmail.com> - 2025-10-01 17:11 +0200
      Re: Generating a random sequence of Forth words minforth <minforth@gmx.net> - 2025-10-01 20:42 +0200
        Re: Generating a random sequence of Forth words dxf <dxforth@gmail.com> - 2025-10-02 19:49 +1000
          Re: Generating a random sequence of Forth words albert@spenarnc.xs4all.nl - 2025-10-02 13:07 +0200
            Re: Generating a random sequence of Forth words dxf <dxforth@gmail.com> - 2025-10-03 18:22 +1000
          3dup again (was: Generating a random sequence of Forth words) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-02 20:44 +0000
            Re: 3dup again (was: Generating a random sequence of Forth words) albert@spenarnc.xs4all.nl - 2025-10-03 11:02 +0200
              Re: 3dup again minforth <minforth@gmx.net> - 2025-10-03 11:09 +0200
                Re: 3dup again anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-04 08:04 +0000
            Re: 3dup again Hans Bezemer <the.beez.speaks@gmail.com> - 2025-10-05 11:29 +0200
    Re: Generating a random sequence of Forth words antispam@fricas.org (Waldek Hebisch) - 2025-10-15 19:19 +0000
      Re: Generating a random sequence of Forth words anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-24 15:55 +0000

#134221 — Generating a random sequence of Forth words

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2025-09-30 16:33 +0000
Subject	Generating a random sequence of Forth words
Message-ID	<2025Sep30.183350@mips.complang.tuwien.ac.at>

I wanted to generate a random sequence consisting of a number, NOOP,
and DROP.  When EVALUATEing the sequence, starting with an empty
stack, it should end with an empty stack, and should not underflow the
stack.

I implemented this twice.  The first attempt just generated a random
word/number in the normal case, and a specific kind of word/number in
the boundary cases:

: genseq {: u -- :}
    0 u 0 do
        case 
            dup 0= ?of 1+ ." 9 " endof
            dup i + u u>= ?of 1- ." drop " endof
            dup i + 1+ u u>= ?of ." noop " endof
            rnd 3 um* nip 1- dup case
                -1 of ." drop " endof
                0  of ." noop " endof
                1  of ." 8 " endof
            endcase
            +
            0 endcase
    loop
    drop ;

When I do 16 GENSEQ with seed 1234, I get the following sequence:

9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

The requirements are distributed across several places in this word,
which I did not like.  I also did not like the non-randomness of the
sequence of DROPs at the end, and the non-randomness of the 9 whenever
the stack is empty.  I also found that I needed to add a clause for
when the stack depth is one less than the number of remaining words.

So I thought about other ways to do it.

Eventually I came up with the following approach: Starts with counters
for the different kinds of words/numbers.  Whenever one word/number is
generated, its counter is reduced.  At each step the probability of
generating a kind of word/number is proportional to the remaining
counter for that word/number.  The same-depth requirement is
implemented by having the same initial counter for the number and
DROP.  The no-underflow requirement is implemented by skipping the
output of a DROP if the count of remaining DROPs = the count of the
remaining numbers.

Here's the code:

create counts 3 cells allot

: initcounts ( u -- )
    dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

: .counts ( -- )
    counts 3 cell array>mem mem+do
        i @ 4 .r
    loop ;

: countsth-- ( u -- )
    -1 counts rot th +! ;

: select1 ( u -- )
    \ u is the index of the word
    dup case
        0 of ." 9 "    countsth-- endof
        1 of ." noop " countsth-- endof
        2 of counts @ counts 2 th @ u< if
                ." drop " countsth--
            else
                drop endif
        endof
    endcase ;

: sum ( addr u -- u2 )
    \ addr u is an array with u elements, u2 is the sum of the elements
    0 -rot cell array>mem mem+do
        i @ +
    loop ;

: genseq ( u -- )
    initcounts
    begin
        counts 3 sum dup while
            rnd um* nip ( sum )
            3 0 u+do ( sum1 )
                counts i th @ 2dup u< if
                    i select1 then
                - loop
            drop
    repeat
    drop ;

The code is much longer, but I like it better.  The requirements are
concentrated in two places.

16 GENSEQ generates

9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

for seed 123.  A disadvantage of this approach is that it does not
generate a word/number on each iteration through the main loop.  I
have thought about how to achieve this, but the result would be quite
a bit more complex; it's also only a minor issue.  E.g., for
generating 1000 words/numbers with seed 123, the main loop has 1036
iterations.

The second approach takes quite a bit more time; for generating a
million words/numbers into a string on a Zen 4, 192_876_892 cycles
compared to 146_644_156 cycles.

The original goal was to produce a sequence where various parts of the
text interpreter have branch mispredictions; the less predictable the
sequence, the better.  In this respect the second sequence is
significantly better.  For a sequence of 1000 words/numbers,
text-interpreted 1000 times, the number of mispredictions is:

728_322 first approach
821_813 second approach

(with 247_125 mispredictions coming from (mainly) Gforth startup).

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/

[toc] | [next] | [standalone]

#134224

From	minforth <minforth@gmx.net>
Date	2025-10-01 11:20 +0200
Message-ID	<mk4a3oFovluU1@mid.individual.net>
In reply to	#134221

Am 30.09.2025 um 18:33 schrieb Anton Ertl:
> I wanted to generate a random sequence consisting of a number, NOOP,
> and DROP.  When EVALUATEing the sequence, starting with an empty
> stack, it should end with an empty stack, and should not underflow the
> stack.
> 
> I implemented this twice.  The first attempt just generated a random
> word/number in the normal case, and a specific kind of word/number in
> the boundary cases:
> 
> : genseq {: u -- :}
>      0 u 0 do
>          case
>              dup 0= ?of 1+ ." 9 " endof
>              dup i + u u>= ?of 1- ." drop " endof
>              dup i + 1+ u u>= ?of ." noop " endof
>              rnd 3 um* nip 1- dup case
>                  -1 of ." drop " endof
>                  0  of ." noop " endof
>                  1  of ." 8 " endof
>              endcase
>              +
>              0 endcase
>      loop
>      drop ;
> 
> When I do 16 GENSEQ with seed 1234, I get the following sequence:
> 
> 9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop
> 
> The requirements are distributed across several places in this word,
> which I did not like.  I also did not like the non-randomness of the
> sequence of DROPs at the end, and the non-randomness of the 9 whenever
> the stack is empty.  I also found that I needed to add a clause for
> when the stack depth is one less than the number of remaining words.
> 
> So I thought about other ways to do it.
> 
> Eventually I came up with the following approach: Starts with counters
> for the different kinds of words/numbers.  Whenever one word/number is
> generated, its counter is reduced.  At each step the probability of
> generating a kind of word/number is proportional to the remaining
> counter for that word/number.  The same-depth requirement is
> implemented by having the same initial counter for the number and
> DROP.  The no-underflow requirement is implemented by skipping the
> output of a DROP if the count of remaining DROPs = the count of the
> remaining numbers.
> 
> Here's the code:
> 
> create counts 3 cells allot
> 
> : initcounts ( u -- )
>      dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;
> 
> : .counts ( -- )
>      counts 3 cell array>mem mem+do
>          i @ 4 .r
>      loop ;
> 
> : countsth-- ( u -- )
>      -1 counts rot th +! ;
> 
> : select1 ( u -- )
>      \ u is the index of the word
>      dup case
>          0 of ." 9 "    countsth-- endof
>          1 of ." noop " countsth-- endof
>          2 of counts @ counts 2 th @ u< if
>                  ." drop " countsth--
>              else
>                  drop endif
>          endof
>      endcase ;
> 
> : sum ( addr u -- u2 )
>      \ addr u is an array with u elements, u2 is the sum of the elements
>      0 -rot cell array>mem mem+do
>          i @ +
>      loop ;
> 
> : genseq ( u -- )
>      initcounts
>      begin
>          counts 3 sum dup while
>              rnd um* nip ( sum )
>              3 0 u+do ( sum1 )
>                  counts i th @ 2dup u< if
>                      i select1 then
>                  - loop
>              drop
>      repeat
>      drop ;
> 
> The code is much longer, but I like it better.  The requirements are
> concentrated in two places.
> 
> 16 GENSEQ generates
> 
> 9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop
> 
> for seed 123.  A disadvantage of this approach is that it does not
> generate a word/number on each iteration through the main loop.  I
> have thought about how to achieve this, but the result would be quite
> a bit more complex; it's also only a minor issue.  E.g., for
> generating 1000 words/numbers with seed 123, the main loop has 1036
> iterations.
> 
> The second approach takes quite a bit more time; for generating a
> million words/numbers into a string on a Zen 4, 192_876_892 cycles
> compared to 146_644_156 cycles.
> 
> The original goal was to produce a sequence where various parts of the
> text interpreter have branch mispredictions; the less predictable the
> sequence, the better.  In this respect the second sequence is
> significantly better.  For a sequence of 1000 words/numbers,
> text-interpreted 1000 times, the number of mispredictions is:
> 
> 728_322 first approach
> 821_813 second approach
> 
> (with 247_125 mispredictions coming from (mainly) Gforth startup).
>

How about
1) quick generation of a suboptimal list
2) create an array of pointers to the list elements
3) random shuffle the array e.g.
https://en.wikipedia.org/wiki/Fisher-Yates_shuffle

[toc] | [prev] | [next] | [standalone]

#134229

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2025-10-01 17:10 +0000
Message-ID	<2025Oct1.191034@mips.complang.tuwien.ac.at>
In reply to	#134224

minforth <minforth@gmx.net> writes:
>Am 30.09.2025 um 18:33 schrieb Anton Ertl:
>> Eventually I came up with the following approach: Starts with counters
>> for the different kinds of words/numbers.  Whenever one word/number is
>> generated, its counter is reduced.  At each step the probability of
>> generating a kind of word/number is proportional to the remaining
>> counter for that word/number.  The same-depth requirement is
>> implemented by having the same initial counter for the number and
>> DROP.  The no-underflow requirement is implemented by skipping the
>> output of a DROP if the count of remaining DROPs = the count of the
>> remaining numbers.
>> 
>> Here's the code:
>> 
>> create counts 3 cells allot
>> 
>> : initcounts ( u -- )
>>      dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;
>> 
>> : .counts ( -- )
>>      counts 3 cell array>mem mem+do
>>          i @ 4 .r
>>      loop ;
>> 
>> : countsth-- ( u -- )
>>      -1 counts rot th +! ;
>> 
>> : select1 ( u -- )
>>      \ u is the index of the word
>>      dup case
>>          0 of ." 9 "    countsth-- endof
>>          1 of ." noop " countsth-- endof
>>          2 of counts @ counts 2 th @ u< if
>>                  ." drop " countsth--
>>              else
>>                  drop endif
>>          endof
>>      endcase ;
>> 
>> : sum ( addr u -- u2 )
>>      \ addr u is an array with u elements, u2 is the sum of the elements
>>      0 -rot cell array>mem mem+do
>>          i @ +
>>      loop ;
>> 
>> : genseq ( u -- )
>>      initcounts
>>      begin
>>          counts 3 sum dup while
>>              rnd um* nip ( sum )
>>              3 0 u+do ( sum1 )
>>                  counts i th @ 2dup u< if
>>                      i select1 then
>>                  - loop
>>              drop
>>      repeat
>>      drop ;
...
>How about
>1) quick generation of a suboptimal list
>2) create an array of pointers to the list elements
>3) random shuffle the array e.g.
>https://en.wikipedia.org/wiki/Fisher-Yates_shuffle

I thought about an approach based on this in between, with appropriate
changes to satisfy the requirements.  Then I continued onwards to the
second approach shown above.  The counters are a kind of compressed
form of "a suboptimal list".  Every time you pick one of the elements
of the suboptimal list, the counter for that kind of element is
lowered by 1.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/

[toc] | [prev] | [next] | [standalone]

#134227

From	Hans Bezemer <the.beez.speaks@gmail.com>
Date	2025-10-01 17:11 +0200
Message-ID	<nnd$60c544bd$6fdf6f8e@13070880e71750ef>
In reply to	#134221

I used something similar - but with a whole slew of stack operations. 
Very handy thingy. I use it until this day.

( abc -- abcabc)   >r over over r@ rot rot r> \ >r 2dup r@ -rot r>

Hans Bezemer

On 30-09-2025 18:33, Anton Ertl wrote:
> I wanted to generate a random sequence consisting of a number, NOOP,
> and DROP.  When EVALUATEing the sequence, starting with an empty
> stack, it should end with an empty stack, and should not underflow the
> stack.
> 
> I implemented this twice.  The first attempt just generated a random
> word/number in the normal case, and a specific kind of word/number in
> the boundary cases:
> 
> : genseq {: u -- :}
>      0 u 0 do
>          case
>              dup 0= ?of 1+ ." 9 " endof
>              dup i + u u>= ?of 1- ." drop " endof
>              dup i + 1+ u u>= ?of ." noop " endof
>              rnd 3 um* nip 1- dup case
>                  -1 of ." drop " endof
>                  0  of ." noop " endof
>                  1  of ." 8 " endof
>              endcase
>              +
>              0 endcase
>      loop
>      drop ;
> 
> When I do 16 GENSEQ with seed 1234, I get the following sequence:
> 
> 9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop
> 
> The requirements are distributed across several places in this word,
> which I did not like.  I also did not like the non-randomness of the
> sequence of DROPs at the end, and the non-randomness of the 9 whenever
> the stack is empty.  I also found that I needed to add a clause for
> when the stack depth is one less than the number of remaining words.
> 
> So I thought about other ways to do it.
> 
> Eventually I came up with the following approach: Starts with counters
> for the different kinds of words/numbers.  Whenever one word/number is
> generated, its counter is reduced.  At each step the probability of
> generating a kind of word/number is proportional to the remaining
> counter for that word/number.  The same-depth requirement is
> implemented by having the same initial counter for the number and
> DROP.  The no-underflow requirement is implemented by skipping the
> output of a DROP if the count of remaining DROPs = the count of the
> remaining numbers.
> 
> Here's the code:
> 
> create counts 3 cells allot
> 
> : initcounts ( u -- )
>      dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;
> 
> : .counts ( -- )
>      counts 3 cell array>mem mem+do
>          i @ 4 .r
>      loop ;
> 
> : countsth-- ( u -- )
>      -1 counts rot th +! ;
> 
> : select1 ( u -- )
>      \ u is the index of the word
>      dup case
>          0 of ." 9 "    countsth-- endof
>          1 of ." noop " countsth-- endof
>          2 of counts @ counts 2 th @ u< if
>                  ." drop " countsth--
>              else
>                  drop endif
>          endof
>      endcase ;
> 
> : sum ( addr u -- u2 )
>      \ addr u is an array with u elements, u2 is the sum of the elements
>      0 -rot cell array>mem mem+do
>          i @ +
>      loop ;
> 
> : genseq ( u -- )
>      initcounts
>      begin
>          counts 3 sum dup while
>              rnd um* nip ( sum )
>              3 0 u+do ( sum1 )
>                  counts i th @ 2dup u< if
>                      i select1 then
>                  - loop
>              drop
>      repeat
>      drop ;
> 
> The code is much longer, but I like it better.  The requirements are
> concentrated in two places.
> 
> 16 GENSEQ generates
> 
> 9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop
> 
> for seed 123.  A disadvantage of this approach is that it does not
> generate a word/number on each iteration through the main loop.  I
> have thought about how to achieve this, but the result would be quite
> a bit more complex; it's also only a minor issue.  E.g., for
> generating 1000 words/numbers with seed 123, the main loop has 1036
> iterations.
> 
> The second approach takes quite a bit more time; for generating a
> million words/numbers into a string on a Zen 4, 192_876_892 cycles
> compared to 146_644_156 cycles.
> 
> The original goal was to produce a sequence where various parts of the
> text interpreter have branch mispredictions; the less predictable the
> sequence, the better.  In this respect the second sequence is
> significantly better.  For a sequence of 1000 words/numbers,
> text-interpreted 1000 times, the number of mispredictions is:
> 
> 728_322 first approach
> 821_813 second approach
> 
> (with 247_125 mispredictions coming from (mainly) Gforth startup).
> 
> - anton

[toc] | [prev] | [next] | [standalone]

#134230

From	minforth <minforth@gmx.net>
Date	2025-10-01 20:42 +0200
Message-ID	<mk5b0uFubm8U1@mid.individual.net>
In reply to	#134227

Am 01.10.2025 um 17:11 schrieb Hans Bezemer:
> I used something similar - but with a whole slew of stack operations. 
> Very handy thingy. I use it until this day.
> 
> ( abc -- abcabc)   >r over over r@ rot rot r> \ >r 2dup r@ -rot r>
> 

Different and probably not as efficient as your code generator:

MinForth 3.6
# : 3DUP { a b c == a b c a b c } ;

[toc] | [prev] | [next] | [standalone]

#134231

From	dxf <dxforth@gmail.com>
Date	2025-10-02 19:49 +1000
Message-ID	<68de4aaa@news.ausics.net>
In reply to	#134230

On 2/10/2025 4:42 am, minforth wrote:
> Am 01.10.2025 um 17:11 schrieb Hans Bezemer:
>> I used something similar - but with a whole slew of stack operations. Very handy thingy. I use it until this day.
>>
>> ( abc -- abcabc)   >r over over r@ rot rot r> \ >r 2dup r@ -rot r>
>>
> 
> Different and probably not as efficient as your code generator:
> 
> MinForth 3.6
> # : 3DUP { a b c == a b c a b c } ;

For 3DUP I believe this is the one to beat:

: 3DUP ( a b c -- a b c a b c )  dup 2over rot ;

With NTF/LFX the locals version will break even.  For others, well, it may
be better not to look.  For a straight-forward example of 'stack juggling',
locals handle it rather poorly.

[toc] | [prev] | [next] | [standalone]

#134232

From	albert@spenarnc.xs4all.nl
Date	2025-10-02 13:07 +0200
Message-ID	<nnd$22ca1fe9$28f84452@be32e8d2a0a178a3>
In reply to	#134231

In article <68de4aaa@news.ausics.net>, dxf  <dxforth@gmail.com> wrote:
>On 2/10/2025 4:42 am, minforth wrote:
>> Am 01.10.2025 um 17:11 schrieb Hans Bezemer:
>>> I used something similar - but with a whole slew of stack operations.
>Very handy thingy. I use it until this day.
>>>
>>> ( abc -- abcabc)   >r over over r@ rot rot r> \ >r 2dup r@ -rot r>
>>>
>>
>> Different and probably not as efficient as your code generator:
>>
>> MinForth 3.6
>> # : 3DUP { a b c == a b c a b c } ;
>
>For 3DUP I believe this is the one to beat:
>
>: 3DUP ( a b c -- a b c a b c )  dup 2over rot ;

CODE 3DUP
    PUSH  RSP[3*CELL_SIZE]
    PUSH  RSP[3*CELL_SIZE]
    PUSH  RSP[3*CELL_SIZE]
    NEXT,

How does that look for 5DUP ?

>
>With NTF/LFX the locals version will break even.  For others, well, it may
>be better not to look.  For a straight-forward example of 'stack juggling',
>locals handle it rather poorly.
>
Groetjes Albert
-- 
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.

[toc] | [prev] | [next] | [standalone]

#134237

From	dxf <dxforth@gmail.com>
Date	2025-10-03 18:22 +1000
Message-ID	<68df87cd$1@news.ausics.net>
In reply to	#134232

On 2/10/2025 9:07 pm, albert@spenarnc.xs4all.nl wrote:
> In article <68de4aaa@news.ausics.net>, dxf  <dxforth@gmail.com> wrote:
>> On 2/10/2025 4:42 am, minforth wrote:
>>> Am 01.10.2025 um 17:11 schrieb Hans Bezemer:
>>>> I used something similar - but with a whole slew of stack operations.
>> Very handy thingy. I use it until this day.
>>>>
>>>> ( abc -- abcabc)   >r over over r@ rot rot r> \ >r 2dup r@ -rot r>
>>>>
>>>
>>> Different and probably not as efficient as your code generator:
>>>
>>> MinForth 3.6
>>> # : 3DUP { a b c == a b c a b c } ;
>>
>> For 3DUP I believe this is the one to beat:
>>
>> : 3DUP ( a b c -- a b c a b c )  dup 2over rot ;
> 
> CODE 3DUP
>     PUSH  RSP[3*CELL_SIZE]
>     PUSH  RSP[3*CELL_SIZE]
>     PUSH  RSP[3*CELL_SIZE]
>     NEXT,
> ...

That's equivalent to:

: 3DUP  2 PICK  2 PICK  2 PICK ;

I didn't notice but it's how VFX defines it.  OTOH  DUP 2OVER ROT
optimizes to the same machine code.  IIRC I tested them all on
DX-Forth and settled on the latter.  An 8086 code version would
beat it.  An 8080 code version would be more trouble than worth.

[toc] | [prev] | [next] | [standalone]

#134236 — 3dup again (was: Generating a random sequence of Forth words)

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2025-10-02 20:44 +0000
Subject	3dup again (was: Generating a random sequence of Forth words)
Message-ID	<2025Oct2.224440@mips.complang.tuwien.ac.at>
In reply to	#134231

dxf <dxforth@gmail.com> writes:
>For 3DUP I believe this is the one to beat:
>
>: 3DUP ( a b c -- a b c a b c )  dup 2over rot ;
>
>With NTF/LFX the locals version will break even.

As we already discussed in the thread including
<2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
same (optimal for the calling convention used by NTF/LXF) code for
3DUP versions using the data stack, return stack, or locals.  That's
because the actual data flow is always the same, and NTF/LXF can see
this data flow in all three cases.

>For others, well, it may
>be better not to look.  For a straight-forward example of 'stack juggling',
>locals handle it rather poorly.

Other Forth systems implement locals poorly.  LXF/NTF demonstrates
that this is not due to some natural law, however.

There have been some improvements in Gforth since that time.  Let's
see how the versions used in that thread look on today's gforth-fast.
Here are the versions of 3DUP:

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

And here's the gforth-fast code on AMD64:

3dup.1              3dup.2             3dup.3              3dup.4           
>r    1->0          third    1->2      >l >l 1->1          dup    1->1      
  mov -$08[r14],r13   mov r15,$10[r10] >l    1->1            mov [r10],r13  
  sub r14,$08       third    2->3        mov -$08[rbp],r13   sub r10,$08    
2dup    0->2          mov r9,$08[r10]    mov rdx,$08[r10]  2over    1->3    
  mov r13,$10[r10]  third    3->1        mov rax,rbp         mov r15,$18[r10
  mov r15,$08[r10]    mov [r10],r13      add r10,$10         mov r9,$10[r10]
i    2->3             sub r10,$18        lea rbp,-$10[rbp] rot    3->1      
  mov r9,[r14]        mov $10[r10],r15   mov -$10[rax],rdx   mov [r10],r15  
-rot    3->2          mov $08[r10],r9    mov r13,[r10]       sub r10,$10    
  mov [r10],r9      ;s    1->1         >l @local0 1->1       mov $08[r10],r9
  sub r10,$08         mov rbx,[r14]    @local0    1->1     ;s    1->1       
r>    2->1            add r14,$08        mov rax,rbp         mov rbx,[r14]  
  mov -$08[r10],r15   mov rax,[rbx]      lea rbp,-$08[rbp]   add r14,$08    
  sub r10,$10         jmp eax            mov -$08[rax],r13   mov rax,[rbx]  
  mov $10[r10],r13                     @local1    1->2       jmp eax        
  mov r13,[r14]                          mov r15,$08[rbp]                
  add r14,$08                          @local2    2->1     
;s    1->1                               mov -$08[r10],r15 
  mov rbx,[r14]                          sub r10,$10       
  add r14,$08                            mov $10[r10],r13  
  mov rax,[rbx]                          mov r13,$10[rbp]  
  jmp eax                              @local0    1->2     
                                         mov r15,$00[rbp]  
                                       @local1    2->3     
                                         mov r9,$08[rbp]   
                                       @local2    3->1     
                                         mov -$10[r10],r9  
                                         sub r10,$18       
                                         mov $10[r10],r15  
                                         mov $18[r10],r13  
                                         mov r13,$10[rbp]  
                                       lit    1->2         
                                       #24                 
                                         mov r15,$50[rbx]  
                                       lp+!    2->1        
                                         add rbp,r15       
                                       ;s    1->1          
                                         mov rbx,[r14]     
                                         add r14,$08       
                                         mov rax,[rbx]     
                                         jmp eax           

Locals-haters, come to Gforth, where locals are implemented
inefficiently:-).  The code for 3DUP.2 is actually optimal for
Gforth's calling convention.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
 New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/

[toc] | [prev] | [next] | [standalone]

#134238 — Re: 3dup again (was: Generating a random sequence of Forth words)

From	albert@spenarnc.xs4all.nl
Date	2025-10-03 11:02 +0200
Subject	Re: 3dup again (was: Generating a random sequence of Forth words)
Message-ID	<nnd$36c0a5f5$2a9d9179@1179128ec025831e>
In reply to	#134236

In article <2025Oct2.224440@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>dxf <dxforth@gmail.com> writes:
>>For 3DUP I believe this is the one to beat:
>>
>>: 3DUP ( a b c -- a b c a b c )  dup 2over rot ;
>>
>>With NTF/LFX the locals version will break even.
>
>As we already discussed in the thread including
><2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
>same (optimal for the calling convention used by NTF/LXF) code for
>3DUP versions using the data stack, return stack, or locals.  That's
>because the actual data flow is always the same, and NTF/LXF can see
>this data flow in all three cases.

The problem with 3DUP is that it is actually used in context.
What is the data that is going to 3DUP ped? In my view this
amounts to double use of data that is in registers (32 in the riscv)
anyway, after an optimiser does his thing.

<SNIP>

>- anton
-- 
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.

[toc] | [prev] | [next] | [standalone]

#134239 — Re: 3dup again

From	minforth <minforth@gmx.net>
Date	2025-10-03 11:09 +0200
Subject	Re: 3dup again
Message-ID	<mk9i5hFld65U1@mid.individual.net>
In reply to	#134238

Am 03.10.2025 um 11:02 schrieb albert@spenarnc.xs4all.nl:
> In article <2025Oct2.224440@mips.complang.tuwien.ac.at>,
> Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>> dxf <dxforth@gmail.com> writes:
>>> For 3DUP I believe this is the one to beat:
>>>
>>> : 3DUP ( a b c -- a b c a b c )  dup 2over rot ;
>>>
>>> With NTF/LFX the locals version will break even.
>>
>> As we already discussed in the thread including
>> <2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
>> same (optimal for the calling convention used by NTF/LXF) code for
>> 3DUP versions using the data stack, return stack, or locals.  That's
>> because the actual data flow is always the same, and NTF/LXF can see
>> this data flow in all three cases.
> 
> The problem with 3DUP is that it is actually used in context.
> What is the data that is going to 3DUP ped? In my view this
> amounts to double use of data that is in registers (32 in the riscv)
> anyway, after an optimiser does his thing.
> 

Code inlining will mend it.

[toc] | [prev] | [next] | [standalone]

#134245 — Re: 3dup again

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2025-10-04 08:04 +0000
Subject	Re: 3dup again
Message-ID	<2025Oct4.100409@mips.complang.tuwien.ac.at>
In reply to	#134239

minforth <minforth@gmx.net> writes:
>Am 03.10.2025 um 11:02 schrieb albert@spenarnc.xs4all.nl:
>> The problem with 3DUP is that it is actually used in context.
>> What is the data that is going to 3DUP ped? In my view this
>> amounts to double use of data that is in registers (32 in the riscv)
>> anyway, after an optimiser does his thing.
>> 
>
>Code inlining will mend it.

Inlining is important for Forth, but it does not make what has been
called an "analytical optimizer" unnecessary; on the contraray,
inlining increases the benefit we get from the analytical optimizer.
E.g., let's consider

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

: foo.1 3dup.1 + ! ;
: foo.2 3dup.2 + ! ;
: foo.3 3dup.3 + ! ;
: foo.4 3dup.4 + ! ;

The result produced by VFX64 is:

foo.1              foo.2              foo.3              foo.4
PUSH EBX           MOV  EDX, EBX      CALL 3DUP.3        MOV  EDX, EBX     
MOV  EBX, [ESP]    ADD  EBX, [EBP]    ADD  EBX, [EBP]    ADD  EBX, [EBP]   
POP  EDX           MOV  ECX, [EBP+04] MOV  EDX, [EBP+04] MOV  ECX, [EBP+04]
ADD  EDX, [EBP]    MOV  0 [EBX], ECX  MOV  0 [EBX], EDX  MOV  0 [EBX], ECX 
MOV  ECX, [EBP+04] MOV  EBX, EDX      MOV  EBX, [EBP+08] MOV  EBX, EDX     
MOV  0 [EDX], ECX  NEXT,              LEA  EBP, [EBP+0C] NEXT,             
NEXT,                                 NEXT,             

VFX is only analytical about the data stack, and as a consequence, the
implementations of 3dup that only use the data stack work best.  When
the return stack is used, as in 3dup.1/foo.1, VFX produces
instructions (PUSH for >R, MOV ..., [ESP] for R@ and POP for R>) for
the return-stack operations.  When locals are used, VFX actually
disables inlining and just calls 3DUP.3.

Other Forth systems make too little use of inlining, and I have to
resort to macros to simulate it.  We cannot use proper macros for
3dup.3 (the locals-using variant), so I used EVALUATE-based macros;
this is just for experimental use, not for production, don't do this
at home:-)

Let's see what VFX64 produces for FOO.3 with this:

FOO.3 
( 080C0C50    8BD4 )                  MOV       EDX, ESP
( 080C0C52    FF7504 )                PUSH      [EBP+04]
( 080C0C55    FF7500 )                PUSH      [EBP]
( 080C0C58    53 )                    PUSH      EBX
( 080C0C59    52 )                    PUSH      EDX
( 080C0C5A    57 )                    PUSH      EDI
( 080C0C5B    8BFC )                  MOV       EDI, ESP
( 080C0C5D    81EC00000000 )          SUB       ESP, 00000000
( 080C0C63    8B5D08 )                MOV       EBX, [EBP+08]
( 080C0C66    8D6D0C )                LEA       EBP, [EBP+0C]
( 080C0C69    8B5708 )                MOV       EDX, [EDI+08]
( 080C0C6C    03570C )                ADD       EDX, [EDI+0C]
( 080C0C6F    8B4F08 )                MOV       ECX, [EDI+08]
( 080C0C72    8B470C )                MOV       EAX, [EDI+0C]
( 080C0C75    8D6DEC )                LEA       EBP, [EBP+-14]
( 080C0C78    894D04 )                MOV       [EBP+04], ECX
( 080C0C7B    894508 )                MOV       [EBP+08], EAX
( 080C0C7E    8B4F10 )                MOV       ECX, [EDI+10]
( 080C0C81    894D0C )                MOV       [EBP+0C], ECX
( 080C0C84    895D10 )                MOV       [EBP+10], EBX
( 080C0C87    8BDA )                  MOV       EBX, EDX
( 080C0C89    8B5710 )                MOV       EDX, [EDI+10]
( 080C0C8C    895500 )                MOV       [EBP], EDX
( 080C0C8F    8B5500 )                MOV       EDX, [EBP]
( 080C0C92    8913 )                  MOV       0 [EBX], EDX
( 080C0C94    8B5D04 )                MOV       EBX, [EBP+04]
( 080C0C97    8D6D08 )                LEA       EBP, [EBP+08]
( 080C0C9A    8B6704 )                MOV       ESP, [EDI+04]
( 080C0C9D    8B3F )                  MOV       EDI, 0 [EDI]
( 080C0C9F    C3 )                    NEXT,

So inlining did not mend that.

Here's what lxf produces:

foo.1              foo.2              foo.3              foo.4
mov eax , ebx      mov eax , ebx      mov eax , ebx      mov eax , ebx     
add eax , [ebp]    add eax , [ebp]    add eax , [ebp]    add eax , [ebp]   
mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h]
mov [eax] , ecx    mov [eax] , ecx    mov [eax] , ecx    mov [eax] , ecx   
ret near           ret near           ret near           ret near          

So, because lxf is analytical about the return stack (and, through
that, about locals), inlining produces the same very good code in all
these cases.

You may notice that lxf produces a register-register move less than
VFX does for FOO.2/FOO.4.  That's because VFX decided to modify the
TOS register (and has to restore it later), whereas lxf decided to
modify a copy of that register.  One would have to make additional
observations to determine if lxf was just lucky here or if it
consistently makes the right decision in such cases.

And here's the code that gforth-fast (which does not have an
analytical optimizer) produces:

foo.1              foo.2              foo.3                foo.4
>r    1->0         third    1->1      >l >l 1->1           dup    1->1       
  mov -8[r14],r13    mov [r10],r13    >l    1->1             mov [r10],r13   
  sub r14,$08        sub r10,$08        mov -$08[rbp],r13    sub r10,$08     
2dup    0->2         mov r13,$18[r10]   mov rdx,$08[r10]   2over    1->3     
  mov r13,$10[r10] third    1->2        mov rax,rbp          mov r15,$18[r10]
  mov r15,$08[r10]   mov r15,$10[r10]   add r10,$10          mov r9,$10[r10] 
i    2->3          third    2->3        lea rbp,-$10[rbp]  rot    3->3       
  mov r9,[r14]       mov r9,$08[r10]    mov -$10[rax],rdx    mov rax,r13     
-rot    3->2       +    3->2            mov r13,[r10]        mov r13,r15     
  mov [r10],r9       add r15,r9       >l @local0 1->1        mov r15,r9      
  sub r10,$08      !    2->0          @local0    1->1        mov r9,rax      
r>    2->3           mov [r15],r13      mov rax,rbp        +    3->2         
  mov r9,[r14]     ;s    0->1           lea rbp,-$08[rbp]    add r15,r9      
  add r14,$08        mov r13,$08[r10]   mov -$08[rax],r13  !    2->0         
+    3->2            add r10,$08      @local1    1->2        mov [r15],r13   
  add r15,r9         mov rbx,[r14]      mov r15,$08[rbp]   ;s    0->1        
!    2->0            add r14,$08      @local2    2->3        mov r13,$08[r10]
  mov [r15],r13      mov rax,[rbx]      mov r9,$10[rbp]      add r10,$08     
;s    0->1           jmp eax          @local0    3->1        mov rbx,[r14]   
  mov r13,$08[r10]                      mov -$10[r10],r9     add r14,$08     
  add r10,$08                           sub r10,$18          mov rax,[rbx]   
  mov rbx,[r14]                         mov $10[r10],r15     jmp eax         
  add r14,$08                           mov $18[r10],r13 
  mov rax,[rbx]                         mov r13,$00[rbp] 
  jmp eax                             @local1    1->2    
                                        mov r15,$08[rbp] 
                                      @local2    2->3    
                                        mov r9,$10[rbp]  
                                      +    3->2          
                                        add r15,r9       
                                      !    2->0          
                                        mov [r15],r13    
                                      lit    0->1        
                                      #24                
                                        mov r13,$60[rbx] 
                                      lp+!    1->1       
                                        add r10,$08      
                                        add rbp,r13      
                                        mov r13,[r10]    
                                      ;s    1->1         
                                        mov rbx,[r14]    
                                        add r14,$08      
                                        mov rax,[rbx]    
                                        jmp eax          

Here inlining helps a little, but the disadvantages of the approach
are still obvious.  With less optimization (e.g., no stack caching),
inlining would have helped even less.

And while we are at it, here's SwiftForth:


foo.1             foo.2                 foo.3              foo.4
RBX PUSH          -8 [RBP] RBP LEA      -8 [RBP] RBP LEA   -8 [RBP] RBP LEA
0 [RBP] RBX MOV   RBX 0 [RBP] MOV       RBX 0 [RBP] MOV    RBX 0 [RBP] MOV
8 [RBP] RBP LEA   10 [RBP] RBX MOV      18 # EBX MOV       -10 [RBP] RBP LEA
-10 [RBP] RBP LEA -8 [RBP] RBP LEA      LSPACE CALL        RBX 8 [RBP] MOV
RBX 8 [RBP] MOV   RBX 0 [RBP] MOV       RBX 10 [R13] MOV   20 [RBP] RAX MOV
10 [RBP] RAX MOV  10 [RBP] RBX MOV      0 [RBP] RBX MOV    RAX 0 [RBP] MOV
RAX 0 [RBP] MOV   -8 [RBP] RBP LEA      8 [RBP] RBP LEA    18 [RBP] RBX MOV
-8 [RBP] RBP LEA  RBX 0 [RBP] MOV       RBX 8 [R13] MOV    RBX RCX MOV
RBX 0 [RBP] MOV   10 [RBP] RBX MOV      0 [RBP] RBX MOV    8 [RBP] RBX MOV
0 [RSP] RBX MOV   0 [RBP] RAX MOV       8 [RBP] RBP LEA    0 [RBP] RAX MOV
RBX RCX MOV       8 [RBP] RCX MOV       RBX 0 [R13] MOV    RAX 8 [RBP] MOV
8 [RBP] RAX MOV   RCX 0 [RBX] [RAX] MOV 0 [RBP] RBX MOV    RCX 0 [RBP] MOV
0 [RBP] RBX MOV   10 [RBP] RBX MOV      8 [RBP] RBP LEA    0 [RBP] RAX MOV
RAX 0 [RBP] MOV   18 [RBP] RBP LEA      -8 [RBP] RBP LEA   8 [RBP] RCX MOV
RCX 8 [RBP] MOV   RET                   RBX 0 [RBP] MOV    RCX 0 [RBX] [RAX] MOV
RAX POP                                 0 [R13] RBX MOV    10 [RBP] RBX MOV
RAX RBX ADD                             -8 [RBP] RBP LEA   18 [RBP] RBP LEA
0 [RBP] RAX MOV                         RBX 0 [RBP] MOV    RET
RAX 0 [RBX] MOV                         8 [R13] RBX MOV
8 [RBP] RBX MOV                         -8 [RBP] RBP LEA
10 [RBP] RBP LEA                        RBX 0 [RBP] MOV
RET                                     10 [R13] RBX MOV
                                        -8 [RBP] RBP LEA
                                        RBX 0 [RBP] MOV
                                        0 [R13] RBX MOV
                                        -8 [RBP] RBP LEA
                                        RBX 0 [RBP] MOV
                                        8 [R13] RBX MOV
                                        -8 [RBP] RBP LEA
                                        RBX 0 [RBP] MOV
                                        10 [R13] RBX MOV
                                        0 [RBP] RAX MOV
                                        8 [RBP] RCX MOV
                                        RCX 0 [RBX] [RAX] MOV
                                        10 [RBP] RBX MOV
                                        18 [RBP] RBP LEA
                                        RET


And finally, iForth:

foo.1/foo.4              foo.2                      foo.3
pop  rbx                 mov rbx, [rsp #16 +] qword pop  rbx
pop  rdi                 mov  rcx, rbx              lea rsi, [rsi #-16 +] qword
pop  rax                 mov rbx, [rsp 8 +] qword   mov [esi] dword, rbx
mov [edi ebx*1]dword,rax push rcx                   pop  rbx
push rax                 mov  rcx, rbx              lea rsi, [rsi #-16 +] qword
push rdi                 mov rbx, [rsp 8 +] qword   mov [esi] dword, rbx
push rbx                 pop  rdi                   pop  rbx
;                        mov [ecx ebx*1] dword, rdi lea rsi, [rsi #-16 +] qword
                         ;                          mov [esi] dword, rbx
                                                    mov rbx, [rsi #16 +] qword
                                                    add rbx, [rsi #32 +] qword
                                                    mov rdi, [rsi] qword
                                                    mov rax, [rsi #16 +] qword
                                                    mov rdx, [rsi #32 +] qword
                                                    mov [ebx] dword, rdi
                                                    push rdi
                                                    push rax
                                                    push rdx
                                                    add  rsi, #48 b#
                                                    ;

It's interesting that Gforth produced the same code for FOO.1 and
FOO.4, but different code for FOO.2.  Both variants are suboptimal
IMO, because the contain unnecessary pushes.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
 New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/

[toc] | [prev] | [next] | [standalone]

#134251 — Re: 3dup again

From	Hans Bezemer <the.beez.speaks@gmail.com>
Date	2025-10-05 11:29 +0200
Subject	Re: 3dup again
Message-ID	<nnd$20413d56$44555268@6e1be6ac6c78e86b>
In reply to	#134236

On 02-10-2025 22:44, Anton Ertl wrote:
 > Locals-haters, come to Gforth, where locals are implemented
 > inefficiently:-).  The code for 3DUP.2 is actually optimal for
 > Gforth's calling convention.

I think I can beat you at that :)

1. This is the LOCAL definition

   Addr| Opcode                        Operand   Argument

     35| branch                             46   local
     36| r>                                  0
     37| swap                                0
     38| dup                                 0
     39| >r                                  0
     40| @                                   0
     41| >r                                  0
     42| execute                             0
     43| r>                                  0
     44| r>                                  0
     45| !                                   0
     46| exit                                0

2. This is the 3DUP with locals


   Addr| Opcode                        Operand   Argument

     54| branch                             82   3dup.1
     55| literal                             0
     56| to                                  0   a
     57| literal                             0
     58| environ                             1
     59| +                                   0
     60| call                               35   local
     61| literal                             0
     62| to                                  1   b
     63| literal                             1
     64| environ                             1
     65| +                                   0
     66| call                               35   local
     67| literal                             0
     68| to                                  2   c
     69| literal                             2
     70| environ                             1
     71| +                                   0
     72| call                               35   local
     73| to                                  2   c
     74| to                                  1   b
     75| to                                  0   a
     76| value                               0   a
     77| value                               1   b
     78| value                               2   c
     79| value                               0   a
     80| value                               1   b
     81| value                               2   c
     82| exit                                0

3. This is 3DUP *without* locals

   Addr| Opcode                        Operand   Argument

     83| branch                             91   3dup.2
     84| >r                                  0
     85| over                                0
     86| over                                0
     87| r@                                  0
     88| rot                                 0
     89| rot                                 0
     90| r>                                  0
     91| exit                                0

And this is the sourcecode:

include lib/anstools.4th
include 4pp/lib/alocals.4pp

: clear depth 0 ?do drop loop ;
: 3dup.1 {: a b c -- a b c a b c :} a b c a b c ;
: 3dup.2 >r 2dup r@ -rot r> ;

1 2 3 3dup.1 .s clear
4 5 6 3dup.2 .s clear

Yeah, heavy use of the preprocessor. This is the expanded source:

: 3dup.1
[UNDEFINED] a [IF] 0 value a [THEN] ['] a >body local
[UNDEFINED] b [IF] 0 value b [THEN] ['] b >body local
[UNDEFINED] c [IF] 0 value c [THEN] ['] c >body local
to c to b to a a b c a b c ;

Maybe now the decompilation makes sense ;-)

Hans Bezemer

> dxf <dxforth@gmail.com> writes:
>> For 3DUP I believe this is the one to beat:
>>
>> : 3DUP ( a b c -- a b c a b c )  dup 2over rot ;
>>
>> With NTF/LFX the locals version will break even.
> 
> As we already discussed in the thread including
> <2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
> same (optimal for the calling convention used by NTF/LXF) code for
> 3DUP versions using the data stack, return stack, or locals.  That's
> because the actual data flow is always the same, and NTF/LXF can see
> this data flow in all three cases.
> 
>> For others, well, it may
>> be better not to look.  For a straight-forward example of 'stack juggling',
>> locals handle it rather poorly.
> 
> Other Forth systems implement locals poorly.  LXF/NTF demonstrates
> that this is not due to some natural law, however.
> 
> There have been some improvements in Gforth since that time.  Let's
> see how the versions used in that thread look on today's gforth-fast.
> Here are the versions of 3DUP:
> 
> : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
> : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
> : 3dup.3 {: a b c :} a b c a b c ;
> : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;
> 
> And here's the gforth-fast code on AMD64:
> 
> 3dup.1              3dup.2             3dup.3              3dup.4
>> r    1->0          third    1->2      >l >l 1->1          dup    1->1
>    mov -$08[r14],r13   mov r15,$10[r10] >l    1->1            mov [r10],r13
>    sub r14,$08       third    2->3        mov -$08[rbp],r13   sub r10,$08
> 2dup    0->2          mov r9,$08[r10]    mov rdx,$08[r10]  2over    1->3
>    mov r13,$10[r10]  third    3->1        mov rax,rbp         mov r15,$18[r10
>    mov r15,$08[r10]    mov [r10],r13      add r10,$10         mov r9,$10[r10]
> i    2->3             sub r10,$18        lea rbp,-$10[rbp] rot    3->1
>    mov r9,[r14]        mov $10[r10],r15   mov -$10[rax],rdx   mov [r10],r15
> -rot    3->2          mov $08[r10],r9    mov r13,[r10]       sub r10,$10
>    mov [r10],r9      ;s    1->1         >l @local0 1->1       mov $08[r10],r9
>    sub r10,$08         mov rbx,[r14]    @local0    1->1     ;s    1->1
> r>    2->1            add r14,$08        mov rax,rbp         mov rbx,[r14]
>    mov -$08[r10],r15   mov rax,[rbx]      lea rbp,-$08[rbp]   add r14,$08
>    sub r10,$10         jmp eax            mov -$08[rax],r13   mov rax,[rbx]
>    mov $10[r10],r13                     @local1    1->2       jmp eax
>    mov r13,[r14]                          mov r15,$08[rbp]
>    add r14,$08                          @local2    2->1
> ;s    1->1                               mov -$08[r10],r15
>    mov rbx,[r14]                          sub r10,$10
>    add r14,$08                            mov $10[r10],r13
>    mov rax,[rbx]                          mov r13,$10[rbp]
>    jmp eax                              @local0    1->2
>                                           mov r15,$00[rbp]
>                                         @local1    2->3
>                                           mov r9,$08[rbp]
>                                         @local2    3->1
>                                           mov -$10[r10],r9
>                                           sub r10,$18
>                                           mov $10[r10],r15
>                                           mov $18[r10],r13
>                                           mov r13,$10[rbp]
>                                         lit    1->2
>                                         #24
>                                           mov r15,$50[rbx]
>                                         lp+!    2->1
>                                           add rbp,r15
>                                         ;s    1->1
>                                           mov rbx,[r14]
>                                           add r14,$08
>                                           mov rax,[rbx]
>                                           jmp eax
> 
> 
> - anton

[toc] | [prev] | [next] | [standalone]

#134261

From	antispam@fricas.org (Waldek Hebisch)
Date	2025-10-15 19:19 +0000
Message-ID	<10cos4q$3de9o$1@paganini.bofh.team>
In reply to	#134221

Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> I wanted to generate a random sequence consisting of a number, NOOP,
> and DROP.  When EVALUATEing the sequence, starting with an empty
> stack, it should end with an empty stack, and should not underflow the
> stack.
> 
> I implemented this twice.  The first attempt just generated a random
> word/number in the normal case, and a specific kind of word/number in
> the boundary cases:
> 
> : genseq {: u -- :}
>     0 u 0 do
>         case 
>             dup 0= ?of 1+ ." 9 " endof
>             dup i + u u>= ?of 1- ." drop " endof
>             dup i + 1+ u u>= ?of ." noop " endof
>             rnd 3 um* nip 1- dup case
>                 -1 of ." drop " endof
>                 0  of ." noop " endof
>                 1  of ." 8 " endof
>             endcase
>             +
>             0 endcase
>     loop
>     drop ;
> 
> When I do 16 GENSEQ with seed 1234, I get the following sequence:
> 
> 9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop
> 
> The requirements are distributed across several places in this word,
> which I did not like.  I also did not like the non-randomness of the
> sequence of DROPs at the end, and the non-randomness of the 9 whenever
> the stack is empty.  I also found that I needed to add a clause for
> when the stack depth is one less than the number of remaining words.
> 
> So I thought about other ways to do it.
> 
> Eventually I came up with the following approach: Starts with counters
> for the different kinds of words/numbers.  Whenever one word/number is
> generated, its counter is reduced.  At each step the probability of
> generating a kind of word/number is proportional to the remaining
> counter for that word/number.  The same-depth requirement is
> implemented by having the same initial counter for the number and
> DROP.  The no-underflow requirement is implemented by skipping the
> output of a DROP if the count of remaining DROPs = the count of the
> remaining numbers.
> 
> Here's the code:
> 
> create counts 3 cells allot
> 
> : initcounts ( u -- )
>     dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;
> 
> : .counts ( -- )
>     counts 3 cell array>mem mem+do
>         i @ 4 .r
>     loop ;
> 
> : countsth-- ( u -- )
>     -1 counts rot th +! ;
> 
> : select1 ( u -- )
>     \ u is the index of the word
>     dup case
>         0 of ." 9 "    countsth-- endof
>         1 of ." noop " countsth-- endof
>         2 of counts @ counts 2 th @ u< if
>                 ." drop " countsth--
>             else
>                 drop endif
>         endof
>     endcase ;
> 
> : sum ( addr u -- u2 )
>     \ addr u is an array with u elements, u2 is the sum of the elements
>     0 -rot cell array>mem mem+do
>         i @ +
>     loop ;
> 
> : genseq ( u -- )
>     initcounts
>     begin
>         counts 3 sum dup while
>             rnd um* nip ( sum )
>             3 0 u+do ( sum1 )
>                 counts i th @ 2dup u< if
>                     i select1 then
>                 - loop
>             drop
>     repeat
>     drop ;
> 
> The code is much longer, but I like it better.  The requirements are
> concentrated in two places.
> 
> 16 GENSEQ generates
> 
> 9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop
> 
> for seed 123.  A disadvantage of this approach is that it does not
> generate a word/number on each iteration through the main loop.  I
> have thought about how to achieve this, but the result would be quite
> a bit more complex; it's also only a minor issue.  E.g., for
> generating 1000 words/numbers with seed 123, the main loop has 1036
> iterations.
> 
> The second approach takes quite a bit more time; for generating a
> million words/numbers into a string on a Zen 4, 192_876_892 cycles
> compared to 146_644_156 cycles.
> 
> The original goal was to produce a sequence where various parts of the
> text interpreter have branch mispredictions; the less predictable the
> sequence, the better.  In this respect the second sequence is
> significantly better.  For a sequence of 1000 words/numbers,
> text-interpreted 1000 times, the number of mispredictions is:
> 
> 728_322 first approach
> 821_813 second approach
> 
> (with 247_125 mispredictions coming from (mainly) Gforth startup).

In related problem I used the following technique.  Basic part
is: at each position you need to generate next step with right
probablity.  How to know probabilities?  Just count all possible
continuations.  To do this with reasonable efficiency start
from the end and observe that legality of a continuation depends
only on number of elements on the stack, but does not depend on
sequence of operations that put the items on the stack.  Once
you know counts for all contunuations of length k you can compute
counts for contunuations of length k + 1 (each possible first step
leads to new state for which count is known).

This produces uniform distribution, so in this sense gives
pretty good randomness.  OTOH size of table of counts grows
quadratically with length, so it may be too costly to
compute for long sequences.  But if one wants several sequences
of modest length (my case) cost of table is amortized over
many uses.

-- 
                              Waldek Hebisch

[toc] | [prev] | [next] | [standalone]

#134266

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2025-10-24 15:55 +0000
Message-ID	<2025Oct24.175521@mips.complang.tuwien.ac.at>
In reply to	#134261

antispam@fricas.org (Waldek Hebisch) writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>> I wanted to generate a random sequence consisting of a number, NOOP,
>> and DROP.  When EVALUATEing the sequence, starting with an empty
>> stack, it should end with an empty stack, and should not underflow the
>> stack.
>> 
>> I implemented this twice.  The first attempt just generated a random
>> word/number in the normal case, and a specific kind of word/number in
>> the boundary cases:
>> 
>> : genseq {: u -- :}
>>     0 u 0 do
>>         case 
>>             dup 0= ?of 1+ ." 9 " endof
>>             dup i + u u>= ?of 1- ." drop " endof
>>             dup i + 1+ u u>= ?of ." noop " endof
>>             rnd 3 um* nip 1- dup case
>>                 -1 of ." drop " endof
>>                 0  of ." noop " endof
>>                 1  of ." 8 " endof
>>             endcase
>>             +
>>             0 endcase
>>     loop
>>     drop ;
>> 
>> When I do 16 GENSEQ with seed 1234, I get the following sequence:
>> 
>> 9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop
>> 
>> The requirements are distributed across several places in this word,
>> which I did not like.  I also did not like the non-randomness of the
>> sequence of DROPs at the end, and the non-randomness of the 9 whenever
>> the stack is empty.  I also found that I needed to add a clause for
>> when the stack depth is one less than the number of remaining words.
>> 
>> So I thought about other ways to do it.
>> 
>> Eventually I came up with the following approach: Starts with counters
>> for the different kinds of words/numbers.  Whenever one word/number is
>> generated, its counter is reduced.  At each step the probability of
>> generating a kind of word/number is proportional to the remaining
>> counter for that word/number.  The same-depth requirement is
>> implemented by having the same initial counter for the number and
>> DROP.  The no-underflow requirement is implemented by skipping the
>> output of a DROP if the count of remaining DROPs = the count of the
>> remaining numbers.
>> 
>> Here's the code:
>> 
>> create counts 3 cells allot
>> 
>> : initcounts ( u -- )
>>     dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;
>> 
>> : .counts ( -- )
>>     counts 3 cell array>mem mem+do
>>         i @ 4 .r
>>     loop ;
>> 
>> : countsth-- ( u -- )
>>     -1 counts rot th +! ;
>> 
>> : select1 ( u -- )
>>     \ u is the index of the word
>>     dup case
>>         0 of ." 9 "    countsth-- endof
>>         1 of ." noop " countsth-- endof
>>         2 of counts @ counts 2 th @ u< if
>>                 ." drop " countsth--
>>             else
>>                 drop endif
>>         endof
>>     endcase ;
>> 
>> : sum ( addr u -- u2 )
>>     \ addr u is an array with u elements, u2 is the sum of the elements
>>     0 -rot cell array>mem mem+do
>>         i @ +
>>     loop ;
>> 
>> : genseq ( u -- )
>>     initcounts
>>     begin
>>         counts 3 sum dup while
>>             rnd um* nip ( sum )
>>             3 0 u+do ( sum1 )
>>                 counts i th @ 2dup u< if
>>                     i select1 then
>>                 - loop
>>             drop
>>     repeat
>>     drop ;
>> 
>> The code is much longer, but I like it better.  The requirements are
>> concentrated in two places.
>> 
>> 16 GENSEQ generates
>> 
>> 9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop
>> 
>> for seed 123.  A disadvantage of this approach is that it does not
>> generate a word/number on each iteration through the main loop.  I
>> have thought about how to achieve this, but the result would be quite
>> a bit more complex; it's also only a minor issue.  E.g., for
>> generating 1000 words/numbers with seed 123, the main loop has 1036
>> iterations.
>> 
>> The second approach takes quite a bit more time; for generating a
>> million words/numbers into a string on a Zen 4, 192_876_892 cycles
>> compared to 146_644_156 cycles.
>> 
>> The original goal was to produce a sequence where various parts of the
>> text interpreter have branch mispredictions; the less predictable the
>> sequence, the better.  In this respect the second sequence is
>> significantly better.  For a sequence of 1000 words/numbers,
>> text-interpreted 1000 times, the number of mispredictions is:
>> 
>> 728_322 first approach
>> 821_813 second approach
>> 
>> (with 247_125 mispredictions coming from (mainly) Gforth startup).
>


>In related problem I used the following technique.  Basic part
>is: at each position you need to generate next step with right
>probablity.  How to know probabilities?  Just count all possible
>continuations.  To do this with reasonable efficiency start
>from the end and observe that legality of a continuation depends
>only on number of elements on the stack, but does not depend on
>sequence of operations that put the items on the stack.  Once
>you know counts for all contunuations of length k you can compute
>counts for contunuations of length k + 1 (each possible first step
>leads to new state for which count is known).
>
>This produces uniform distribution, so in this sense gives
>pretty good randomness.  OTOH size of table of counts grows
>quadratically with length

It's unclear to me what benefit this table provides over my method
which uses just counts.  My counts represent the number of outstanding
9s, NOOPs, and DROPs.  And the counts are used for the probabilities
of the words used for the random selection (except that a selection is
suppressed if it is not legal (i.e., a DROP when the stack is empty)).

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/

[toc] | [prev] | [standalone]

csiph-web

Generating a random sequence of Forth words

Contents

#134221 — Generating a random sequence of Forth words

#134224

#134229

#134227

#134230

#134231

#134232

#134237

#134236 — 3dup again (was: Generating a random sequence of Forth words)

#134238 — Re: 3dup again (was: Generating a random sequence of Forth words)

#134239 — Re: 3dup again

#134245 — Re: 3dup again

#134251 — Re: 3dup again

#134261

#134266