Groups > comp.arch > #114964 > unrolled thread

A useless machine

Started by	Thomas Koenig <tkoenig@netcologne.de>
First post	2026-02-14 09:57 +0000
Last post	2026-02-18 21:56 +0100
Articles	20 on this page of 81 — 12 participants

Back to article view | Back to comp.arch

  A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-14 09:57 +0000
    Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-14 18:48 +0000
      Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-14 20:18 +0100
        Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-14 21:22 +0000
          Re: A useless machine Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-02-14 13:30 -0800
          Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-15 12:50 +0100
            Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 13:48 -0600
              Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-21 21:40 +0100
                Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 15:17 -0600
                  Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 04:50 -0600
                Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:01 -0700
              Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-22 14:43 +0100
                Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 12:38 -0600
                  Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-23 10:51 +0100
                    Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-26 01:27 +0000
                      Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-26 09:21 +0100
                      Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:11 -0700
        Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 14:58 +0100
        Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 12:27 -0600
          Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-22 15:14 +0100
        Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-22 11:41 -0500
          Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-23 10:53 +0100
            Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-23 10:52 -0500
    Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-02-14 21:19 -0800
    Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 03:58 -0500
      Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-15 13:14 +0100
        Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 15:46 +0000
        Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 10:56 -0500
          Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-15 19:16 +0000
            Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 22:40 +0100
              Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-26 19:31 +0000
          Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-18 18:22 +0200
            Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-18 22:56 -0500
              Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-19 12:10 +0200
                Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-21 01:56 -0500
                  Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-21 19:23 +0200
                  Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-21 20:36 +0200
                    Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 10:46 +0200
                      Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-22 12:45 +0100
                        Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 16:35 +0200
                    Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-22 19:12 -0500
                      Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-22 19:47 -0500
                  Re: A useless machine EricP <ThatWouldBeTelling@thevillage.com> - 2026-02-21 15:39 -0500
                    Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 03:16 +0200
                      Re: A useless machine EricP <ThatWouldBeTelling@thevillage.com> - 2026-02-22 14:48 -0500
                        Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 23:25 +0200
              Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-19 11:51 +0100
                Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-19 23:12 +0200
                  Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-20 15:11 +0100
              Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-19 15:27 -0500
                Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-20 11:41 +0200
                Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 18:41 -0600
                  Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-23 11:10 +0000
                    Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-23 17:24 -0600
                      Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-24 06:44 +0000
                        Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 01:38 -0600
                          Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 03:16 -0600
                            Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 04:52 -0600
                          Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-24 11:16 +0000
                            Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 13:19 -0600
                        Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:14 -0700
            Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-20 08:01 +0000
      Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-15 15:35 +0200
        Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-15 18:33 +0200
          Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 19:19 +0000
            Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 12:31 +0200
              Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-17 15:41 +0200
                Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-17 18:17 +0200
      Re: A useless machine anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-02-15 22:41 +0000
    Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 14:52 +0100
      Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 16:45 +0000
        Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 18:01 +0100
          Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 19:42 +0000
            Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 22:59 +0100
              Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 21:53 -0500
                Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 12:40 +0200
                  Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 20:34 +0200
                Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-16 16:08 +0100
    Re: A useless machine Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-02-16 07:50 -0800
      Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 18:31 +0200
      Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-18 21:56 +0100

Page 2 of 5 — ← Prev page 1 [2] 3 4 5 Next page →

#115094

From	Stefan Monnier <monnier@iro.umontreal.ca>
Date	2026-02-22 11:41 -0500
Message-ID	<jwvqzqcd98o.fsf-monnier+comp.arch@gnu.org>
In reply to	#114970

> It is a mathematically very interesting problem.  Whether there is anything
> to be gained by continuing the search for exceptions by brute force, is
> a different matter - but /if/ it is worth doing, then doing so more
> efficiently is a good idea.

AFAIK, in such games "more efficiently" tends to me largely irrelevant
if it's more efficient by a (non-astronomical) constant factor.
Instead, you need it to be algorithmically more efficient.


=== Stefan

[toc] | [prev] | [next] | [standalone]

#115110

From	David Brown <david.brown@hesbynett.no>
Date	2026-02-23 10:53 +0100
Message-ID	<10nh82q$2sbtq$2@dont-email.me>
In reply to	#115094

On 22/02/2026 17:41, Stefan Monnier wrote:
>> It is a mathematically very interesting problem.  Whether there is anything
>> to be gained by continuing the search for exceptions by brute force, is
>> a different matter - but /if/ it is worth doing, then doing so more
>> efficiently is a good idea.
> 
> AFAIK, in such games "more efficiently" tends to me largely irrelevant
> if it's more efficient by a (non-astronomical) constant factor.
> Instead, you need it to be algorithmically more efficient.
> 

Sure - algorithmic improvements usually (but not always in practice) 
lead to the biggest efficiency gains.  But if you are doing a "test 
every number up to N" check of a conjecture, /any/ efficiency 
improvements are helpful.

[toc] | [prev] | [next] | [standalone]

#115114

From	Stefan Monnier <monnier@iro.umontreal.ca>
Date	2026-02-23 10:52 -0500
Message-ID	<jwv4in7a27d.fsf-monnier+comp.arch@gnu.org>
In reply to	#115110

David Brown [2026-02-23 10:53:30] wrote:
> Sure - algorithmic improvements usually (but not always in practice)
> lead to the biggest efficiency gains.  But if you are doing a "test
> every number up to N" check of a conjecture, /any/ efficiency
> improvements are helpful.

But once you get to 2^70, the exponential explosion means that
constant-factor improvements are vanishingly helpful.


=== Stefan

[toc] | [prev] | [next] | [standalone]

#114980

From	Tim Rentsch <tr.17687@z991.linuxsc.com>
Date	2026-02-14 21:19 -0800
Message-ID	<867bseh9c6.fsf@linuxsc.com>
In reply to	#114964

Thomas Koenig <tkoenig@netcologne.de> writes:

> Some people are throwing large amounts of computing power at trying
> to disprove the Collatz conjecture.  Graphics cards have sped this
> up enormously.
>
> (A reminder, if any is needed:  From a starting number n, the
> transformation, recursively applied,
>
>
>    f(n) = 3*n+1, if n is odd; =n/2 if n is even.
>
> is conjectured to always reach 1 for any positive integer 1).
>
> All the work is done on general-purpose hardware, which has many
> capabilities that are not needed, and consume area and power that
> special-purpose hardware would not need.  Also, the hardware
> is limited to 64-bit integers, and the range of tested numbers
> is now up to 2^71, so
>
> What could specialized hardware look like?
>
> The central part could be a unit which is handed a block of numbers
> to check, for example a range of 2^32 numbers.  The current records
> for number of iterations and maximum number reached would also be
> stored in that unit.
>
> In that unit, the calculaiton would be performed by a combination
> of adder and shifter.  The input would always be odd (so the last
> bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1).
>
> The adder would compute (3*n+1)/2, and would be specialzed 128-bit
> adder.  Why specialized?  The logic is far less complex than a
> general-purpose adder.  For example, a two-bit adder has the formula
>
> y1 = (!a1&a0) | (!c_in&a1&!a0);
> y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0);
>
> and its propagate and generate bits are
>
> p = (a1);
> g = (a1&a0);
>
> The maximum number of trailing zeros after the (3*n+1)/2 step is
> three and can be determined cheaply from bits <3:1> of the input,
> so the input to the shifter (two levels only) can be provided
> in parallel.  The machine would be designed that adding and
> shifting can be done in a single cycle.
>
> The number of iterations would be tracked;  if larger than a prevous
> record, the calculation would terminate with a corresponding signal
> to the outside world.
>
> Also, each cycle, the result from the previous round would be
> compared against the input value.  If it is lower, the next number
> would be chosen.  Finally, the number reached would be checked against
> the previous, pre-set maximum, with would also signal to the outside
> world.
>
> Which numbers to chose?  It is well known that only certain
> numbers with remainders modulo 2^n are interesting.  The number
> can be seen on OEIS A076227.  For example, modulo 2^10, only 64
> numbers are "interesting", or 6.25% (but only eight bits would
> be needed to store them because the last two are always one).
> Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the
> amount of numbers to be checked by half.  One would use a small ROM,
> I presume.  The number of bits would be a tradeoff between area and
> amount of compuation.  My current guess would be that 2^10 could
> be a sweet spot.  One third of the numbers would further be
> excluded by checking for divisiblity by three.
>
> The problem is embarassingly parallel.  One could chose a size so
> each parcel is calculated in a few seconds or even minutes.  to avoid
> having to avoid a large amount of communication with the outside.
>
> Because all units would be working all the time, and if one has
> a smaller unit one could simply build more of them, I suspect that
> power would be the main limiting factor.  An adder design that
> might not be very fast, but uses lower power, low voltage and low
> frequency would be beneficial.
>
> Comments?

Amusing.  I predict that ultimately it will be a waste of
effort because the Collatz conjecture is true even though
currently unproven.  Also it seems likely that no mathematical
insights will result from any sort of brute force approach to
disprove it, even if the "brute force" scheme is selective.

If anyone is interested in looking into questions regarding the
Collatz conjecture, I suggest considering a generalization

   f[k](n) = n/2 if n is even, 3n+k if n is odd, for an fixed odd k

Empirically you should quickly find

   There is always a cycle starting at k.  (Duh!)

   There is only one cycle if k is a power of 3.  (This result
   follows from the Collatz conjecture.)

   If k is not a power of 3, there are always at least two cycles,
   and can be more.

[toc] | [prev] | [next] | [standalone]

#114982

From	Robert Finch <robfi680@gmail.com>
Date	2026-02-15 03:58 -0500
Message-ID	<10ms1rm$3s9f4$1@dont-email.me>
In reply to	#114964

On 2026-02-14 4:57 a.m., Thomas Koenig wrote:
> Some people are throwing large amounts of computing power at trying
> to disprove the Collatz conjecture.  Graphics cards have sped this
> up enormously.
> 
> (A reminder, if any is needed:  From a starting number n, the
> transformation, recursively applied,
> 
> 
>     f(n) = 3*n+1, if n is odd; =n/2 if n is even.
> 
> is conjectured to always reach 1 for any positive integer 1).
> 
> All the work is done on general-purpose hardware, which has many
> capabilities that are not needed, and consume area and power that
> special-purpose hardware would not need.  Also, the hardware
> is limited to 64-bit integers, and the range of tested numbers
> is now up to 2^71, so
> 
> What could specialized hardware look like?
> 
> The central part could be a unit which is handed a block of numbers
> to check, for example a range of 2^32 numbers.  The current records
> for number of iterations and maximum number reached would also be
> stored in that unit.
> 
> In that unit, the calculaiton would be performed by a combination
> of adder and shifter.  The input would always be odd (so the last
> bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1).
> 
> The adder would compute (3*n+1)/2, and would be specialzed 128-bit
> adder.  Why specialized?  The logic is far less complex than a
> general-purpose adder.  For example, a two-bit adder has the formula
> 
> y1 = (!a1&a0) | (!c_in&a1&!a0);
> y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0);
> 
> and its propagate and generate bits are
> 
> p = (a1);
> g = (a1&a0);
> 
> The maximum number of trailing zeros after the (3*n+1)/2 step is
> three and can be determined cheaply from bits <3:1> of the input,
> so the input to the shifter (two levels only) can be provided
> in parallel.  The machine would be designed that adding and
> shifting can be done in a single cycle.
> 
> The number of iterations would be tracked; if larger than a prevous
> record, the calculation would terminate with a corresponding signal
> to the outside world.
> 
> Also, each cycle, the result from the previous round would be
> compared against the input value.  If it is lower, the next number
> would be chosen.  Finally, the number reached would be checked against
> the previous, pre-set maximum, with would also signal to the outside
> world.
> 
> Which numbers to chose?  It is well known that only certain
> numbers with remainders modulo 2^n are interesting.  The number
> can be seen on OEIS A076227.  For example, modulo 2^10, only 64
> numbers are "interesting", or 6.25% (but only eight bits would
> be needed to store them because the last two are always one).
> Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the
> amount of numbers to be checked by half.  One would use a small ROM,
> I presume. The number of bits would be a tradeoff between area and
> amount of compuation. My current guess would be that 2^10 could
> be a sweet spot.  One third of the numbers would further be
> excluded by checking for divisiblity by three.
> 
> The problem is embarassingly parallel.  One could chose a size so
> each parcel is calculated in a few seconds or even minutes. to avoid
> having to avoid a large amount of communication with the outside.
> 
> Because all units would be working all the time, and if one has
> a smaller unit one could simply build more of them, I suspect that
> power would be the main limiting factor.  An adder design that
> might not be very fast, but uses lower power, low voltage and low
> frequency would be beneficial.
> 
> Comments?
> 
I wonder how difficult it would be to implement a parallel tester in an 
FPGA. It looks simple enough and there are hundreds of DSP blocks 
available to use. Could test in blocks of hundreds of numbers at a time.
Running at 100 MHz * 200 tests at a time = how long to get to 2^71?

I wonder if the calculation can be broken down further. Numbers are all 
just defined by rules. I wonder if some calculus may apply. Limit of 
function approaches one for a finite number of steps. There may be a 
whole range of algorithms, is there one for the number two? Then the 
number three?

Occasionally I have made useless machinery. Great puzzle work. Made the 
game of life in hardware.

[toc] | [prev] | [next] | [standalone]

#114985

From	David Brown <david.brown@hesbynett.no>
Date	2026-02-15 13:14 +0100
Message-ID	<10msdau$3vgrg$2@dont-email.me>
In reply to	#114982

On 15/02/2026 09:58, Robert Finch wrote:

> I wonder how difficult it would be to implement a parallel tester in an 
> FPGA. It looks simple enough and there are hundreds of DSP blocks 
> available to use. Could test in blocks of hundreds of numbers at a time.
> Running at 100 MHz * 200 tests at a time = how long to get to 2^71?
> 

One of the nice things about this is that you don't need DSP blocks - it 
all works well with very simple operations well-suited to an FPGA.  Your 
calculation blocks would hold an n-bit number (wider than the number you 
are testing for, since rounds of the Collatz function can grow a lot) 
and a counter.  Initialise the block to the number you want to test. 
For each round, check the LSB.  If it is 0, right-shift your register. 
If it is even, replace "x" with "(x << 1) + x".  That's just some simple 
logic - no DSP blocks involved.  If your register or your counter 
overflow, you've found an "interesting" number and can pass that out to 
a PC to test it fully.

There's lots of scope for both high-level work (the housekeeping and 
tracking of operations for each of your worker elements) and low-level 
work (getting the carry operations to work at high speed).

> I wonder if the calculation can be broken down further. Numbers are all 
> just defined by rules. I wonder if some calculus may apply. Limit of 
> function approaches one for a finite number of steps. There may be a 
> whole range of algorithms, is there one for the number two? Then the 
> number three?
> 
> Occasionally I have made useless machinery. Great puzzle work. Made the 
> game of life in hardware.
>

[toc] | [prev] | [next] | [standalone]

#114992

From	Thomas Koenig <tkoenig@netcologne.de>
Date	2026-02-15 15:46 +0000
Message-ID	<10mspoj$4cqn$1@dont-email.me>
In reply to	#114985

David Brown <david.brown@hesbynett.no> schrieb:
> On 15/02/2026 09:58, Robert Finch wrote:
>
>> I wonder how difficult it would be to implement a parallel tester in an 
>> FPGA. It looks simple enough and there are hundreds of DSP blocks 
>> available to use. Could test in blocks of hundreds of numbers at a time.
>> Running at 100 MHz * 200 tests at a time = how long to get to 2^71?
>> 
>
> One of the nice things about this is that you don't need DSP blocks - it 
> all works well with very simple operations well-suited to an FPGA.  Your 
> calculation blocks would hold an n-bit number (wider than the number you 
> are testing for, since rounds of the Collatz function can grow a lot) 
> and a counter.  Initialise the block to the number you want to test. 
> For each round, check the LSB.  If it is 0, right-shift your register. 
> If it is even, replace "x" with "(x << 1) + x".  That's just some simple 
> logic - no DSP blocks involved.  If your register or your counter 
> overflow, you've found an "interesting" number and can pass that out to 
> a PC to test it fully.

It is even simpler than that - using the optimization of always
calculating (3*n+1)/2 is ideal for look-up tables.  If you have
six-bit LUTs, you divide your number into five-bit blocks and
set up your LUTs for generating (3*n+1+carry_in)/2 - five LUTs
for five result bits.  You then also need two LUTs for generating
the propagate and generate bits for the five-bit groups.

You can then generate group generate and propagate bits for three
five-bit groups with two LUTs (15 bits) and so on.  Three levels
of P and G would give you 135 bits, more than enough.

The comparators could be more expensive; if this is done with LUTs,
you could only compare three bits each.  Maybe this is better
with the DSP blocks, I don't know.

> There's lots of scope for both high-level work (the housekeeping and 
> tracking of operations for each of your worker elements) and low-level 
> work (getting the carry operations to work at high speed).

Yep.
-- 
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

[toc] | [prev] | [next] | [standalone]

#114993

From	Robert Finch <robfi680@gmail.com>
Date	2026-02-15 10:56 -0500
Message-ID	<10msqc3$4011$1@dont-email.me>
In reply to	#114985

On 2026-02-15 7:14 a.m., David Brown wrote:
> On 15/02/2026 09:58, Robert Finch wrote:
> 
>> I wonder how difficult it would be to implement a parallel tester in 
>> an FPGA. It looks simple enough and there are hundreds of DSP blocks 
>> available to use. Could test in blocks of hundreds of numbers at a time.
>> Running at 100 MHz * 200 tests at a time = how long to get to 2^71?
>>
> 
> One of the nice things about this is that you don't need DSP blocks - it 
Yeah, I figured that out after coding. 3n+1 = n+n+n+1

> all works well with very simple operations well-suited to an FPGA.  Your 
> calculation blocks would hold an n-bit number (wider than the number you 
> are testing for, since rounds of the Collatz function can grow a lot) 
> and a counter.  Initialise the block to the number you want to test. For 
> each round, check the LSB.  If it is 0, right-shift your register. If it 
> is even, replace "x" with "(x << 1) + x".  That's just some simple logic 

Oops. I thought that was x = (x >> 1) if it is even. I coded it 
stupidly. It may not matter in FPGA logic. All the bits goo into LUTs 
and it get figured out.

> - no DSP blocks involved.  If your register or your counter overflow, 
> you've found an "interesting" number and can pass that out to a PC to 
> test it fully.
>
I have coded a version that tests up to 8192 number simultaneously. It 
is supposed to be able to run at over 200 MHz. Some of the number would 
take a large number of iterations though.

I found something strange, the code to count for more numbers does not 
increase the size of the core very much. (But it takes longer to build.) 
I am not sure if this occurs because there is a mitsake somewhere, or if 
the tools are able to figure out how to share all the logic. I am 
thinking the logic for adders is shared somehow. 128-bit arithmetic is 
being used.


> There's lots of scope for both high-level work (the housekeeping and 
> tracking of operations for each of your worker elements) and low-level 
> work (getting the carry operations to work at high speed).
> 
Yeah, I am wondering how to display results without using a CPU. I was 
thinking a running indicator of the current number group under test. I 
was going to have it stop after the number exceeds 128-bits.

>> I wonder if the calculation can be broken down further. Numbers are 
>> all just defined by rules. I wonder if some calculus may apply. Limit 
>> of function approaches one for a finite number of steps. There may be 
>> a whole range of algorithms, is there one for the number two? Then the 
>> number three?
>>
>> Occasionally I have made useless machinery. Great puzzle work. Made 
>> the game of life in hardware.
>>
>

[toc] | [prev] | [next] | [standalone]

#114998

From	MitchAlsup <user5857@newsgrouper.org.invalid>
Date	2026-02-15 19:16 +0000
Message-ID	<1771182989-5857@newsgrouper.org>
In reply to	#114993

Robert Finch <robfi680@gmail.com> posted:

> On 2026-02-15 7:14 a.m., David Brown wrote:
> > On 15/02/2026 09:58, Robert Finch wrote:
> > 
> >> I wonder how difficult it would be to implement a parallel tester in 
> >> an FPGA. It looks simple enough and there are hundreds of DSP blocks 
> >> available to use. Could test in blocks of hundreds of numbers at a time.
> >> Running at 100 MHz * 200 tests at a time = how long to get to 2^71?
> >>
> > 
> > One of the nice things about this is that you don't need DSP blocks - it 
> Yeah, I figured that out after coding. 3n+1 = n+n+n+1

3n+1 = (n<<1)+n+1

one add with carry in and n gets wired in directly and also shifted
up by 1 (no logic shifter).

[toc] | [prev] | [next] | [standalone]

#115001

From	Terje Mathisen <terje.mathisen@tmsw.no>
Date	2026-02-15 22:40 +0100
Message-ID	<10mteg5$c68b$1@dont-email.me>
In reply to	#114998

MitchAlsup wrote:
> 
> Robert Finch <robfi680@gmail.com> posted:
> 
>> On 2026-02-15 7:14 a.m., David Brown wrote:
>>> On 15/02/2026 09:58, Robert Finch wrote:
>>>
>>>> I wonder how difficult it would be to implement a parallel tester in
>>>> an FPGA. It looks simple enough and there are hundreds of DSP blocks
>>>> available to use. Could test in blocks of hundreds of numbers at a time.
>>>> Running at 100 MHz * 200 tests at a time = how long to get to 2^71?
>>>>
>>>
>>> One of the nice things about this is that you don't need DSP blocks - it
>> Yeah, I figured that out after coding. 3n+1 = n+n+n+1
> 
> 3n+1 = (n<<1)+n+1
> 
> one add with carry in and n gets wired in directly and also shifted
> up by 1 (no logic shifter).

Even better:

(3n+1)/2 = n + (n>>1) + 1

I.e two iterations for the cost of one.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]

#115154

From	Thomas Koenig <tkoenig@netcologne.de>
Date	2026-02-26 19:31 +0000
Message-ID	<10nq728$1tcqi$1@dont-email.me>
In reply to	#115001

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> MitchAlsup wrote:
>> 
>> Robert Finch <robfi680@gmail.com> posted:
>> 
>>> On 2026-02-15 7:14 a.m., David Brown wrote:
>>>> On 15/02/2026 09:58, Robert Finch wrote:
>>>>
>>>>> I wonder how difficult it would be to implement a parallel tester in
>>>>> an FPGA. It looks simple enough and there are hundreds of DSP blocks
>>>>> available to use. Could test in blocks of hundreds of numbers at a time.
>>>>> Running at 100 MHz * 200 tests at a time = how long to get to 2^71?
>>>>>
>>>>
>>>> One of the nice things about this is that you don't need DSP blocks - it
>>> Yeah, I figured that out after coding. 3n+1 = n+n+n+1
>> 
>> 3n+1 = (n<<1)+n+1
>> 
>> one add with carry in and n gets wired in directly and also shifted
>> up by 1 (no logic shifter).
>
> Even better:
>
> (3n+1)/2 = n + (n>>1) + 1
>
> I.e two iterations for the cost of one.

For b = n + (n>>1) + 1, for odd n, a bit of a micro-optimization is
possible because lowest-order bit of the result equals bit number
1 from n.

-- 
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

[toc] | [prev] | [next] | [standalone]

#115034

From	Michael S <already5chosen@yahoo.com>
Date	2026-02-18 18:22 +0200
Message-ID	<20260218182251.00004e44@yahoo.com>
In reply to	#114993

On Sun, 15 Feb 2026 10:56:51 -0500
Robert Finch <robfi680@gmail.com> wrote:

> I have coded a version that tests up to 8192 number simultaneously.
> It is supposed to be able to run at over 200 MHz. Some of the number
> would take a large number of iterations though.
> 
> I found something strange, the code to count for more numbers does
> not increase the size of the core very much. (But it takes longer to
> build.) I am not sure if this occurs because there is a mitsake
> somewhere, or if the tools are able to figure out how to share all
> the logic. I am thinking the logic for adders is shared somehow.
> 128-bit arithmetic is being used.
> 

There is a mistake in your code. No other explanation is possible.

You should look for 3 sorts of mistakes
1. Core itself.
The most likely mistake is failing to look for one of the ending
conditions. You have to look for 3 conditions:
a) - value produced at current step is smaller than initial value. 
 That is successful ending. Here you signal to the scheduler that you
 are ready to accept the next value.
b) - overflow. Next value exceeds the width of register.
c) - timeout. You were running for more than N steps. N=8192 looks
like the reasonable value.
a) and b) are suspected failures.
The handling for b) and c) is similar - you raise alert signal and stop.
Your test unit has to provide supervisor circuity with the mean to
fetch the initial value that caused suspected failure and to clear
alert condition. 
After alert is cleared your unit returns to the same state as after
success - waiting for next initial value.

2. Scheduler. 
In proof-of-concept design the scheduler does not have to be very
realistic. For example, it does not have to be efficiently pipelined.
But it should be able to monitor ready signals of all test units and to
supply different initial data to different units.
Supplying the same data to all units simultaneously is a sort of
mistake that explains your observed behavior.

3. Supervisor
Again, in proof-of-concept design the supervisor does not have to be
very realistic. But it should be able to selectively exercise all core
feature related to alert, including fetching of initial data and
clearing of alert.

After you coded everything correctly, do not even try to compile test
with 8000 units. That is totally unrealistic. Even if you find huge
FPGA in which such number of testers could fit (and I am not sure that
such FPGA exists) the compilation will take many hours, possibly days.

[toc] | [prev] | [next] | [standalone]

#115038

From	Robert Finch <robfi680@gmail.com>
Date	2026-02-18 22:56 -0500
Message-ID	<10n61l5$36tk6$1@dont-email.me>
In reply to	#115034

On 2026-02-18 11:22 a.m., Michael S wrote:
> On Sun, 15 Feb 2026 10:56:51 -0500
> Robert Finch <robfi680@gmail.com> wrote:
> 
>> I have coded a version that tests up to 8192 number simultaneously.
>> It is supposed to be able to run at over 200 MHz. Some of the number
>> would take a large number of iterations though.
>>
>> I found something strange, the code to count for more numbers does
>> not increase the size of the core very much. (But it takes longer to
>> build.) I am not sure if this occurs because there is a mitsake
>> somewhere, or if the tools are able to figure out how to share all
>> the logic. I am thinking the logic for adders is shared somehow.
>> 128-bit arithmetic is being used.
>>
> 
> There is a mistake in your code. No other explanation is possible.
> 
> You should look for 3 sorts of mistakes
> 1. Core itself.
> The most likely mistake is failing to look for one of the ending
> conditions. You have to look for 3 conditions:
> a) - value produced at current step is smaller than initial value.
>   That is successful ending. Here you signal to the scheduler that you
>   are ready to accept the next value.
> b) - overflow. Next value exceeds the width of register.
> c) - timeout. You were running for more than N steps. N=8192 looks
> like the reasonable value.
> a) and b) are suspected failures.
> The handling for b) and c) is similar - you raise alert signal and stop.
> Your test unit has to provide supervisor circuity with the mean to
> fetch the initial value that caused suspected failure and to clear
> alert condition.
> After alert is cleared your unit returns to the same state as after
> success - waiting for next initial value.

I found the mistake. Core was working, but groups of cores were not. 
Done signals were missing.

> 
> 2. Scheduler.
> In proof-of-concept design the scheduler does not have to be very
> realistic. For example, it does not have to be efficiently pipelined.
> But it should be able to monitor ready signals of all test units and to
> supply different initial data to different units.
> Supplying the same data to all units simultaneously is a sort of
> mistake that explains your observed behavior.
> 
There is really simple scheduling ATM. There is just a bunch of done 
signals that are gated together. Meaning the slowest calculation of the 
group will slow everything else down. Once all the dones are true the 
next set of numbers are loaded.

> 3. Supervisor
> Again, in proof-of-concept design the supervisor does not have to be
> very realistic. But it should be able to selectively exercise all core
> feature related to alert, including fetching of initial data and
> clearing of alert.
> 
> After you coded everything correctly, do not even try to compile test
> with 8000 units. That is totally unrealistic. Even if you find huge
> FPGA in which such number of testers could fit (and I am not sure that
> such FPGA exists) the compilation will take many hours, possibly days.
> 
> 
> 
Yeah 8,000 was definitely out of the picture. I had since coded 24x24 
matrix (576) and although the logic would fit that turned out not to be 
able to route. So, it is down to 20x20. 20x20 with no real scheduling fits.

I tried displaying progress as an on-screen bitmap, but that just 
appeared noisy. I have been working on an SCP so may be able to use that 
to manage the scheduling with some custom logic.

[toc] | [prev] | [next] | [standalone]

#115041

From	Michael S <already5chosen@yahoo.com>
Date	2026-02-19 12:10 +0200
Message-ID	<20260219121041.0000505b@yahoo.com>
In reply to	#115038

On Wed, 18 Feb 2026 22:56:19 -0500
Robert Finch <robfi680@gmail.com> wrote:

> On 2026-02-18 11:22 a.m., Michael S wrote:
> > On Sun, 15 Feb 2026 10:56:51 -0500
> > Robert Finch <robfi680@gmail.com> wrote:
> >   
> >> I have coded a version that tests up to 8192 number simultaneously.
> >> It is supposed to be able to run at over 200 MHz. Some of the
> >> number would take a large number of iterations though.
> >>
> >> I found something strange, the code to count for more numbers does
> >> not increase the size of the core very much. (But it takes longer
> >> to build.) I am not sure if this occurs because there is a mitsake
> >> somewhere, or if the tools are able to figure out how to share all
> >> the logic. I am thinking the logic for adders is shared somehow.
> >> 128-bit arithmetic is being used.
> >>  
> > 
> > There is a mistake in your code. No other explanation is possible.
> > 
> > You should look for 3 sorts of mistakes
> > 1. Core itself.
> > The most likely mistake is failing to look for one of the ending
> > conditions. You have to look for 3 conditions:
> > a) - value produced at current step is smaller than initial value.
> >   That is successful ending. Here you signal to the scheduler that
> > you are ready to accept the next value.
> > b) - overflow. Next value exceeds the width of register.
> > c) - timeout. You were running for more than N steps. N=8192 looks
> > like the reasonable value.
> > a) and b) are suspected failures.
> > The handling for b) and c) is similar - you raise alert signal and
> > stop. Your test unit has to provide supervisor circuity with the
> > mean to fetch the initial value that caused suspected failure and
> > to clear alert condition.
> > After alert is cleared your unit returns to the same state as after
> > success - waiting for next initial value.  
> 
> I found the mistake. Core was working, but groups of cores were not. 
> Done signals were missing.
> 
> > 
> > 2. Scheduler.
> > In proof-of-concept design the scheduler does not have to be very
> > realistic. For example, it does not have to be efficiently
> > pipelined. But it should be able to monitor ready signals of all
> > test units and to supply different initial data to different units.
> > Supplying the same data to all units simultaneously is a sort of
> > mistake that explains your observed behavior.
> >   
> There is really simple scheduling ATM. There is just a bunch of done 
> signals that are gated together. Meaning the slowest calculation of
> the group will slow everything else down. Once all the dones are true
> the next set of numbers are loaded.
> 
> > 3. Supervisor
> > Again, in proof-of-concept design the supervisor does not have to be
> > very realistic. But it should be able to selectively exercise all
> > core feature related to alert, including fetching of initial data
> > and clearing of alert.
> > 
> > After you coded everything correctly, do not even try to compile
> > test with 8000 units. That is totally unrealistic. Even if you find
> > huge FPGA in which such number of testers could fit (and I am not
> > sure that such FPGA exists) the compilation will take many hours,
> > possibly days.
> > 
> > 
> >   
> Yeah 8,000 was definitely out of the picture. I had since coded 24x24 
> matrix (576) and although the logic would fit that turned out not to
> be able to route. So, it is down to 20x20. 20x20 with no real
> scheduling fits.
> 
> I tried displaying progress as an on-screen bitmap, but that just 
> appeared noisy. I have been working on an SCP so may be able to use
> that to manage the scheduling with some custom logic.
> 
> 
> 

Does 20x20 means 400 cores with something like 80-bit initial values
and 128-bit processing?
If you didn't make further mistakes then it should be rather big, order
of 300K LEs. It would not fit in the biggest low-end Altera device
(Cyclone 10GX 10CX220).
On the Xilinx side, it would not fit in any Spartan and in most Artix
devices. May be, it would fit in the biggest Artix UltraScale+ (AU25P).
But more realistically for 400 test units one would want Arria 10 on
Altera side or Kintex on Xilinx side. Both are not prohibitively
expensive, but not cheap.

What device are you using?
If it is smaller than Kintex KU5P then I'd strongly suspect that you
didn't clean out all mistakes.

[toc] | [prev] | [next] | [standalone]

#115066

From	Robert Finch <robfi680@gmail.com>
Date	2026-02-21 01:56 -0500
Message-ID	<10nbku5$11ko8$1@dont-email.me>
In reply to	#115041

On 2026-02-19 5:10 a.m., Michael S wrote:
> On Wed, 18 Feb 2026 22:56:19 -0500
> Robert Finch <robfi680@gmail.com> wrote:
> 
>> On 2026-02-18 11:22 a.m., Michael S wrote:
>>> On Sun, 15 Feb 2026 10:56:51 -0500
>>> Robert Finch <robfi680@gmail.com> wrote:
>>>    
>>>> I have coded a version that tests up to 8192 number simultaneously.
>>>> It is supposed to be able to run at over 200 MHz. Some of the
>>>> number would take a large number of iterations though.
>>>>
>>>> I found something strange, the code to count for more numbers does
>>>> not increase the size of the core very much. (But it takes longer
>>>> to build.) I am not sure if this occurs because there is a mitsake
>>>> somewhere, or if the tools are able to figure out how to share all
>>>> the logic. I am thinking the logic for adders is shared somehow.
>>>> 128-bit arithmetic is being used.
>>>>   
>>>
>>> There is a mistake in your code. No other explanation is possible.
>>>
>>> You should look for 3 sorts of mistakes
>>> 1. Core itself.
>>> The most likely mistake is failing to look for one of the ending
>>> conditions. You have to look for 3 conditions:
>>> a) - value produced at current step is smaller than initial value.
>>>    That is successful ending. Here you signal to the scheduler that
>>> you are ready to accept the next value.
>>> b) - overflow. Next value exceeds the width of register.
>>> c) - timeout. You were running for more than N steps. N=8192 looks
>>> like the reasonable value.
>>> a) and b) are suspected failures.
>>> The handling for b) and c) is similar - you raise alert signal and
>>> stop. Your test unit has to provide supervisor circuity with the
>>> mean to fetch the initial value that caused suspected failure and
>>> to clear alert condition.
>>> After alert is cleared your unit returns to the same state as after
>>> success - waiting for next initial value.
>>
>> I found the mistake. Core was working, but groups of cores were not.
>> Done signals were missing.
>>
>>>
>>> 2. Scheduler.
>>> In proof-of-concept design the scheduler does not have to be very
>>> realistic. For example, it does not have to be efficiently
>>> pipelined. But it should be able to monitor ready signals of all
>>> test units and to supply different initial data to different units.
>>> Supplying the same data to all units simultaneously is a sort of
>>> mistake that explains your observed behavior.
>>>    
>> There is really simple scheduling ATM. There is just a bunch of done
>> signals that are gated together. Meaning the slowest calculation of
>> the group will slow everything else down. Once all the dones are true
>> the next set of numbers are loaded.
>>
>>> 3. Supervisor
>>> Again, in proof-of-concept design the supervisor does not have to be
>>> very realistic. But it should be able to selectively exercise all
>>> core feature related to alert, including fetching of initial data
>>> and clearing of alert.
>>>
>>> After you coded everything correctly, do not even try to compile
>>> test with 8000 units. That is totally unrealistic. Even if you find
>>> huge FPGA in which such number of testers could fit (and I am not
>>> sure that such FPGA exists) the compilation will take many hours,
>>> possibly days.
>>>
>>>
>>>    
>> Yeah 8,000 was definitely out of the picture. I had since coded 24x24
>> matrix (576) and although the logic would fit that turned out not to
>> be able to route. So, it is down to 20x20. 20x20 with no real
>> scheduling fits.
>>
>> I tried displaying progress as an on-screen bitmap, but that just
>> appeared noisy. I have been working on an SCP so may be able to use
>> that to manage the scheduling with some custom logic.
>>
>>
>>
> 
> Does 20x20 means 400 cores with something like 80-bit initial values
> and 128-bit processing?
Yes. 128-bit processing.

> If you didn't make further mistakes then it should be rather big, order
> of 300K LEs. It would not fit in the biggest low-end Altera device
> (Cyclone 10GX 10CX220).

IIRC it is about 150k LUTs. It uses the simplest approach, which is not 
very fast iteration wise, but it allows a lot of cores and a fast clock 
cycles time. One core is very small and has a fast iteration (1 clock 
cycle). But it takes a lot more iterations than can be done with a more 
brainiac approach.

Using a more brainiac approach would likely cut the performance in half 
and use a lot more LUTs.

> On the Xilinx side, it would not fit in any Spartan and in most Artix
> devices. May be, it would fit in the biggest Artix UltraScale+ (AU25P).
> But more realistically for 400 test units one would want Arria 10 on
> Altera side or Kintex on Xilinx side. Both are not prohibitively
> expensive, but not cheap.
> 
> What device are you using?
I am not using an too inexpensive device. I have a Kintex-325 with about 
200k LUTs.

> If it is smaller than Kintex KU5P then I'd strongly suspect that you
> didn't clean out all mistakes.
> 
> 
> 
There could be more mistakes for sure. But I am sure the lowest level 
core works. It seemed to in simulation. I made a slight mod to it since 
I tested it last, checking n < startn.
Counts are not used so the logic would be stripped out, but its good for 
verification.

1 Core is:

module Collatz_conjecture(rst, clk, startn, count, done);
input rst;
input clk;
input [127:0] startn;
output reg [127:0] count;
output reg done;
reg [127:0] n;

always_ff @(posedge clk)
if (rst) begin
	n <= startn;
	count <= 0;
	done <= FALSE;
end
else begin
	if (!done) begin
		if (~n[0])
			n <= n >> 2'd1;
		else
			n <= n+n+n+1;
		if (n < startn)
			done <= TRUE;
		count <= count + 1;
	end
end

endmodule

[toc] | [prev] | [next] | [standalone]

#115068

From	Michael S <already5chosen@yahoo.com>
Date	2026-02-21 19:23 +0200
Message-ID	<20260221192302.000030d9@yahoo.com>
In reply to	#115066

On Sat, 21 Feb 2026 01:56:03 -0500
Robert Finch <robfi680@gmail.com> wrote:

> On 2026-02-19 5:10 a.m., Michael S wrote:
> > On Wed, 18 Feb 2026 22:56:19 -0500
> > Robert Finch <robfi680@gmail.com> wrote:
> >   
> >> On 2026-02-18 11:22 a.m., Michael S wrote:  
> >>> On Sun, 15 Feb 2026 10:56:51 -0500
> >>> Robert Finch <robfi680@gmail.com> wrote:
> >>>      
> >>>> I have coded a version that tests up to 8192 number
> >>>> simultaneously. It is supposed to be able to run at over 200
> >>>> MHz. Some of the number would take a large number of iterations
> >>>> though.
> >>>>
> >>>> I found something strange, the code to count for more numbers
> >>>> does not increase the size of the core very much. (But it takes
> >>>> longer to build.) I am not sure if this occurs because there is
> >>>> a mitsake somewhere, or if the tools are able to figure out how
> >>>> to share all the logic. I am thinking the logic for adders is
> >>>> shared somehow. 128-bit arithmetic is being used.
> >>>>     
> >>>
> >>> There is a mistake in your code. No other explanation is possible.
> >>>
> >>> You should look for 3 sorts of mistakes
> >>> 1. Core itself.
> >>> The most likely mistake is failing to look for one of the ending
> >>> conditions. You have to look for 3 conditions:
> >>> a) - value produced at current step is smaller than initial value.
> >>>    That is successful ending. Here you signal to the scheduler
> >>> that you are ready to accept the next value.
> >>> b) - overflow. Next value exceeds the width of register.
> >>> c) - timeout. You were running for more than N steps. N=8192 looks
> >>> like the reasonable value.
> >>> a) and b) are suspected failures.
> >>> The handling for b) and c) is similar - you raise alert signal and
> >>> stop. Your test unit has to provide supervisor circuity with the
> >>> mean to fetch the initial value that caused suspected failure and
> >>> to clear alert condition.
> >>> After alert is cleared your unit returns to the same state as
> >>> after success - waiting for next initial value.  
> >>
> >> I found the mistake. Core was working, but groups of cores were
> >> not. Done signals were missing.
> >>  
> >>>
> >>> 2. Scheduler.
> >>> In proof-of-concept design the scheduler does not have to be very
> >>> realistic. For example, it does not have to be efficiently
> >>> pipelined. But it should be able to monitor ready signals of all
> >>> test units and to supply different initial data to different
> >>> units. Supplying the same data to all units simultaneously is a
> >>> sort of mistake that explains your observed behavior.
> >>>      
> >> There is really simple scheduling ATM. There is just a bunch of
> >> done signals that are gated together. Meaning the slowest
> >> calculation of the group will slow everything else down. Once all
> >> the dones are true the next set of numbers are loaded.
> >>  
> >>> 3. Supervisor
> >>> Again, in proof-of-concept design the supervisor does not have to
> >>> be very realistic. But it should be able to selectively exercise
> >>> all core feature related to alert, including fetching of initial
> >>> data and clearing of alert.
> >>>
> >>> After you coded everything correctly, do not even try to compile
> >>> test with 8000 units. That is totally unrealistic. Even if you
> >>> find huge FPGA in which such number of testers could fit (and I
> >>> am not sure that such FPGA exists) the compilation will take many
> >>> hours, possibly days.
> >>>
> >>>
> >>>      
> >> Yeah 8,000 was definitely out of the picture. I had since coded
> >> 24x24 matrix (576) and although the logic would fit that turned
> >> out not to be able to route. So, it is down to 20x20. 20x20 with
> >> no real scheduling fits.
> >>
> >> I tried displaying progress as an on-screen bitmap, but that just
> >> appeared noisy. I have been working on an SCP so may be able to use
> >> that to manage the scheduling with some custom logic.
> >>
> >>
> >>  
> > 
> > Does 20x20 means 400 cores with something like 80-bit initial values
> > and 128-bit processing?  
> Yes. 128-bit processing.
> 
> > If you didn't make further mistakes then it should be rather big,
> > order of 300K LEs. It would not fit in the biggest low-end Altera
> > device (Cyclone 10GX 10CX220).  
> 
> IIRC it is about 150k LUTs. It uses the simplest approach, which is
> not very fast iteration wise, but it allows a lot of cores and a fast
> clock cycles time. One core is very small and has a fast iteration (1
> clock cycle). But it takes a lot more iterations than can be done
> with a more brainiac approach.
> 
> Using a more brainiac approach would likely cut the performance in
> half and use a lot more LUTs.
> 
> > On the Xilinx side, it would not fit in any Spartan and in most
> > Artix devices. May be, it would fit in the biggest Artix
> > UltraScale+ (AU25P). But more realistically for 400 test units one
> > would want Arria 10 on Altera side or Kintex on Xilinx side. Both
> > are not prohibitively expensive, but not cheap.
> > 
> > What device are you using?  
> I am not using an too inexpensive device. I have a Kintex-325 with
> about 200k LUTs.
> 
> > If it is smaller than Kintex KU5P then I'd strongly suspect that you
> > didn't clean out all mistakes.
> > 
> > 
> >   
> There could be more mistakes for sure. But I am sure the lowest level 
> core works. It seemed to in simulation. I made a slight mod to it
> since I tested it last, checking n < startn.
> Counts are not used so the logic would be stripped out, but its good
> for verification.
> 
> 1 Core is:
> 
> module Collatz_conjecture(rst, clk, startn, count, done);
> input rst;
> input clk;
> input [127:0] startn;
> output reg [127:0] count;
> output reg done;
> reg [127:0] n;
> 
> always_ff @(posedge clk)
> if (rst) begin
> 	n <= startn;
> 	count <= 0;
> 	done <= FALSE;
> end
> else begin
> 	if (!done) begin
> 		if (~n[0])
> 			n <= n >> 2'd1;
> 		else
> 			n <= n+n+n+1;
> 		if (n < startn)
> 			done <= TRUE;
> 		count <= count + 1;
> 	end
> end
> 
> endmodule
> 
>

[toc] | [prev] | [next] | [standalone]

#115070

From	Michael S <already5chosen@yahoo.com>
Date	2026-02-21 20:36 +0200
Message-ID	<20260221203651.00000b57@yahoo.com>
In reply to	#115066

On Sat, 21 Feb 2026 01:56:03 -0500
Robert Finch <robfi680@gmail.com> wrote:

> On 2026-02-19 5:10 a.m., Michael S wrote:
> > On Wed, 18 Feb 2026 22:56:19 -0500
> > Robert Finch <robfi680@gmail.com> wrote:
> >   
> >> On 2026-02-18 11:22 a.m., Michael S wrote:  
> >>> On Sun, 15 Feb 2026 10:56:51 -0500
> >>> Robert Finch <robfi680@gmail.com> wrote:
> >>>      
> >>>> I have coded a version that tests up to 8192 number
> >>>> simultaneously. It is supposed to be able to run at over 200
> >>>> MHz. Some of the number would take a large number of iterations
> >>>> though.
> >>>>
> >>>> I found something strange, the code to count for more numbers
> >>>> does not increase the size of the core very much. (But it takes
> >>>> longer to build.) I am not sure if this occurs because there is
> >>>> a mitsake somewhere, or if the tools are able to figure out how
> >>>> to share all the logic. I am thinking the logic for adders is
> >>>> shared somehow. 128-bit arithmetic is being used.
> >>>>     
> >>>
> >>> There is a mistake in your code. No other explanation is possible.
> >>>
> >>> You should look for 3 sorts of mistakes
> >>> 1. Core itself.
> >>> The most likely mistake is failing to look for one of the ending
> >>> conditions. You have to look for 3 conditions:
> >>> a) - value produced at current step is smaller than initial value.
> >>>    That is successful ending. Here you signal to the scheduler
> >>> that you are ready to accept the next value.
> >>> b) - overflow. Next value exceeds the width of register.
> >>> c) - timeout. You were running for more than N steps. N=8192 looks
> >>> like the reasonable value.
> >>> a) and b) are suspected failures.
> >>> The handling for b) and c) is similar - you raise alert signal and
> >>> stop. Your test unit has to provide supervisor circuity with the
> >>> mean to fetch the initial value that caused suspected failure and
> >>> to clear alert condition.
> >>> After alert is cleared your unit returns to the same state as
> >>> after success - waiting for next initial value.  
> >>
> >> I found the mistake. Core was working, but groups of cores were
> >> not. Done signals were missing.
> >>  
> >>>
> >>> 2. Scheduler.
> >>> In proof-of-concept design the scheduler does not have to be very
> >>> realistic. For example, it does not have to be efficiently
> >>> pipelined. But it should be able to monitor ready signals of all
> >>> test units and to supply different initial data to different
> >>> units. Supplying the same data to all units simultaneously is a
> >>> sort of mistake that explains your observed behavior.
> >>>      
> >> There is really simple scheduling ATM. There is just a bunch of
> >> done signals that are gated together. Meaning the slowest
> >> calculation of the group will slow everything else down. Once all
> >> the dones are true the next set of numbers are loaded.
> >>  
> >>> 3. Supervisor
> >>> Again, in proof-of-concept design the supervisor does not have to
> >>> be very realistic. But it should be able to selectively exercise
> >>> all core feature related to alert, including fetching of initial
> >>> data and clearing of alert.
> >>>
> >>> After you coded everything correctly, do not even try to compile
> >>> test with 8000 units. That is totally unrealistic. Even if you
> >>> find huge FPGA in which such number of testers could fit (and I
> >>> am not sure that such FPGA exists) the compilation will take many
> >>> hours, possibly days.
> >>>
> >>>
> >>>      
> >> Yeah 8,000 was definitely out of the picture. I had since coded
> >> 24x24 matrix (576) and although the logic would fit that turned
> >> out not to be able to route. So, it is down to 20x20. 20x20 with
> >> no real scheduling fits.
> >>
> >> I tried displaying progress as an on-screen bitmap, but that just
> >> appeared noisy. I have been working on an SCP so may be able to use
> >> that to manage the scheduling with some custom logic.
> >>
> >>
> >>  
> > 
> > Does 20x20 means 400 cores with something like 80-bit initial values
> > and 128-bit processing?  
> Yes. 128-bit processing.
> 
> > If you didn't make further mistakes then it should be rather big,
> > order of 300K LEs. It would not fit in the biggest low-end Altera
> > device (Cyclone 10GX 10CX220).  
> 
> IIRC it is about 150k LUTs. It uses the simplest approach, which is
> not very fast iteration wise, but it allows a lot of cores and a fast
> clock cycles time. One core is very small and has a fast iteration (1
> clock cycle). But it takes a lot more iterations than can be done
> with a more brainiac approach.
> 
> Using a more brainiac approach would likely cut the performance in
> half and use a lot more LUTs.
> 

According to my experiments, combining 2 steps have very small negative
effect on achivable clock and increases area by ~1.8x. So, to me it
looks like a win.
That's the combined step I am talking about:
  x = n/4;
  switch (x % 4) {
    case 0: n = x; break;
    case 1: n = 3*x+1; break;
    case 2: n = 2*x+2; break;
    case 3: n = 9*x+8; break;
  }



> > On the Xilinx side, it would not fit in any Spartan and in most
> > Artix devices. May be, it would fit in the biggest Artix
> > UltraScale+ (AU25P). But more realistically for 400 test units one
> > would want Arria 10 on Altera side or Kintex on Xilinx side. Both
> > are not prohibitively expensive, but not cheap.
> > 
> > What device are you using?  
> I am not using an too inexpensive device. I have a Kintex-325 with
> about 200k LUTs.
> 

According to Xilinx marketing materials XC7K325T has 326080 equivalents
of traditional LEs.
That explains why your design fits.

Pay attention that while Kintex evaluation boards could be
non-expensive, right now buying devices for actual production work is
rather costly, especially in small quantities.
The lowest price I see on digikey site is 1147 USD. That's a bad value
relatively to many other FPGAs.


> > If it is smaller than Kintex KU5P then I'd strongly suspect that you
> > didn't clean out all mistakes.
> > 
> > 
> >   
> There could be more mistakes for sure. But I am sure the lowest level 
> core works. It seemed to in simulation. I made a slight mod to it
> since I tested it last, checking n < startn.
> Counts are not used so the logic would be stripped out, but its good
> for verification.
> 
> 1 Core is:
> 
> module Collatz_conjecture(rst, clk, startn, count, done);
> input rst;
> input clk;
> input [127:0] startn;
> output reg [127:0] count;
> output reg done;
> reg [127:0] n;
> 
> always_ff @(posedge clk)
> if (rst) begin
> 	n <= startn;
> 	count <= 0;
> 	done <= FALSE;
> end
> else begin
> 	if (!done) begin
> 		if (~n[0])
> 			n <= n >> 2'd1;
> 		else
> 			n <= n+n+n+1;
> 		if (n < startn)
> 			done <= TRUE;
> 		count <= count + 1;
> 	end
> end
> 
> endmodule
> 


Ok.
1. Your core misses one critical part - overflow check.
Assuming 80-bit initial count, 128-bit register will overflow
eventually. It would happen very rarely, but it would happen.
The check for overflow does not have to be precise. Conservative
approach, like looking at 2 MS bits and one LS bit and signaling
overflow not when it really happened but when it *could* happen, i.e.
when MS bits > 0 and LS bit='1' is good enough. But the check can't be
omitted.

2. I see that instead of calculating n = (n*3+1)/2 as one step you are
doing it in two separate steps. I am not sure that it is really makes
your core smaller or faster. And it certainly increases # of iterations
by 1/3rd.

3.
128-bit count is pointless loss of resources. 12 or 13 bit are
sufficient. If the count exceeded that size then you already found
something of high interest and it is a job of supervisor to find out
what exactly had happened.
128-bit counter actually not just too big, it is also pointless,
because if we pretend to believe that our purpose is to disprove
Collatz's conjecture then supervisor would have to implement stop
condition with some sort of timeout that will effectively duplicate
functionality of the counter.

4.
Your module does not latch the value of startn. It means that caller
module has to do it on it's behalf. That is the most likely place where
one can make a mistake.

5.
Coding style comment. Normally, name rst is used for asynchronous
reset signal. In your design rst is a synchronous signal. For the
benefit of casual reader I suggest to find different name.

[toc] | [prev] | [next] | [standalone]

#115084

From	Michael S <already5chosen@yahoo.com>
Date	2026-02-22 10:46 +0200
Message-ID	<20260222104600.0000372b@yahoo.com>
In reply to	#115070

On Sat, 21 Feb 2026 20:36:51 +0200
Michael S <already5chosen@yahoo.com> wrote:

> > 
> > Using a more brainiac approach would likely cut the performance in
> > half and use a lot more LUTs.
> >   
> 
> According to my experiments, combining 2 steps have very small
> negative effect on achivable clock and increases area by ~1.8x. So,
> to me it looks like a win.
> That's the combined step I am talking about:
>   x = n/4;
>   switch (x % 4) {
>     case 0: n = x; break;
>     case 1: n = 3*x+1; break;
>     case 2: n = 2*x+2; break;
>     case 3: n = 9*x+8; break;
>   }
> 
> 
> 

Mistake in C above. Should be 
 switch (n % 4) {

[toc] | [prev] | [next] | [standalone]

#115087

From	Terje Mathisen <terje.mathisen@tmsw.no>
Date	2026-02-22 12:45 +0100
Message-ID	<10neq97$22qd5$1@dont-email.me>
In reply to	#115084

Michael S wrote:
> On Sat, 21 Feb 2026 20:36:51 +0200
> Michael S <already5chosen@yahoo.com> wrote:
> 
>>>
>>> Using a more brainiac approach would likely cut the performance in
>>> half and use a lot more LUTs.
>>>    
>>
>> According to my experiments, combining 2 steps have very small
>> negative effect on achivable clock and increases area by ~1.8x. So,
>> to me it looks like a win.
>> That's the combined step I am talking about:
>>    x = n/4;
>>    switch (x % 4) {
>>      case 0: n = x; break;
>>      case 1: n = 3*x+1; break;
>>      case 2: n = 2*x+2; break;
>>      case 3: n = 9*x+8; break;
>>    }
>>
>>
>>
> 
> Mistake in C above. Should be
>   switch (n % 4) {

I did notice that. :-)

The one thing that worries me (sw on a 64-bit platform) about the code 
is the 9* on 128-bit variables:

  9*x+8 =>

Do we use SHLD + SHL here or something else?

How about MUL & LEA?

;; Input in r10:r9, output in rdx:rax
mov rax,r9
mul rax,rbx	;; RBX == 9
lea r10,[r10+r10*8]
add rdx,r10

That looks like 5-6 clock cycles, so the branch misses from the switch 
statement would probably dominate unless you do as I suggested and use 
lookup tables instead:

   let bot2 = n & 3;
   let x = n >> 2;
   n = x*multab[bot2] + addtab[bot2];

but if we do that, then (at least for a sw implementation) it would be 
better to pick a lot more of the LS bits, at least 8-12?

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]

#115092

From	Michael S <already5chosen@yahoo.com>
Date	2026-02-22 16:35 +0200
Message-ID	<20260222163502.000026d4@yahoo.com>
In reply to	#115087

On Sun, 22 Feb 2026 12:45:44 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Michael S wrote:
> > On Sat, 21 Feb 2026 20:36:51 +0200
> > Michael S <already5chosen@yahoo.com> wrote:
> >   
> >>>
> >>> Using a more brainiac approach would likely cut the performance in
> >>> half and use a lot more LUTs.
> >>>      
> >>
> >> According to my experiments, combining 2 steps have very small
> >> negative effect on achivable clock and increases area by ~1.8x. So,
> >> to me it looks like a win.
> >> That's the combined step I am talking about:
> >>    x = n/4;
> >>    switch (x % 4) {
> >>      case 0: n = x; break;
> >>      case 1: n = 3*x+1; break;
> >>      case 2: n = 2*x+2; break;
> >>      case 3: n = 9*x+8; break;
> >>    }
> >>
> >>
> >>  
> > 
> > Mistake in C above. Should be
> >   switch (n % 4) {  
> 
> I did notice that. :-)
> 
> The one thing that worries me (sw on a 64-bit platform) about the
> code is the 9* on 128-bit variables:
> 
>   9*x+8 =>
> 
> Do we use SHLD + SHL here or something else?
> 
> How about MUL & LEA?
> 
> ;; Input in r10:r9, output in rdx:rax
> mov rax,r9
> mul rax,rbx	;; RBX == 9
> lea r10,[r10+r10*8]
> add rdx,r10
> 
> That looks like 5-6 clock cycles, so the branch misses from the
> switch statement would probably dominate unless you do as I suggested
> and use lookup tables instead:
> 
>    let bot2 = n & 3;
>    let x = n >> 2;
>    n = x*multab[bot2] + addtab[bot2];
> 
> but if we do that, then (at least for a sw implementation) it would
> be better to pick a lot more of the LS bits, at least 8-12?
> 
> Terje
> 

This variant is not intended for software implementation.
For software it is obviously sub-optimal for the reasons you suggested.
OTOH, for FPGA it seems to work a little better than both simpler and
more complicated steps.

I presented the code in C only because C is lingua franca of
comp.arch.

[toc] | [prev] | [next] | [standalone]

Page 2 of 5 — ← Prev page 1 [2] 3 4 5 Next page →

csiph-web

A useless machine

Contents

#115094

#115110

#115114

#114980

#114982

#114985

#114992

#114993

#114998

#115001

#115154

#115034

#115038

#115041

#115066

#115068

#115070

#115084

#115087

#115092