Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.arch > #114964 > unrolled thread

A useless machine

Started byThomas Koenig <tkoenig@netcologne.de>
First post2026-02-14 09:57 +0000
Last post2026-02-18 21:56 +0100
Articles 20 on this page of 81 — 12 participants

Back to article view | Back to comp.arch


Contents

  A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-14 09:57 +0000
    Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-14 18:48 +0000
      Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-14 20:18 +0100
        Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-14 21:22 +0000
          Re: A useless machine Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-02-14 13:30 -0800
          Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-15 12:50 +0100
            Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 13:48 -0600
              Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-21 21:40 +0100
                Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 15:17 -0600
                  Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 04:50 -0600
                Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:01 -0700
              Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-22 14:43 +0100
                Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 12:38 -0600
                  Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-23 10:51 +0100
                    Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-26 01:27 +0000
                      Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-26 09:21 +0100
                      Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:11 -0700
        Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 14:58 +0100
        Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 12:27 -0600
          Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-22 15:14 +0100
        Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-22 11:41 -0500
          Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-23 10:53 +0100
            Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-23 10:52 -0500
    Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-02-14 21:19 -0800
    Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 03:58 -0500
      Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-15 13:14 +0100
        Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 15:46 +0000
        Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 10:56 -0500
          Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-15 19:16 +0000
            Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 22:40 +0100
              Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-26 19:31 +0000
          Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-18 18:22 +0200
            Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-18 22:56 -0500
              Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-19 12:10 +0200
                Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-21 01:56 -0500
                  Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-21 19:23 +0200
                  Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-21 20:36 +0200
                    Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 10:46 +0200
                      Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-22 12:45 +0100
                        Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 16:35 +0200
                    Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-22 19:12 -0500
                      Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-22 19:47 -0500
                  Re: A useless machine EricP <ThatWouldBeTelling@thevillage.com> - 2026-02-21 15:39 -0500
                    Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 03:16 +0200
                      Re: A useless machine EricP <ThatWouldBeTelling@thevillage.com> - 2026-02-22 14:48 -0500
                        Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 23:25 +0200
              Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-19 11:51 +0100
                Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-19 23:12 +0200
                  Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-20 15:11 +0100
              Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-19 15:27 -0500
                Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-20 11:41 +0200
                Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 18:41 -0600
                  Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-23 11:10 +0000
                    Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-23 17:24 -0600
                      Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-24 06:44 +0000
                        Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 01:38 -0600
                          Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 03:16 -0600
                            Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 04:52 -0600
                          Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-24 11:16 +0000
                            Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 13:19 -0600
                        Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:14 -0700
            Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-20 08:01 +0000
      Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-15 15:35 +0200
        Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-15 18:33 +0200
          Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 19:19 +0000
            Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 12:31 +0200
              Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-17 15:41 +0200
                Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-17 18:17 +0200
      Re: A useless machine anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-02-15 22:41 +0000
    Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 14:52 +0100
      Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 16:45 +0000
        Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 18:01 +0100
          Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 19:42 +0000
            Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 22:59 +0100
              Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 21:53 -0500
                Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 12:40 +0200
                  Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 20:34 +0200
                Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-16 16:08 +0100
    Re: A useless machine Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-02-16 07:50 -0800
      Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 18:31 +0200
      Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-18 21:56 +0100

Page 4 of 5 — ← Prev page 1 2 3 [4] 5  Next page →


#115311

FromTim Rentsch <tr.17687@z991.linuxsc.com>
Date2026-03-12 21:14 -0700
Message-ID<86h5qk74cs.fsf@linuxsc.com>
In reply to#115122
Thomas Koenig <tkoenig@netcologne.de> writes:

> BGB <cr88192@gmail.com> schrieb:
>
>> My own looking into this led me to realize that the number-space is
>> modulo (relative to a cone with a spiral that repeats with every
>> power-of-2), and that all the odd steps only ever increment the phase
>> angle by 1, or fall down to a lower level (whenever the phase angle is
>> not incremented).
>>
>> In effect, each odd step jumps to the opposite side of the cone, but
>> slightly increments the angle.  Since the phase-angle path can never move
>> backwards, it can't touch any previous paths, thus, no loops can happen
>> this way.
>>
>>
>> The only way a cycle could exist is if (somehow) there were an
>> impassable wall where pretty much *no* value could descend unless it
>> hits the power-of-2 descent path, and then somehow miss this path as well.
>>
>> But, since "3*n+1" results in the minimum forward step, this means that
>> one of the sides *has* to hit the power-of-2 descent cases at some point.
>
> The same argument would hold for the 3n-1 problem, where there is
> more than one cycle.

3*n-1 = -(3*(-n)+1)

So 3n-1 is just the same as 3n+1 on negative numbers.

[toc] | [prev] | [next] | [standalone]


#115055

FromThomas Koenig <tkoenig@netcologne.de>
Date2026-02-20 08:01 +0000
Message-ID<10n94db$756i$1@dont-email.me>
In reply to#115034
Michael S <already5chosen@yahoo.com> schrieb:

> You should look for 3 sorts of mistakes
> 1. Core itself.
> The most likely mistake is failing to look for one of the ending
> conditions. You have to look for 3 conditions:
> a) - value produced at current step is smaller than initial value. 
>  That is successful ending. Here you signal to the scheduler that you
>  are ready to accept the next value.
> b) - overflow. Next value exceeds the width of register.
> c) - timeout. You were running for more than N steps. N=8192 looks
> like the reasonable value.

Looking at the links in https://oeis.org/A006878 , anything above
2442 would be a new record, so 8192 would be plenty.
-- 
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

[toc] | [prev] | [next] | [standalone]


#114987

FromMichael S <already5chosen@yahoo.com>
Date2026-02-15 15:35 +0200
Message-ID<20260215153526.00005240@yahoo.com>
In reply to#114982
On Sun, 15 Feb 2026 03:58:28 -0500
Robert Finch <robfi680@gmail.com> wrote:

> On 2026-02-14 4:57 a.m., Thomas Koenig wrote:
> > Some people are throwing large amounts of computing power at trying
> > to disprove the Collatz conjecture.  Graphics cards have sped this
> > up enormously.
> > 
> > (A reminder, if any is needed:  From a starting number n, the
> > transformation, recursively applied,
> > 
> > 
> >     f(n) = 3*n+1, if n is odd; =n/2 if n is even.
> > 
> > is conjectured to always reach 1 for any positive integer 1).
> > 
> > All the work is done on general-purpose hardware, which has many
> > capabilities that are not needed, and consume area and power that
> > special-purpose hardware would not need.  Also, the hardware
> > is limited to 64-bit integers, and the range of tested numbers
> > is now up to 2^71, so
> > 
> > What could specialized hardware look like?
> > 
> > The central part could be a unit which is handed a block of numbers
> > to check, for example a range of 2^32 numbers.  The current records
> > for number of iterations and maximum number reached would also be
> > stored in that unit.
> > 
> > In that unit, the calculaiton would be performed by a combination
> > of adder and shifter.  The input would always be odd (so the last
> > bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1).
> > 
> > The adder would compute (3*n+1)/2, and would be specialzed 128-bit
> > adder.  Why specialized?  The logic is far less complex than a
> > general-purpose adder.  For example, a two-bit adder has the formula
> > 
> > y1 = (!a1&a0) | (!c_in&a1&!a0);
> > y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0);
> > 
> > and its propagate and generate bits are
> > 
> > p = (a1);
> > g = (a1&a0);
> > 
> > The maximum number of trailing zeros after the (3*n+1)/2 step is
> > three and can be determined cheaply from bits <3:1> of the input,
> > so the input to the shifter (two levels only) can be provided
> > in parallel.  The machine would be designed that adding and
> > shifting can be done in a single cycle.
> > 
> > The number of iterations would be tracked; if larger than a prevous
> > record, the calculation would terminate with a corresponding signal
> > to the outside world.
> > 
> > Also, each cycle, the result from the previous round would be
> > compared against the input value.  If it is lower, the next number
> > would be chosen.  Finally, the number reached would be checked
> > against the previous, pre-set maximum, with would also signal to
> > the outside world.
> > 
> > Which numbers to chose?  It is well known that only certain
> > numbers with remainders modulo 2^n are interesting.  The number
> > can be seen on OEIS A076227.  For example, modulo 2^10, only 64
> > numbers are "interesting", or 6.25% (but only eight bits would
> > be needed to store them because the last two are always one).
> > Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the
> > amount of numbers to be checked by half.  One would use a small ROM,
> > I presume. The number of bits would be a tradeoff between area and
> > amount of compuation. My current guess would be that 2^10 could
> > be a sweet spot.  One third of the numbers would further be
> > excluded by checking for divisiblity by three.
> > 
> > The problem is embarassingly parallel.  One could chose a size so
> > each parcel is calculated in a few seconds or even minutes. to avoid
> > having to avoid a large amount of communication with the outside.
> > 
> > Because all units would be working all the time, and if one has
> > a smaller unit one could simply build more of them, I suspect that
> > power would be the main limiting factor.  An adder design that
> > might not be very fast, but uses lower power, low voltage and low
> > frequency would be beneficial.
> > 
> > Comments?
> >   
> I wonder how difficult it would be to implement a parallel tester in
> an FPGA. It looks simple enough and there are hundreds of DSP blocks 
> available to use. Could test in blocks of hundreds of numbers at a
> time. Running at 100 MHz * 200 tests at a time = how long to get to
> 2^71?
> 

I am not totally sure what exactly you are asking, but if you are
asking about how much time it takes at that rate to do 2**71 iterations
then the answer is ~3743 years.
But you probably already figured it out yourself.

Of course, brute force demonstration of correction of conjecture for all
numbers in range 1 to 2**71 will take more steps than 2**71. And it
is unknown how much more.

> I wonder if the calculation can be broken down further. Numbers are
> all just defined by rules. I wonder if some calculus may apply. Limit
> of function approaches one for a finite number of steps. There may be
> a whole range of algorithms, is there one for the number two? Then
> the number three?
> 
> Occasionally I have made useless machinery. Great puzzle work. Made
> the game of life in hardware.
> 

[toc] | [prev] | [next] | [standalone]


#114994

FromMichael S <already5chosen@yahoo.com>
Date2026-02-15 18:33 +0200
Message-ID<20260215183335.0000649e@yahoo.com>
In reply to#114987
On Sun, 15 Feb 2026 15:35:26 +0200
Michael S <already5chosen@yahoo.com> wrote:

> On Sun, 15 Feb 2026 03:58:28 -0500
> Robert Finch <robfi680@gmail.com> wrote:
> 
> > On 2026-02-14 4:57 a.m., Thomas Koenig wrote:  
> > > Some people are throwing large amounts of computing power at
> > > trying to disprove the Collatz conjecture.  Graphics cards have
> > > sped this up enormously.
> > > 
> > > (A reminder, if any is needed:  From a starting number n, the
> > > transformation, recursively applied,
> > > 
> > > 
> > >     f(n) = 3*n+1, if n is odd; =n/2 if n is even.
> > > 
> > > is conjectured to always reach 1 for any positive integer 1).
> > > 
> > > All the work is done on general-purpose hardware, which has many
> > > capabilities that are not needed, and consume area and power that
> > > special-purpose hardware would not need.  Also, the hardware
> > > is limited to 64-bit integers, and the range of tested numbers
> > > is now up to 2^71, so
> > > 
> > > What could specialized hardware look like?
> > > 
> > > The central part could be a unit which is handed a block of
> > > numbers to check, for example a range of 2^32 numbers.  The
> > > current records for number of iterations and maximum number
> > > reached would also be stored in that unit.
> > > 
> > > In that unit, the calculaiton would be performed by a combination
> > > of adder and shifter.  The input would always be odd (so the last
> > > bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1).
> > > 
> > > The adder would compute (3*n+1)/2, and would be specialzed 128-bit
> > > adder.  Why specialized?  The logic is far less complex than a
> > > general-purpose adder.  For example, a two-bit adder has the
> > > formula
> > > 
> > > y1 = (!a1&a0) | (!c_in&a1&!a0);
> > > y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0);
> > > 
> > > and its propagate and generate bits are
> > > 
> > > p = (a1);
> > > g = (a1&a0);
> > > 
> > > The maximum number of trailing zeros after the (3*n+1)/2 step is
> > > three and can be determined cheaply from bits <3:1> of the input,
> > > so the input to the shifter (two levels only) can be provided
> > > in parallel.  The machine would be designed that adding and
> > > shifting can be done in a single cycle.
> > > 
> > > The number of iterations would be tracked; if larger than a
> > > prevous record, the calculation would terminate with a
> > > corresponding signal to the outside world.
> > > 
> > > Also, each cycle, the result from the previous round would be
> > > compared against the input value.  If it is lower, the next number
> > > would be chosen.  Finally, the number reached would be checked
> > > against the previous, pre-set maximum, with would also signal to
> > > the outside world.
> > > 
> > > Which numbers to chose?  It is well known that only certain
> > > numbers with remainders modulo 2^n are interesting.  The number
> > > can be seen on OEIS A076227.  For example, modulo 2^10, only 64
> > > numbers are "interesting", or 6.25% (but only eight bits would
> > > be needed to store them because the last two are always one).
> > > Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the
> > > amount of numbers to be checked by half.  One would use a small
> > > ROM, I presume. The number of bits would be a tradeoff between
> > > area and amount of compuation. My current guess would be that
> > > 2^10 could be a sweet spot.  One third of the numbers would
> > > further be excluded by checking for divisiblity by three.
> > > 
> > > The problem is embarassingly parallel.  One could chose a size so
> > > each parcel is calculated in a few seconds or even minutes. to
> > > avoid having to avoid a large amount of communication with the
> > > outside.
> > > 
> > > Because all units would be working all the time, and if one has
> > > a smaller unit one could simply build more of them, I suspect that
> > > power would be the main limiting factor.  An adder design that
> > > might not be very fast, but uses lower power, low voltage and low
> > > frequency would be beneficial.
> > > 
> > > Comments?
> > >     
> > I wonder how difficult it would be to implement a parallel tester in
> > an FPGA. It looks simple enough and there are hundreds of DSP
> > blocks available to use. Could test in blocks of hundreds of
> > numbers at a time. Running at 100 MHz * 200 tests at a time = how
> > long to get to 2^71?
> >   
> 
> I am not totally sure what exactly you are asking, but if you are
> asking about how much time it takes at that rate to do 2**71
> iterations then the answer is ~3743 years.
> But you probably already figured it out yourself.
> 
> Of course, brute force demonstration of correction of conjecture for
> all numbers in range 1 to 2**71 will take more steps than 2**71. And
> it is unknown how much more.
> 

"Unknown" is a correct theoretically answer, but I can provide a very
educated guess that on average it would take ~500 steps and the worst
case would not exceed 2000 steps.
So, bad news are that overall it will take more than a million years on
your machine. And good news are that it'll take less than 2 million
years.

[toc] | [prev] | [next] | [standalone]


#114999

FromThomas Koenig <tkoenig@netcologne.de>
Date2026-02-15 19:19 +0000
Message-ID<10mt67k$996a$1@dont-email.me>
In reply to#114994
Michael S <already5chosen@yahoo.com> schrieb:

> "Unknown" is a correct theoretically answer, but I can provide a very
> educated guess that on average it would take ~500 steps and the worst
> case would not exceed 2000 steps.
> So, bad news are that overall it will take more than a million years on
> your machine. And good news are that it'll take less than 2 million
> years.

Not as bad, very many simplifications are possible.

For one, you can discount all even numbers as starting values.
You can go further: If you look at residues modulo 2^n of your starting
numbers, there are only A076227(n) "interesting" remaiders that
you need to look at.  If you take n=39, only 3611535862 cases
need to be looked at, so only 3611535862/2^39 of all cases need to
be looked at, a reduction of a factor around 150.  You can also
do some more reductions (only 2/3 of those numbers need checking
if you can calculate the remainder by 3 fast).  There are different
tradeoffs for different n, of course (a factor of 16 for n=10,
for example).

The main key is parallelization.
-- 
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

[toc] | [prev] | [next] | [standalone]


#115005

FromMichael S <already5chosen@yahoo.com>
Date2026-02-16 12:31 +0200
Message-ID<20260216123131.00003e22@yahoo.com>
In reply to#114999
On Sun, 15 Feb 2026 19:19:16 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

> Michael S <already5chosen@yahoo.com> schrieb:
> 
> > "Unknown" is a correct theoretically answer, but I can provide a
> > very educated guess that on average it would take ~500 steps and
> > the worst case would not exceed 2000 steps.
> > So, bad news are that overall it will take more than a million
> > years on your machine. And good news are that it'll take less than
> > 2 million years.  
> 
> Not as bad, very many simplifications are possible.
> 
> For one, you can discount all even numbers as starting values.

O.k.

> You can go further: If you look at residues modulo 2^n of your
> starting numbers, there are only A076227(n) "interesting" remaiders
> that you need to look at.

Can not say that I understood.

> If you take n=39, only 3611535862 cases
> need to be looked at, so only 3611535862/2^39 of all cases need to
> be looked at, a reduction of a factor around 150.  You can also
> do some more reductions (only 2/3 of those numbers need checking
> if you can calculate the remainder by 3 fast).  There are different
> tradeoffs for different n, of course (a factor of 16 for n=10,
> for example).
> 


The estimate above was done for loop that finishes at 1.
For brute-force disproof  of Collatz conjecture it is unnecessary. We
can stop at n < n0. I think, you mentioned it in your original post,
but I didn't pay attention. 
Due to early exit mean number of iteration is cut from ~500 to 7 or at
worst 8. 

> The main key is parallelization.

And here is a problem.
If one want to finish in, say, 1 month, one would need ALOT of
parallelization. Even after all improvements we are talking about
several millions of testing nodes. At such scale I can think about two
main problems:
1. Dispatcher become a bottleneck
2. Some node will fail. It has to be detected and their jobs
re-allocated. Which further complicates a scheduler.

It does not sound particularly hard if somebody pays for engineering,
esp. if the whole work done in one dedicated data center.
Otherwise it does not sound easy.

The worst part is that that there is nothing special about 2**71.
Or about 2**80 for that matter. Tim Reintch already said that better
than I can. The machine is not merely useless, it is pointless.











[toc] | [prev] | [next] | [standalone]


#115016

FromMichael S <already5chosen@yahoo.com>
Date2026-02-17 15:41 +0200
Message-ID<20260217154123.00000257@yahoo.com>
In reply to#115005
On Mon, 16 Feb 2026 12:31:31 +0200
Michael S <already5chosen@yahoo.com> wrote:

> On Sun, 15 Feb 2026 19:19:16 -0000 (UTC)
> Thomas Koenig <tkoenig@netcologne.de> wrote:
> 
> > Michael S <already5chosen@yahoo.com> schrieb:
> >   
> > > "Unknown" is a correct theoretically answer, but I can provide a
> > > very educated guess that on average it would take ~500 steps and
> > > the worst case would not exceed 2000 steps.
> > > So, bad news are that overall it will take more than a million
> > > years on your machine. And good news are that it'll take less than
> > > 2 million years.    
> > 
> > Not as bad, very many simplifications are possible.
> > 
> > For one, you can discount all even numbers as starting values.  
> 
> O.k.
> 
> > You can go further: If you look at residues modulo 2^n of your
> > starting numbers, there are only A076227(n) "interesting" remaiders
> > that you need to look at.  
> 
> Can not say that I understood.
> 
> > If you take n=39, only 3611535862 cases
> > need to be looked at, so only 3611535862/2^39 of all cases need to
> > be looked at, a reduction of a factor around 150.  You can also
> > do some more reductions (only 2/3 of those numbers need checking
> > if you can calculate the remainder by 3 fast).  There are different
> > tradeoffs for different n, of course (a factor of 16 for n=10,
> > for example).
> >   
> 

Hopefully now I figured out what you are talking about, except of 2/3
part and apart from the meaning of A076227(n).
I experimented a little with these ideas and it seems that reduction in
number of steps is nowhere near the reduction in number of tested cases.
My test vectors were 2**32 consecutive points starting at 2**52 or
2**53. I tried different modulo filters in range from 2**5 to 2**31.
Two modes of filtering were used. 
In "simple" mode I just filtered out "non-interesting" inputs, but
did nothing special about "interesting" inputs.
In "advanced" mode I accelerated processing of "interesting" inputs by
means of starting their testing from point x0= (x/M)*A[x%M]+B[x%M],
where tables A[] and B[] were pre-calculated during filter build step.
Obviously, "advanced" mode filtering requires ~3 times more memory plus
multiplier + plus some logic.

Here are my results:
$ ./nsteps2 4503599627370496 0x100000000
Initial value = 4503599627370496 (0010000000000000). 
#values = 4294967296 (0000000100000000)
Unfiltered  : 22501638882 steps. 4294967296 tests.  5.239 steps/iter
n*4+3       : 17132929762 steps. 1073741824 tests. 15.956 steps/iter
n*2**5+m    : 15924970210 steps.  939524096 tests. 16.950 steps/iter
n*2**5+m  * :  8408777442 steps.  939524096 tests.  8.950 steps/iter
n*2**10+m   : 11948770018 steps.  503316480 tests. 23.740 steps/iter
n*2**10+m * :  7088288929 steps.  503316480 tests. 14.083 steps/iter
n*2**15+m   :  9003713250 steps.  289013760 tests. 31.153 steps/iter
n*2**15+m * :  4818894105 steps.  289013760 tests. 16.674 steps/iter
n*2**20+m   :  6973854434 steps.  181379072 tests. 38.449 steps/iter
n*2**20+m * :  3496505674 steps.  181379072 tests. 19.277 steps/iter
n*2**25+m   :  5574414306 steps.  125809408 tests. 44.308 steps/iter
n*2**25+m * :  2543845543 steps.  125809408 tests. 20.220 steps/iter
n*2**29+m   :  4660796778 steps.   95710352 tests. 48.697 steps/iter
n*2**29+m * :  2008917951 steps.   95710352 tests. 20.990 steps/iter
n*2**30+m   :  4493358006 steps.   89757100 tests. 50.061 steps/iter
n*2**30+m * :  1915335033 steps.   89757100 tests. 21.339 steps/iter
n*2**31+m   :  4212222570 steps.   81419608 tests. 51.735 steps/iter
n*2**31+m * :  1780554493 steps.   81419608 tests. 21.869 steps/iter

$ ./nsteps2 0x20000000000000 0x100000000
Initial value = 9007199254740992 (0020000000000000). 
#values = 4294967296 (0000000100000000)
Unfiltered  : 22501540484 steps. 4294967296 tests.  5.239 steps/iter
n*4+3       : 17132831364 steps. 1073741824 tests. 15.956 steps/iter
n*2**5+m    : 15924871812 steps.  939524096 tests. 16.950 steps/iter
n*2**5+m  * :  8408679044 steps.  939524096 tests.  8.950 steps/iter
n*2**10+m   : 11948671620 steps.  503316480 tests. 23.740 steps/iter
n*2**10+m * :  7088271089 steps.  503316480 tests. 14.083 steps/iter
n*2**15+m   :  9003614852 steps.  289013760 tests. 31.153 steps/iter
n*2**15+m * :  4818766141 steps.  289013760 tests. 16.673 steps/iter
n*2**20+m   :  6973756036 steps.  181379072 tests. 38.449 steps/iter
n*2**20+m * :  3496390679 steps.  181379072 tests. 19.277 steps/iter
n*2**25+m   :  5574315908 steps.  125809408 tests. 44.308 steps/iter
n*2**25+m * :  2543741535 steps.  125809408 tests. 20.219 steps/iter
n*2**29+m   :  4660698380 steps.   95710352 tests. 48.696 steps/iter
n*2**29+m * :  2008935493 steps.   95710352 tests. 20.990 steps/iter
n*2**30+m   :  4493259608 steps.   89757100 tests. 50.060 steps/iter
n*2**30+m * :  1915241920 steps.   89757100 tests. 21.338 steps/iter
n*2**31+m   :  4212124172 steps.   81419608 tests. 51.734 steps/iter
n*2**31+m * :  1780581916 steps.   81419608 tests. 21.869 steps/iter
'Advanced' mode marked with '*'

As you can see, 2**10 filter reduces # of processed points by factor of
8.5, but number of processing steps reduced only 1.9x in simple mode
and 3.2x in advanced mode.
For 2**20 filter ratios are, respectively, 23.7, 3.2 and 6.4
For 2**31 filter, 52.7, 5.3 and 12.6

Pay attention that for implementation in FPGA even filtering with
modulo 2**29 (~12M entries) is impractical. Assuming advanced mode, 12M
entries occupy 1000 Mbits. In non-expensive FPGA it simply does not
fit. In very expensive FPGA it fits, but barely. 

Let's take, for example, Altera Stratix-10 GX 2800. By the standard of
previous decade, it is a real behemoth with 2.75M Logic Elements (LEs).
Assuming that each tester unit occupies 500 LEs there is room for 5500
units. Looking into my tables above, you want 1 filter per ~20 tester
units, so 275 filters needed. And how many available? Just two!

For modulo 2**20 it's a little better. Each filter consists of ~45,000
entries and occupies ~2.7 Mbits. So you can fit 80+ filters in GX 2800.
But you want 5500 / 19 = 290, so still unbalanced.

For modulo 2**15 (2205 entries, ~100 Kbits) there is enough memory to
fit desired number of replica. But small filter like that is
significantly less effective, reducing # of steps (in advanced mode)
only by 4.7x.

It is possible that there exist smarter algorithms for partition of
work between parallel tester units that can greatly reduce need for
access to filtering tables relatively to simple scheme that I assumed
in above estimates. But I think that they point stands - this sort of
filtering is not a silver bullet. It can improve throughput, but it's
unlikely that it can improve it by more than one order of magnitude.


 





[toc] | [prev] | [next] | [standalone]


#115017

FromMichael S <already5chosen@yahoo.com>
Date2026-02-17 18:17 +0200
Message-ID<20260217181710.00007342@yahoo.com>
In reply to#115016
On Tue, 17 Feb 2026 15:41:23 +0200
Michael S <already5chosen@yahoo.com> wrote:

> On Mon, 16 Feb 2026 12:31:31 +0200
> Michael S <already5chosen@yahoo.com> wrote:
> 
> > On Sun, 15 Feb 2026 19:19:16 -0000 (UTC)
> > Thomas Koenig <tkoenig@netcologne.de> wrote:
> >   
> > > Michael S <already5chosen@yahoo.com> schrieb:
> > >     
> > > > "Unknown" is a correct theoretically answer, but I can provide a
> > > > very educated guess that on average it would take ~500 steps and
> > > > the worst case would not exceed 2000 steps.
> > > > So, bad news are that overall it will take more than a million
> > > > years on your machine. And good news are that it'll take less
> > > > than 2 million years.      
> > > 
> > > Not as bad, very many simplifications are possible.
> > > 
> > > For one, you can discount all even numbers as starting values.    
> > 
> > O.k.
> >   
> > > You can go further: If you look at residues modulo 2^n of your
> > > starting numbers, there are only A076227(n) "interesting"
> > > remaiders that you need to look at.    
> > 
> > Can not say that I understood.
> >   
> > > If you take n=39, only 3611535862 cases
> > > need to be looked at, so only 3611535862/2^39 of all cases need to
> > > be looked at, a reduction of a factor around 150.  You can also
> > > do some more reductions (only 2/3 of those numbers need checking
> > > if you can calculate the remainder by 3 fast).  There are
> > > different tradeoffs for different n, of course (a factor of 16
> > > for n=10, for example).
> > >     
> >   
> 
> Hopefully now I figured out what you are talking about, except of 2/3
> part and apart from the meaning of A076227(n).
> I experimented a little with these ideas and it seems that reduction
> in number of steps is nowhere near the reduction in number of tested
> cases. My test vectors were 2**32 consecutive points starting at
> 2**52 or 2**53. I tried different modulo filters in range from 2**5
> to 2**31. Two modes of filtering were used. 
> In "simple" mode I just filtered out "non-interesting" inputs, but
> did nothing special about "interesting" inputs.
> In "advanced" mode I accelerated processing of "interesting" inputs by
> means of starting their testing from point x0= (x/M)*A[x%M]+B[x%M],
> where tables A[] and B[] were pre-calculated during filter build step.
> Obviously, "advanced" mode filtering requires ~3 times more memory
> plus multiplier + plus some logic.
> 
> Here are my results:
> $ ./nsteps2 4503599627370496 0x100000000
> Initial value = 4503599627370496 (0010000000000000). 
> #values = 4294967296 (0000000100000000)
> Unfiltered  : 22501638882 steps. 4294967296 tests.  5.239 steps/iter
> n*4+3       : 17132929762 steps. 1073741824 tests. 15.956 steps/iter
> n*2**5+m    : 15924970210 steps.  939524096 tests. 16.950 steps/iter
> n*2**5+m  * :  8408777442 steps.  939524096 tests.  8.950 steps/iter
> n*2**10+m   : 11948770018 steps.  503316480 tests. 23.740 steps/iter
> n*2**10+m * :  7088288929 steps.  503316480 tests. 14.083 steps/iter
> n*2**15+m   :  9003713250 steps.  289013760 tests. 31.153 steps/iter
> n*2**15+m * :  4818894105 steps.  289013760 tests. 16.674 steps/iter
> n*2**20+m   :  6973854434 steps.  181379072 tests. 38.449 steps/iter
> n*2**20+m * :  3496505674 steps.  181379072 tests. 19.277 steps/iter
> n*2**25+m   :  5574414306 steps.  125809408 tests. 44.308 steps/iter
> n*2**25+m * :  2543845543 steps.  125809408 tests. 20.220 steps/iter
> n*2**29+m   :  4660796778 steps.   95710352 tests. 48.697 steps/iter
> n*2**29+m * :  2008917951 steps.   95710352 tests. 20.990 steps/iter
> n*2**30+m   :  4493358006 steps.   89757100 tests. 50.061 steps/iter
> n*2**30+m * :  1915335033 steps.   89757100 tests. 21.339 steps/iter
> n*2**31+m   :  4212222570 steps.   81419608 tests. 51.735 steps/iter
> n*2**31+m * :  1780554493 steps.   81419608 tests. 21.869 steps/iter
> 
> $ ./nsteps2 0x20000000000000 0x100000000
> Initial value = 9007199254740992 (0020000000000000). 
> #values = 4294967296 (0000000100000000)
> Unfiltered  : 22501540484 steps. 4294967296 tests.  5.239 steps/iter
> n*4+3       : 17132831364 steps. 1073741824 tests. 15.956 steps/iter
> n*2**5+m    : 15924871812 steps.  939524096 tests. 16.950 steps/iter
> n*2**5+m  * :  8408679044 steps.  939524096 tests.  8.950 steps/iter
> n*2**10+m   : 11948671620 steps.  503316480 tests. 23.740 steps/iter
> n*2**10+m * :  7088271089 steps.  503316480 tests. 14.083 steps/iter
> n*2**15+m   :  9003614852 steps.  289013760 tests. 31.153 steps/iter
> n*2**15+m * :  4818766141 steps.  289013760 tests. 16.673 steps/iter
> n*2**20+m   :  6973756036 steps.  181379072 tests. 38.449 steps/iter
> n*2**20+m * :  3496390679 steps.  181379072 tests. 19.277 steps/iter
> n*2**25+m   :  5574315908 steps.  125809408 tests. 44.308 steps/iter
> n*2**25+m * :  2543741535 steps.  125809408 tests. 20.219 steps/iter
> n*2**29+m   :  4660698380 steps.   95710352 tests. 48.696 steps/iter
> n*2**29+m * :  2008935493 steps.   95710352 tests. 20.990 steps/iter
> n*2**30+m   :  4493259608 steps.   89757100 tests. 50.060 steps/iter
> n*2**30+m * :  1915241920 steps.   89757100 tests. 21.338 steps/iter
> n*2**31+m   :  4212124172 steps.   81419608 tests. 51.734 steps/iter
> n*2**31+m * :  1780581916 steps.   81419608 tests. 21.869 steps/iter
> 'Advanced' mode marked with '*'
> 
> As you can see, 2**10 filter reduces # of processed points by factor
> of 8.5, but number of processing steps reduced only 1.9x in simple
> mode and 3.2x in advanced mode.
> For 2**20 filter ratios are, respectively, 23.7, 3.2 and 6.4
> For 2**31 filter, 52.7, 5.3 and 12.6
> 


There was bug in my testing code. In reality filtering is somewhat more
efficient. Here are corrected results:

Initial value = 4503599627370496 (0010000000000000).
#values = 4294967296 (0000000100000000)
Unfiltered  : 22501638882 steps. 4294967296 tests.  5.239 steps/iter
n*4+3       : 17132929762 steps. 1073741824 tests. 15.956 steps/iter
n*2**5+m    : 13374833378 steps.  536870912 tests. 24.913 steps/iter
n*2**5+m  * :  8408777442 steps.  536870912 tests. 15.663 steps/iter
n*2**10+m   :  9935504098 steps.  268435456 tests. 37.013 steps/iter
n*2**10+m * :  5187551970 steps.  268435456 tests. 19.325 steps/iter
n*2**15+m   :  7857750754 steps.  169738240 tests. 46.293 steps/iter
n*2**15+m * :  3449537250 steps.  169738240 tests. 20.323 steps/iter
n*2**20+m   :  6242468578 steps.  111935488 tests. 55.768 steps/iter
n*2**20+m * :  2410906338 steps.  111935488 tests. 21.538 steps/iter
n*2**25+m   :  4838650338 steps.   73364736 tests. 65.953 steps/iter
n*2**25+m * :  1716934242 steps.   73364736 tests. 23.403 steps/iter
n*2**29+m   :  3856366938 steps.   51085096 tests. 75.489 steps/iter
n*2**29+m * :  1338119106 steps.   51085096 tests. 26.194 steps/iter
n*2**30+m   :  3856366938 steps.   51085096 tests. 75.489 steps/iter
n*2**30+m * :  1261491462 steps.   51085096 tests. 24.694 steps/iter
n*2**31+m   :  3666319938 steps.   47284156 tests. 77.538 steps/iter
n*2**31+m * :  1184863818 steps.   47284156 tests. 25.058 steps/iter

Initial value = 9007199254740992 (0020000000000000).
#values = 4294967296 (0000000100000000)
Unfiltered  : 22501540484 steps. 4294967296 tests.  5.239 steps/iter
n*4+3       : 17132831364 steps. 1073741824 tests. 15.956 steps/iter
n*2**5+m    : 13374734980 steps.  536870912 tests. 24.912 steps/iter
n*2**5+m  * :  8408679044 steps.  536870912 tests. 15.662 steps/iter
n*2**10+m   :  9935405700 steps.  268435456 tests. 37.012 steps/iter
n*2**10+m * :  5187453572 steps.  268435456 tests. 19.325 steps/iter
n*2**15+m   :  7857652356 steps.  169738240 tests. 46.293 steps/iter
n*2**15+m * :  3449438852 steps.  169738240 tests. 20.322 steps/iter
n*2**20+m   :  6242370180 steps.  111935488 tests. 55.768 steps/iter
n*2**20+m * :  2410807940 steps.  111935488 tests. 21.537 steps/iter
n*2**25+m   :  4838551940 steps.   73364736 tests. 65.952 steps/iter
n*2**25+m * :  1716835844 steps.   73364736 tests. 23.401 steps/iter
n*2**29+m   :  3856268540 steps.   51085096 tests. 75.487 steps/iter
n*2**29+m * :  1338020708 steps.   51085096 tests. 26.192 steps/iter
n*2**30+m   :  3856268540 steps.   51085096 tests. 75.487 steps/iter
n*2**30+m * :  1261393064 steps.   51085096 tests. 24.692 steps/iter
n*2**31+m   :  3666221540 steps.   47284156 tests. 77.536 steps/iter
n*2**31+m * :  1184765420 steps.   47284156 tests. 25.056 steps/iter

2**10 filter reduces # of processed points by factor of 16.0, but 
number of processing steps reduced only 2.7x in simple mode and 4.3x
in advanced mode.
For 2**20 filter ratios are, respectively, 38.4, 3.6 and 9.3
For 2**31 filter: 90.8, 6.1 and 19.0


> Pay attention that for implementation in FPGA even filtering with
> modulo 2**29 (~12M entries) is impractical. Assuming advanced mode,
> 12M entries occupy 1000 Mbits. In non-expensive FPGA it simply does
> not fit. In very expensive FPGA it fits, but barely. 
> 
> Let's take, for example, Altera Stratix-10 GX 2800. By the standard of
> previous decade, it is a real behemoth with 2.75M Logic Elements
> (LEs). Assuming that each tester unit occupies 500 LEs there is room
> for 5500 units. Looking into my tables above, you want 1 filter per
> ~20 tester units, so 275 filters needed. And how many available? Just
> two!
> 

And here I made even bigger mistake. The size of internal memory
of GX 2800 is 229 Mbit. So 2**29 filter does not fit at all.

> For modulo 2**20 it's a little better. Each filter consists of ~45,000
> entries and occupies ~2.7 Mbits. So you can fit 80+ filters in GX
> 2800. But you want 5500 / 19 = 290, so still unbalanced.
> 

Here correct numbers are as folowing:
Each filter consists of 27,328 entries and occupies ~1.7 Mbits. So you
can fit 130+ filters in GX 2800. But you want 5500 / 19 = 290, so still
unbalanced.

> For modulo 2**15 (2205 entries, ~100 Kbits) there is enough memory to
> fit desired number of replica. But small filter like that is
> significantly less effective, reducing # of steps (in advanced mode)
> only by 4.7x.
> 

Here it should be:
For modulo 2**15 (1295 entries, ~60 Kbits) there is enough memory to
fit desired number of replica. But small filter like that is
significantly less effective, reducing # of steps (in advanced mode)
only by 6.5x.

> It is possible that there exist smarter algorithms for partition of
> work between parallel tester units that can greatly reduce need for
> access to filtering tables relatively to simple scheme that I assumed
> in above estimates. But I think that they point stands - this sort of
> filtering is not a silver bullet. It can improve throughput, but it's
> unlikely that it can improve it by more than one order of magnitude.
> 

Despite corrected mistakes my conclusions are the same.

[toc] | [prev] | [next] | [standalone]


#115003

Fromanton@mips.complang.tuwien.ac.at (Anton Ertl)
Date2026-02-15 22:41 +0000
Message-ID<2026Feb15.234140@mips.complang.tuwien.ac.at>
In reply to#114982
Robert Finch <robfi680@gmail.com> writes:
>I wonder how difficult it would be to implement a parallel tester in an 
>FPGA. It looks simple enough and there are hundreds of DSP blocks 
>available to use. Could test in blocks of hundreds of numbers at a time.
>Running at 100 MHz * 200 tests at a time = how long to get to 2^71?

OTOH, with a PC with 8 cores running at 4GHZ, even with scalar
operations you might outperform that.  And with SIMD probably even
more.  The main problem is that support for 128-bit integers is not so
great for scalar instructions and worse for SIMD instructions.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

[toc] | [prev] | [next] | [standalone]


#114988

FromTerje Mathisen <terje.mathisen@tmsw.no>
Date2026-02-15 14:52 +0100
Message-ID<10msj2v$1va9$1@dont-email.me>
In reply to#114964
Thomas Koenig wrote:
> Some people are throwing large amounts of computing power at trying
> to disprove the Collatz conjecture.  Graphics cards have sped this
> up enormously.
> 
> (A reminder, if any is needed:  From a starting number n, the
> transformation, recursively applied,
> 
> 
>     f(n) = 3*n+1, if n is odd; =n/2 if n is even.
> 
> is conjectured to always reach 1 for any positive integer 1).
> 
> All the work is done on general-purpose hardware, which has many
> capabilities that are not needed, and consume area and power that
> special-purpose hardware would not need.  Also, the hardware
> is limited to 64-bit integers, and the range of tested numbers
> is now up to 2^71, so
> 
> What could specialized hardware look like?
> 
> The central part could be a unit which is handed a block of numbers
> to check, for example a range of 2^32 numbers.  The current records
> for number of iterations and maximum number reached would also be
> stored in that unit.
> 
> In that unit, the calculaiton would be performed by a combination
> of adder and shifter.  The input would always be odd (so the last
> bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1).
> 
> The adder would compute (3*n+1)/2, and would be specialzed 128-bit
> adder.  Why specialized?  The logic is far less complex than a
> general-purpose adder.  For example, a two-bit adder has the formula
> 
> y1 = (!a1&a0) | (!c_in&a1&!a0);
> y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0);
> 
> and its propagate and generate bits are
> 
> p = (a1);
> g = (a1&a0);
> 
> The maximum number of trailing zeros after the (3*n+1)/2 step is
> three and can be determined cheaply from bits <3:1> of the input,
> so the input to the shifter (two levels only) can be provided
> in parallel.  The machine would be designed that adding and
> shifting can be done in a single cycle.
> 
> The number of iterations would be tracked; if larger than a prevous
> record, the calculation would terminate with a corresponding signal
> to the outside world.
> 
> Also, each cycle, the result from the previous round would be
> compared against the input value.  If it is lower, the next number
> would be chosen.  Finally, the number reached would be checked against
> the previous, pre-set maximum, with would also signal to the outside
> world.
> 
> Which numbers to chose?  It is well known that only certain
> numbers with remainders modulo 2^n are interesting.  The number
> can be seen on OEIS A076227.  For example, modulo 2^10, only 64
> numbers are "interesting", or 6.25% (but only eight bits would
> be needed to store them because the last two are always one).
> Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the
> amount of numbers to be checked by half.  One would use a small ROM,
> I presume. The number of bits would be a tradeoff between area and
> amount of compuation. My current guess would be that 2^10 could
> be a sweet spot.  One third of the numbers would further be
> excluded by checking for divisiblity by three.
> 
> The problem is embarassingly parallel.  One could chose a size so
> each parcel is calculated in a few seconds or even minutes. to avoid
> having to avoid a large amount of communication with the outside.
> 
> Because all units would be working all the time, and if one has
> a smaller unit one could simply build more of them, I suspect that
> power would be the main limiting factor.  An adder design that
> might not be very fast, but uses lower power, low voltage and low
> frequency would be beneficial.
> 
> Comments?
> 
I wrote a 3np1.c program a long time ago, with manual handling of the 
64-128 bit boundary.

It used lookup tables with multiplication, addition and shift values to 
be applied, since the bottom N bits directly control what will happen on 
the next iterations.

Worst case, if the bottom 8 bits are zero, then we'd do 8 times SHR 1, 
while at the other extreme we end up with 8 iterations of n+(n+1)/2, so 
approximately 1.5^8 which is around a factor of 20 growth over 16 
iterations, right?

I found my old code because I recently wrote a much simpler rust version 
of the same algorithm, this one was not branchless and stopped as soon 
as it found a number needing at least 64 bits.

As I worked on it I had this very strong deja vue feeling. :-)

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]


#114995

FromThomas Koenig <tkoenig@netcologne.de>
Date2026-02-15 16:45 +0000
Message-ID<10mst7j$5pef$1@dont-email.me>
In reply to#114988
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

> I wrote a 3np1.c program a long time ago, with manual handling of the 
> 64-128 bit boundary.

I believe a very large number of people have tried their hand at this
:-)

> It used lookup tables with multiplication, addition and shift values to 
> be applied, since the bottom N bits directly control what will happen on 
> the next iterations.
>
> Worst case, if the bottom 8 bits are zero,

There was an error in my previous post; I thought that the maximum
numbre of shifts was limited.  This is not true, because 3*n+1 can
be 2^n (obviously).  So, the number is not necessarily limited, not be 3
(as I thought) but also not by 8.

However, these numbers would occur quite rarely, so it could make sense
to still limit the shifts to three, but just bypass the adder if
the lowst number is still zero.

> then we'd do 8 times SHR 1, 
> while at the other extreme we end up with 8 iterations of n+(n+1)/2, so 
> approximately 1.5^8 which is around a factor of 20 growth over 16 
> iterations, right?

But that number is equal to the number of trailing ones in binary,
which can be larger than eight.

[...]

-- 
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

[toc] | [prev] | [next] | [standalone]


#114997

FromTerje Mathisen <terje.mathisen@tmsw.no>
Date2026-02-15 18:01 +0100
Message-ID<10msu5t$65om$1@dont-email.me>
In reply to#114995
Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> 
>> I wrote a 3np1.c program a long time ago, with manual handling of the
>> 64-128 bit boundary.
> 
> I believe a very large number of people have tried their hand at this
> :-)
> 
>> It used lookup tables with multiplication, addition and shift values to
>> be applied, since the bottom N bits directly control what will happen on
>> the next iterations.
>>
>> Worst case, if the bottom 8 bits are zero,
> 
> There was an error in my previous post; I thought that the maximum
> numbre of shifts was limited.  This is not true, because 3*n+1 can
> be 2^n (obviously).  So, the number is not necessarily limited, not be 3
> (as I thought) but also not by 8.
> 
> However, these numbers would occur quite rarely, so it could make sense
> to still limit the shifts to three, but just bypass the adder if
> the lowst number is still zero.
> 
>> then we'd do 8 times SHR 1,
>> while at the other extreme we end up with 8 iterations of n+(n+1)/2, so
>> approximately 1.5^8 which is around a factor of 20 growth over 16
>> iterations, right?
> 
> But that number is equal to the number of trailing ones in binary,
> which can be larger than eight.

Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR

Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs 
and a SHRD/SHR, so maybe 8-10 cycles?

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]


#115000

FromThomas Koenig <tkoenig@netcologne.de>
Date2026-02-15 19:42 +0000
Message-ID<10mt7jn$996a$2@dont-email.me>
In reply to#114997
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

> Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR
>
> Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs 
> and a SHRD/SHR, so maybe 8-10 cycles?

https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
an improved algorithm, which they are using for the supercomputer
version.  They are doing:

n = n0
repeat
  n = n + 1
  alpha = ctz(n)
  n = n * 3^alpha / 2^alpha
  n = n - 1
  beta = ctz(n)
  n = n / 2^beta
until n < n_0

They are also playing some strange tricks with parallel solving that
are not described clearly, or I am not clever enough to understand
what they are doing. I would have to look at their code, I guess.

-- 
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

[toc] | [prev] | [next] | [standalone]


#115002

FromTerje Mathisen <terje.mathisen@tmsw.no>
Date2026-02-15 22:59 +0100
Message-ID<10mtfjo$ck6h$1@dont-email.me>
In reply to#115000
Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> 
>> Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR
>>
>> Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs
>> and a SHRD/SHR, so maybe 8-10 cycles?
> 
> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
> an improved algorithm, which they are using for the supercomputer
> version.  They are doing:
> 
> n = n0
> repeat
>    n = n + 1
>    alpha = ctz(n)
>    n = n * 3^alpha / 2^alpha

Strange syntax...

ctz(n) must count trailing zeroes, right?
If so, then ctz(n+1) will return 1 or more for odd n, so that gives 
3n+3, which after the division by 2 returns a value which is one too high.

>    n = n - 1

but this is corrected here!

The key must be that the multiplication by a power of three also works 
for n with multiple trailing 1 bits.

The "x/2^alpha" part is of course just x>>alpha.

For an even n they still do a multiplication by 1 and a SHR 0, but the 
code is branchless.

>    beta = ctz(n)
>    n = n / 2^beta
> until n < n_0

I am guessing that the best way to do the n*3^alpha parts is a table of 
powers of 3, so

let mut n = n0;
loop {
   n += 1;
   let alpha = ctz(n);
   n = (n * power_of_three[alpha]) >> alpha;
   n -= 1;
   beta = ctz(n);
   n >>= beta;
   if n <= n_0 {break};
}

> 
> They are also playing some strange tricks with parallel solving that
> are not described clearly, or I am not clever enough to understand
> what they are doing. I would have to look at their code, I guess.
> 
:-)

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]


#115004

FromRobert Finch <robfi680@gmail.com>
Date2026-02-15 21:53 -0500
Message-ID<10mu0qp$hqfc$1@dont-email.me>
In reply to#115002
On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
> Thomas Koenig wrote:
>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>
>>> Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR
>>>
>>> Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs
>>> and a SHRD/SHR, so maybe 8-10 cycles?
>>
>> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
>> an improved algorithm, which they are using for the supercomputer
>> version.  They are doing:
>>
>> n = n0
>> repeat
>>    n = n + 1
>>    alpha = ctz(n)
>>    n = n * 3^alpha / 2^alpha
> 
> Strange syntax...
> 
> ctz(n) must count trailing zeroes, right?
> If so, then ctz(n+1) will return 1 or more for odd n, so that gives 
> 3n+3, which after the division by 2 returns a value which is one too high.
> 
>>    n = n - 1
> 
> but this is corrected here!
> 
> The key must be that the multiplication by a power of three also works 
> for n with multiple trailing 1 bits.
> 
> The "x/2^alpha" part is of course just x>>alpha.
> 
> For an even n they still do a multiplication by 1 and a SHR 0, but the 
> code is branchless.
> 
>>    beta = ctz(n)
>>    n = n / 2^beta
>> until n < n_0
> 
> I am guessing that the best way to do the n*3^alpha parts is a table of 
> powers of 3, so
> 
> let mut n = n0;
> loop {
>    n += 1;
>    let alpha = ctz(n);
>    n = (n * power_of_three[alpha]) >> alpha;
>    n -= 1;
>    beta = ctz(n);
>    n >>= beta;
>    if n <= n_0 {break};
> }
> 
>>
>> They are also playing some strange tricks with parallel solving that
>> are not described clearly, or I am not clever enough to understand
>> what they are doing. I would have to look at their code, I guess.
>>
> :-)
> 
> Terje
> 
How many iterations are done in one loop? I think it would need to be at 
least 10 on average to beat the simple approach. The simple approach 
only taking one clock per iteration. The 128-bit multiply and 128-bit 
count trailing zeros are slow in an FPGA.

And, can the approach consume a more limited number of iterations? For 
example counting only the 6 LSB zeros?

I found a mistake in my code handling groups. It can only do about a 
24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4 
million years?



[toc] | [prev] | [next] | [standalone]


#115006

FromMichael S <already5chosen@yahoo.com>
Date2026-02-16 12:40 +0200
Message-ID<20260216124054.000063b6@yahoo.com>
In reply to#115004
On Sun, 15 Feb 2026 21:53:10 -0500
Robert Finch <robfi680@gmail.com> wrote:

> On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
> > Thomas Koenig wrote:  
> >> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> >>  
> >>> Just thinking about a way to replace ~N iterations by a MUL, ADD
> >>> and SHR
> >>>
> >>> Even with 128-bit numbers, I would only need two MULs, two
> >>> ADD/ADC pairs and a SHRD/SHR, so maybe 8-10 cycles?  
> >>
> >> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
> >> an improved algorithm, which they are using for the supercomputer
> >> version.  They are doing:
> >>
> >> n = n0
> >> repeat
> >>    n = n + 1
> >>    alpha = ctz(n)
> >>    n = n * 3^alpha / 2^alpha  
> > 
> > Strange syntax...
> > 
> > ctz(n) must count trailing zeroes, right?
> > If so, then ctz(n+1) will return 1 or more for odd n, so that gives 
> > 3n+3, which after the division by 2 returns a value which is one
> > too high. 
> >>    n = n - 1  
> > 
> > but this is corrected here!
> > 
> > The key must be that the multiplication by a power of three also
> > works for n with multiple trailing 1 bits.
> > 
> > The "x/2^alpha" part is of course just x>>alpha.
> > 
> > For an even n they still do a multiplication by 1 and a SHR 0, but
> > the code is branchless.
> >   
> >>    beta = ctz(n)
> >>    n = n / 2^beta
> >> until n < n_0  
> > 
> > I am guessing that the best way to do the n*3^alpha parts is a
> > table of powers of 3, so
> > 
> > let mut n = n0;
> > loop {
> >    n += 1;
> >    let alpha = ctz(n);
> >    n = (n * power_of_three[alpha]) >> alpha;
> >    n -= 1;
> >    beta = ctz(n);
> >    n >>= beta;
> >    if n <= n_0 {break};
> > }
> >   
> >>
> >> They are also playing some strange tricks with parallel solving
> >> that are not described clearly, or I am not clever enough to
> >> understand what they are doing. I would have to look at their
> >> code, I guess. 
> > :-)
> > 
> > Terje
> >   
> How many iterations are done in one loop?

Worst case - few thousands, or less. I didn't yet encounter 1000.
Mean - 6 or 7. May be, 8, but that's unlikely.

> I think it would need to be
> at least 10 on average to beat the simple approach. The simple
> approach only taking one clock per iteration. The 128-bit multiply
> and 128-bit count trailing zeros are slow in an FPGA.
> 
> And, can the approach consume a more limited number of iterations?
> For example counting only the 6 LSB zeros?
> 
> I found a mistake in my code handling groups. It can only do about a 
> 24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4 
> million years?
> 

Much less.
My 1st estimate was for more stupid loop that always counts down to 1.


[toc] | [prev] | [next] | [standalone]


#115010

FromMichael S <already5chosen@yahoo.com>
Date2026-02-16 20:34 +0200
Message-ID<20260216203444.00005b54@yahoo.com>
In reply to#115006
On Mon, 16 Feb 2026 12:40:54 +0200
Michael S <already5chosen@yahoo.com> wrote:

> On Sun, 15 Feb 2026 21:53:10 -0500
> Robert Finch <robfi680@gmail.com> wrote:
> 
> > On 2026-02-15 4:59 p.m., Terje Mathisen wrote:  
> > > Thomas Koenig wrote:    
> > >> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> > >>    
> > >>> Just thinking about a way to replace ~N iterations by a MUL, ADD
> > >>> and SHR
> > >>>
> > >>> Even with 128-bit numbers, I would only need two MULs, two
> > >>> ADD/ADC pairs and a SHRD/SHR, so maybe 8-10 cycles?    
> > >>
> > >> https://link.springer.com/article/10.1007/s11227-025-07337-0
> > >> gives an improved algorithm, which they are using for the
> > >> supercomputer version.  They are doing:
> > >>
> > >> n = n0
> > >> repeat
> > >>    n = n + 1
> > >>    alpha = ctz(n)
> > >>    n = n * 3^alpha / 2^alpha    
> > > 
> > > Strange syntax...
> > > 
> > > ctz(n) must count trailing zeroes, right?
> > > If so, then ctz(n+1) will return 1 or more for odd n, so that
> > > gives 3n+3, which after the division by 2 returns a value which
> > > is one too high.   
> > >>    n = n - 1    
> > > 
> > > but this is corrected here!
> > > 
> > > The key must be that the multiplication by a power of three also
> > > works for n with multiple trailing 1 bits.
> > > 
> > > The "x/2^alpha" part is of course just x>>alpha.
> > > 
> > > For an even n they still do a multiplication by 1 and a SHR 0, but
> > > the code is branchless.
> > >     
> > >>    beta = ctz(n)
> > >>    n = n / 2^beta
> > >> until n < n_0    
> > > 
> > > I am guessing that the best way to do the n*3^alpha parts is a
> > > table of powers of 3, so
> > > 
> > > let mut n = n0;
> > > loop {
> > >    n += 1;
> > >    let alpha = ctz(n);
> > >    n = (n * power_of_three[alpha]) >> alpha;
> > >    n -= 1;
> > >    beta = ctz(n);
> > >    n >>= beta;
> > >    if n <= n_0 {break};
> > > }
> > >     
> > >>
> > >> They are also playing some strange tricks with parallel solving
> > >> that are not described clearly, or I am not clever enough to
> > >> understand what they are doing. I would have to look at their
> > >> code, I guess.   
> > > :-)
> > > 
> > > Terje
> > >     
> > How many iterations are done in one loop?  
> 
> Worst case - few thousands, or less. I didn't yet encounter 1000.
> Mean - 6 or 7. May be, 8, but that's unlikely.
> 

The above mean value is for completely non-filtered inputs. 
If we apply a very basic filtering, only looking at inputs in form 4*x+3
then an average # of steps is bigger - somewhere in range [15:20].

So, for filtered inputs higher-order steps start to make sense. 
Or not. It strongly depends on specific FPGA model.

> > I think it would need to be
> > at least 10 on average to beat the simple approach. The simple
> > approach only taking one clock per iteration. The 128-bit multiply
> > and 128-bit count trailing zeros are slow in an FPGA.
> > 
> > And, can the approach consume a more limited number of iterations?
> > For example counting only the 6 LSB zeros?
> > 
> > I found a mistake in my code handling groups. It can only do about
> > a 24x24=576 matrix of calculations in parallel @ 200MHz. So that's
> > 1/4 million years?
> >   
> 
> Much less.
> My 1st estimate was for more stupid loop that always counts down to 1.
> 
> 
> 

[toc] | [prev] | [next] | [standalone]


#115007

FromTerje Mathisen <terje.mathisen@tmsw.no>
Date2026-02-16 16:08 +0100
Message-ID<10mvbtk$vipb$1@dont-email.me>
In reply to#115004
Robert Finch wrote:
> On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
>> Thomas Koenig wrote:
>>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>>
>>>> Just thinking about a way to replace ~N iterations by a MUL, ADD and 
>>>> SHR
>>>>
>>>> Even with 128-bit numbers, I would only need two MULs, two ADD/ADC 
>>>> pairs
>>>> and a SHRD/SHR, so maybe 8-10 cycles?
>>>
>>> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
>>> an improved algorithm, which they are using for the supercomputer
>>> version.  They are doing:
>>>
>>> n = n0
>>> repeat
>>>    n = n + 1
>>>    alpha = ctz(n)
>>>    n = n * 3^alpha / 2^alpha
>>
>> Strange syntax...
>>
>> ctz(n) must count trailing zeroes, right?
>> If so, then ctz(n+1) will return 1 or more for odd n, so that gives 
>> 3n+3, which after the division by 2 returns a value which is one too 
>> high.
>>
>>>    n = n - 1
>>
>> but this is corrected here!
>>
>> The key must be that the multiplication by a power of three also works 
>> for n with multiple trailing 1 bits.
>>
>> The "x/2^alpha" part is of course just x>>alpha.
>>
>> For an even n they still do a multiplication by 1 and a SHR 0, but the 
>> code is branchless.
>>
>>>    beta = ctz(n)
>>>    n = n / 2^beta
>>> until n < n_0
>>
>> I am guessing that the best way to do the n*3^alpha parts is a table 
>> of powers of 3, so
>>
>> let mut n = n0;
>> loop {
>>  Â  n += 1;
>>  Â  let alpha = ctz(n);
>>  Â  n = (n * power_of_three[alpha]) >> alpha;
>>  Â  n -= 1;
>>  Â  beta = ctz(n);
>>  Â  n >>= beta;
>>  Â  if n <= n_0 {break};
>> }
>>
>>>
>>> They are also playing some strange tricks with parallel solving that
>>> are not described clearly, or I am not clever enough to understand
>>> what they are doing. I would have to look at their code, I guess.
>>>
>> :-)
>>
>> Terje
>>
> How many iterations are done in one loop? I think it would need to be at 
> least 10 on average to beat the simple approach. The simple approach 
> only taking one clock per iteration. The 128-bit multiply and 128-bit 
> count trailing zeros are slow in an FPGA.

Right.

The first part will handle trailing strings of 1's, while the second 
part does the same for all trailing zeroes after that initial 
mul_by_power_of_three.
> 
> And, can the approach consume a more limited number of iterations? For 
> example counting only the 6 LSB zeros?

It should work directly with a max(ctz,N) wrapper.
> 
> I found a mistake in my code handling groups. It can only do about a 
> 24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4 
> million years?

FPGAs typically run at much, much lower frequencies than modern CPUs, so 
if the code above can average 3-5 iterations per loop, and do it in 
around 10 clock cycles, then the FPGA will be hard pressed to keep up?

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

[toc] | [prev] | [next] | [standalone]


#115008

FromStephen Fuld <sfuld@alumni.cmu.edu.invalid>
Date2026-02-16 07:50 -0800
Message-ID<10mvecq$10h03$1@dont-email.me>
In reply to#114964
On 2/14/2026 1:57 AM, Thomas Koenig wrote:
> Some people are throwing large amounts of computing power at trying
> to disprove the Collatz conjecture.  Graphics cards have sped this
> up enormously.

big snip

> 
> The number of iterations would be tracked; if larger than a prevous
> record, the calculation would terminate with a corresponding signal
> to the outside world.
> 
> Also, each cycle, the result from the previous round would be
> compared against the input value.  If it is lower, the next number
> would be chosen.  Finally, the number reached would be checked against
> the previous, pre-set maximum, with would also signal to the outside
> world.

Hold these thoughts

snip

> The problem is embarassingly parallel. 


Sort of.  You could do it that way, but you would be giving up some 
performance.  Specifically, as you say, you want to check the number 
against the previous input value and the preset maximum.  But in a 
parallel implementation, you want to check against the largest value 
checked *across all the parallel threads*.  In order to do that, the 
threads need to communicate, thus violating the "embarrassingly 
parallel" rule.  I don't know how much performance you would give up so 
it may not make much difference, nor do I know the cost of keeping some 
"global" largest value checked so far.  There is a tradeoff here.





-- 
  - Stephen Fuld
(e-mail address disguised to prevent spam)

[toc] | [prev] | [next] | [standalone]


#115009

FromMichael S <already5chosen@yahoo.com>
Date2026-02-16 18:31 +0200
Message-ID<20260216183146.00007902@yahoo.com>
In reply to#115008
On Mon, 16 Feb 2026 07:50:49 -0800
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

> On 2/14/2026 1:57 AM, Thomas Koenig wrote:
> > Some people are throwing large amounts of computing power at trying
> > to disprove the Collatz conjecture.  Graphics cards have sped this
> > up enormously.  
> 
> big snip
> 
> > 
> > The number of iterations would be tracked; if larger than a prevous
> > record, the calculation would terminate with a corresponding signal
> > to the outside world.
> > 
> > Also, each cycle, the result from the previous round would be
> > compared against the input value.  If it is lower, the next number
> > would be chosen.  Finally, the number reached would be checked
> > against the previous, pre-set maximum, with would also signal to
> > the outside world.  
> 
> Hold these thoughts
> 
> snip
> 
> > The problem is embarassingly parallel.   
> 
> 
> Sort of.  You could do it that way, but you would be giving up some 
> performance.  Specifically, as you say, you want to check the number 
> against the previous input value and the preset maximum.  But in a 
> parallel implementation, you want to check against the largest value 
> checked *across all the parallel threads*.  In order to do that, the 
> threads need to communicate, thus violating the "embarrassingly 
> parallel" rule.  I don't know how much performance you would give up
> so it may not make much difference,

If initial values N0 are above 2**64 and the number of tests NP that run
in parallel is few millions or even few hundreds of millions then
performance gain achieved by comparison vs N0+NP instead of comparison
vs N0 is negligible.

> nor do I know the cost of keeping
> some "global" largest value checked so far.  There is a tradeoff here.
> 

[toc] | [prev] | [next] | [standalone]


Page 4 of 5 — ← Prev page 1 2 3 [4] 5  Next page →

Back to top | Article view | comp.arch


csiph-web