Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.arch > #114964 > unrolled thread
| Started by | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| First post | 2026-02-14 09:57 +0000 |
| Last post | 2026-02-18 21:56 +0100 |
| Articles | 20 on this page of 81 — 12 participants |
Back to article view | Back to comp.arch
A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-14 09:57 +0000
Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-14 18:48 +0000
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-14 20:18 +0100
Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-14 21:22 +0000
Re: A useless machine Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-02-14 13:30 -0800
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-15 12:50 +0100
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 13:48 -0600
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-21 21:40 +0100
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 15:17 -0600
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 04:50 -0600
Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:01 -0700
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-22 14:43 +0100
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 12:38 -0600
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-23 10:51 +0100
Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-26 01:27 +0000
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-26 09:21 +0100
Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:11 -0700
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 14:58 +0100
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 12:27 -0600
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-22 15:14 +0100
Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-22 11:41 -0500
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-23 10:53 +0100
Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-23 10:52 -0500
Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-02-14 21:19 -0800
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 03:58 -0500
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-15 13:14 +0100
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 15:46 +0000
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 10:56 -0500
Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-15 19:16 +0000
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 22:40 +0100
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-26 19:31 +0000
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-18 18:22 +0200
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-18 22:56 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-19 12:10 +0200
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-21 01:56 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-21 19:23 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-21 20:36 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 10:46 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-22 12:45 +0100
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 16:35 +0200
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-22 19:12 -0500
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-22 19:47 -0500
Re: A useless machine EricP <ThatWouldBeTelling@thevillage.com> - 2026-02-21 15:39 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 03:16 +0200
Re: A useless machine EricP <ThatWouldBeTelling@thevillage.com> - 2026-02-22 14:48 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 23:25 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-19 11:51 +0100
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-19 23:12 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-20 15:11 +0100
Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-19 15:27 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-20 11:41 +0200
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 18:41 -0600
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-23 11:10 +0000
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-23 17:24 -0600
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-24 06:44 +0000
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 01:38 -0600
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 03:16 -0600
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 04:52 -0600
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-24 11:16 +0000
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 13:19 -0600
Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:14 -0700
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-20 08:01 +0000
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-15 15:35 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-15 18:33 +0200
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 19:19 +0000
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 12:31 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-17 15:41 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-17 18:17 +0200
Re: A useless machine anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-02-15 22:41 +0000
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 14:52 +0100
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 16:45 +0000
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 18:01 +0100
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 19:42 +0000
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 22:59 +0100
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 21:53 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 12:40 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 20:34 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-16 16:08 +0100
Re: A useless machine Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-02-16 07:50 -0800
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 18:31 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-18 21:56 +0100
Page 4 of 5 — ← Prev page 1 2 3 [4] 5 Next page →
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2026-03-12 21:14 -0700 |
| Message-ID | <86h5qk74cs.fsf@linuxsc.com> |
| In reply to | #115122 |
Thomas Koenig <tkoenig@netcologne.de> writes: > BGB <cr88192@gmail.com> schrieb: > >> My own looking into this led me to realize that the number-space is >> modulo (relative to a cone with a spiral that repeats with every >> power-of-2), and that all the odd steps only ever increment the phase >> angle by 1, or fall down to a lower level (whenever the phase angle is >> not incremented). >> >> In effect, each odd step jumps to the opposite side of the cone, but >> slightly increments the angle. Since the phase-angle path can never move >> backwards, it can't touch any previous paths, thus, no loops can happen >> this way. >> >> >> The only way a cycle could exist is if (somehow) there were an >> impassable wall where pretty much *no* value could descend unless it >> hits the power-of-2 descent path, and then somehow miss this path as well. >> >> But, since "3*n+1" results in the minimum forward step, this means that >> one of the sides *has* to hit the power-of-2 descent cases at some point. > > The same argument would hold for the 3n-1 problem, where there is > more than one cycle. 3*n-1 = -(3*(-n)+1) So 3n-1 is just the same as 3n+1 on negative numbers.
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2026-02-20 08:01 +0000 |
| Message-ID | <10n94db$756i$1@dont-email.me> |
| In reply to | #115034 |
Michael S <already5chosen@yahoo.com> schrieb: > You should look for 3 sorts of mistakes > 1. Core itself. > The most likely mistake is failing to look for one of the ending > conditions. You have to look for 3 conditions: > a) - value produced at current step is smaller than initial value. > That is successful ending. Here you signal to the scheduler that you > are ready to accept the next value. > b) - overflow. Next value exceeds the width of register. > c) - timeout. You were running for more than N steps. N=8192 looks > like the reasonable value. Looking at the links in https://oeis.org/A006878 , anything above 2442 would be a new record, so 8192 would be plenty. -- This USENET posting was made without artificial intelligence, artificial impertinence, artificial arrogance, artificial stupidity, artificial flavorings or artificial colorants.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-15 15:35 +0200 |
| Message-ID | <20260215153526.00005240@yahoo.com> |
| In reply to | #114982 |
On Sun, 15 Feb 2026 03:58:28 -0500 Robert Finch <robfi680@gmail.com> wrote: > On 2026-02-14 4:57 a.m., Thomas Koenig wrote: > > Some people are throwing large amounts of computing power at trying > > to disprove the Collatz conjecture. Graphics cards have sped this > > up enormously. > > > > (A reminder, if any is needed: From a starting number n, the > > transformation, recursively applied, > > > > > > f(n) = 3*n+1, if n is odd; =n/2 if n is even. > > > > is conjectured to always reach 1 for any positive integer 1). > > > > All the work is done on general-purpose hardware, which has many > > capabilities that are not needed, and consume area and power that > > special-purpose hardware would not need. Also, the hardware > > is limited to 64-bit integers, and the range of tested numbers > > is now up to 2^71, so > > > > What could specialized hardware look like? > > > > The central part could be a unit which is handed a block of numbers > > to check, for example a range of 2^32 numbers. The current records > > for number of iterations and maximum number reached would also be > > stored in that unit. > > > > In that unit, the calculaiton would be performed by a combination > > of adder and shifter. The input would always be odd (so the last > > bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1). > > > > The adder would compute (3*n+1)/2, and would be specialzed 128-bit > > adder. Why specialized? The logic is far less complex than a > > general-purpose adder. For example, a two-bit adder has the formula > > > > y1 = (!a1&a0) | (!c_in&a1&!a0); > > y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0); > > > > and its propagate and generate bits are > > > > p = (a1); > > g = (a1&a0); > > > > The maximum number of trailing zeros after the (3*n+1)/2 step is > > three and can be determined cheaply from bits <3:1> of the input, > > so the input to the shifter (two levels only) can be provided > > in parallel. The machine would be designed that adding and > > shifting can be done in a single cycle. > > > > The number of iterations would be tracked; if larger than a prevous > > record, the calculation would terminate with a corresponding signal > > to the outside world. > > > > Also, each cycle, the result from the previous round would be > > compared against the input value. If it is lower, the next number > > would be chosen. Finally, the number reached would be checked > > against the previous, pre-set maximum, with would also signal to > > the outside world. > > > > Which numbers to chose? It is well known that only certain > > numbers with remainders modulo 2^n are interesting. The number > > can be seen on OEIS A076227. For example, modulo 2^10, only 64 > > numbers are "interesting", or 6.25% (but only eight bits would > > be needed to store them because the last two are always one). > > Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the > > amount of numbers to be checked by half. One would use a small ROM, > > I presume. The number of bits would be a tradeoff between area and > > amount of compuation. My current guess would be that 2^10 could > > be a sweet spot. One third of the numbers would further be > > excluded by checking for divisiblity by three. > > > > The problem is embarassingly parallel. One could chose a size so > > each parcel is calculated in a few seconds or even minutes. to avoid > > having to avoid a large amount of communication with the outside. > > > > Because all units would be working all the time, and if one has > > a smaller unit one could simply build more of them, I suspect that > > power would be the main limiting factor. An adder design that > > might not be very fast, but uses lower power, low voltage and low > > frequency would be beneficial. > > > > Comments? > > > I wonder how difficult it would be to implement a parallel tester in > an FPGA. It looks simple enough and there are hundreds of DSP blocks > available to use. Could test in blocks of hundreds of numbers at a > time. Running at 100 MHz * 200 tests at a time = how long to get to > 2^71? > I am not totally sure what exactly you are asking, but if you are asking about how much time it takes at that rate to do 2**71 iterations then the answer is ~3743 years. But you probably already figured it out yourself. Of course, brute force demonstration of correction of conjecture for all numbers in range 1 to 2**71 will take more steps than 2**71. And it is unknown how much more. > I wonder if the calculation can be broken down further. Numbers are > all just defined by rules. I wonder if some calculus may apply. Limit > of function approaches one for a finite number of steps. There may be > a whole range of algorithms, is there one for the number two? Then > the number three? > > Occasionally I have made useless machinery. Great puzzle work. Made > the game of life in hardware. >
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-15 18:33 +0200 |
| Message-ID | <20260215183335.0000649e@yahoo.com> |
| In reply to | #114987 |
On Sun, 15 Feb 2026 15:35:26 +0200 Michael S <already5chosen@yahoo.com> wrote: > On Sun, 15 Feb 2026 03:58:28 -0500 > Robert Finch <robfi680@gmail.com> wrote: > > > On 2026-02-14 4:57 a.m., Thomas Koenig wrote: > > > Some people are throwing large amounts of computing power at > > > trying to disprove the Collatz conjecture. Graphics cards have > > > sped this up enormously. > > > > > > (A reminder, if any is needed: From a starting number n, the > > > transformation, recursively applied, > > > > > > > > > f(n) = 3*n+1, if n is odd; =n/2 if n is even. > > > > > > is conjectured to always reach 1 for any positive integer 1). > > > > > > All the work is done on general-purpose hardware, which has many > > > capabilities that are not needed, and consume area and power that > > > special-purpose hardware would not need. Also, the hardware > > > is limited to 64-bit integers, and the range of tested numbers > > > is now up to 2^71, so > > > > > > What could specialized hardware look like? > > > > > > The central part could be a unit which is handed a block of > > > numbers to check, for example a range of 2^32 numbers. The > > > current records for number of iterations and maximum number > > > reached would also be stored in that unit. > > > > > > In that unit, the calculaiton would be performed by a combination > > > of adder and shifter. The input would always be odd (so the last > > > bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1). > > > > > > The adder would compute (3*n+1)/2, and would be specialzed 128-bit > > > adder. Why specialized? The logic is far less complex than a > > > general-purpose adder. For example, a two-bit adder has the > > > formula > > > > > > y1 = (!a1&a0) | (!c_in&a1&!a0); > > > y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0); > > > > > > and its propagate and generate bits are > > > > > > p = (a1); > > > g = (a1&a0); > > > > > > The maximum number of trailing zeros after the (3*n+1)/2 step is > > > three and can be determined cheaply from bits <3:1> of the input, > > > so the input to the shifter (two levels only) can be provided > > > in parallel. The machine would be designed that adding and > > > shifting can be done in a single cycle. > > > > > > The number of iterations would be tracked; if larger than a > > > prevous record, the calculation would terminate with a > > > corresponding signal to the outside world. > > > > > > Also, each cycle, the result from the previous round would be > > > compared against the input value. If it is lower, the next number > > > would be chosen. Finally, the number reached would be checked > > > against the previous, pre-set maximum, with would also signal to > > > the outside world. > > > > > > Which numbers to chose? It is well known that only certain > > > numbers with remainders modulo 2^n are interesting. The number > > > can be seen on OEIS A076227. For example, modulo 2^10, only 64 > > > numbers are "interesting", or 6.25% (but only eight bits would > > > be needed to store them because the last two are always one). > > > Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the > > > amount of numbers to be checked by half. One would use a small > > > ROM, I presume. The number of bits would be a tradeoff between > > > area and amount of compuation. My current guess would be that > > > 2^10 could be a sweet spot. One third of the numbers would > > > further be excluded by checking for divisiblity by three. > > > > > > The problem is embarassingly parallel. One could chose a size so > > > each parcel is calculated in a few seconds or even minutes. to > > > avoid having to avoid a large amount of communication with the > > > outside. > > > > > > Because all units would be working all the time, and if one has > > > a smaller unit one could simply build more of them, I suspect that > > > power would be the main limiting factor. An adder design that > > > might not be very fast, but uses lower power, low voltage and low > > > frequency would be beneficial. > > > > > > Comments? > > > > > I wonder how difficult it would be to implement a parallel tester in > > an FPGA. It looks simple enough and there are hundreds of DSP > > blocks available to use. Could test in blocks of hundreds of > > numbers at a time. Running at 100 MHz * 200 tests at a time = how > > long to get to 2^71? > > > > I am not totally sure what exactly you are asking, but if you are > asking about how much time it takes at that rate to do 2**71 > iterations then the answer is ~3743 years. > But you probably already figured it out yourself. > > Of course, brute force demonstration of correction of conjecture for > all numbers in range 1 to 2**71 will take more steps than 2**71. And > it is unknown how much more. > "Unknown" is a correct theoretically answer, but I can provide a very educated guess that on average it would take ~500 steps and the worst case would not exceed 2000 steps. So, bad news are that overall it will take more than a million years on your machine. And good news are that it'll take less than 2 million years.
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2026-02-15 19:19 +0000 |
| Message-ID | <10mt67k$996a$1@dont-email.me> |
| In reply to | #114994 |
Michael S <already5chosen@yahoo.com> schrieb: > "Unknown" is a correct theoretically answer, but I can provide a very > educated guess that on average it would take ~500 steps and the worst > case would not exceed 2000 steps. > So, bad news are that overall it will take more than a million years on > your machine. And good news are that it'll take less than 2 million > years. Not as bad, very many simplifications are possible. For one, you can discount all even numbers as starting values. You can go further: If you look at residues modulo 2^n of your starting numbers, there are only A076227(n) "interesting" remaiders that you need to look at. If you take n=39, only 3611535862 cases need to be looked at, so only 3611535862/2^39 of all cases need to be looked at, a reduction of a factor around 150. You can also do some more reductions (only 2/3 of those numbers need checking if you can calculate the remainder by 3 fast). There are different tradeoffs for different n, of course (a factor of 16 for n=10, for example). The main key is parallelization. -- This USENET posting was made without artificial intelligence, artificial impertinence, artificial arrogance, artificial stupidity, artificial flavorings or artificial colorants.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-16 12:31 +0200 |
| Message-ID | <20260216123131.00003e22@yahoo.com> |
| In reply to | #114999 |
On Sun, 15 Feb 2026 19:19:16 -0000 (UTC) Thomas Koenig <tkoenig@netcologne.de> wrote: > Michael S <already5chosen@yahoo.com> schrieb: > > > "Unknown" is a correct theoretically answer, but I can provide a > > very educated guess that on average it would take ~500 steps and > > the worst case would not exceed 2000 steps. > > So, bad news are that overall it will take more than a million > > years on your machine. And good news are that it'll take less than > > 2 million years. > > Not as bad, very many simplifications are possible. > > For one, you can discount all even numbers as starting values. O.k. > You can go further: If you look at residues modulo 2^n of your > starting numbers, there are only A076227(n) "interesting" remaiders > that you need to look at. Can not say that I understood. > If you take n=39, only 3611535862 cases > need to be looked at, so only 3611535862/2^39 of all cases need to > be looked at, a reduction of a factor around 150. You can also > do some more reductions (only 2/3 of those numbers need checking > if you can calculate the remainder by 3 fast). There are different > tradeoffs for different n, of course (a factor of 16 for n=10, > for example). > The estimate above was done for loop that finishes at 1. For brute-force disproof of Collatz conjecture it is unnecessary. We can stop at n < n0. I think, you mentioned it in your original post, but I didn't pay attention. Due to early exit mean number of iteration is cut from ~500 to 7 or at worst 8. > The main key is parallelization. And here is a problem. If one want to finish in, say, 1 month, one would need ALOT of parallelization. Even after all improvements we are talking about several millions of testing nodes. At such scale I can think about two main problems: 1. Dispatcher become a bottleneck 2. Some node will fail. It has to be detected and their jobs re-allocated. Which further complicates a scheduler. It does not sound particularly hard if somebody pays for engineering, esp. if the whole work done in one dedicated data center. Otherwise it does not sound easy. The worst part is that that there is nothing special about 2**71. Or about 2**80 for that matter. Tim Reintch already said that better than I can. The machine is not merely useless, it is pointless.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-17 15:41 +0200 |
| Message-ID | <20260217154123.00000257@yahoo.com> |
| In reply to | #115005 |
On Mon, 16 Feb 2026 12:31:31 +0200 Michael S <already5chosen@yahoo.com> wrote: > On Sun, 15 Feb 2026 19:19:16 -0000 (UTC) > Thomas Koenig <tkoenig@netcologne.de> wrote: > > > Michael S <already5chosen@yahoo.com> schrieb: > > > > > "Unknown" is a correct theoretically answer, but I can provide a > > > very educated guess that on average it would take ~500 steps and > > > the worst case would not exceed 2000 steps. > > > So, bad news are that overall it will take more than a million > > > years on your machine. And good news are that it'll take less than > > > 2 million years. > > > > Not as bad, very many simplifications are possible. > > > > For one, you can discount all even numbers as starting values. > > O.k. > > > You can go further: If you look at residues modulo 2^n of your > > starting numbers, there are only A076227(n) "interesting" remaiders > > that you need to look at. > > Can not say that I understood. > > > If you take n=39, only 3611535862 cases > > need to be looked at, so only 3611535862/2^39 of all cases need to > > be looked at, a reduction of a factor around 150. You can also > > do some more reductions (only 2/3 of those numbers need checking > > if you can calculate the remainder by 3 fast). There are different > > tradeoffs for different n, of course (a factor of 16 for n=10, > > for example). > > > Hopefully now I figured out what you are talking about, except of 2/3 part and apart from the meaning of A076227(n). I experimented a little with these ideas and it seems that reduction in number of steps is nowhere near the reduction in number of tested cases. My test vectors were 2**32 consecutive points starting at 2**52 or 2**53. I tried different modulo filters in range from 2**5 to 2**31. Two modes of filtering were used. In "simple" mode I just filtered out "non-interesting" inputs, but did nothing special about "interesting" inputs. In "advanced" mode I accelerated processing of "interesting" inputs by means of starting their testing from point x0= (x/M)*A[x%M]+B[x%M], where tables A[] and B[] were pre-calculated during filter build step. Obviously, "advanced" mode filtering requires ~3 times more memory plus multiplier + plus some logic. Here are my results: $ ./nsteps2 4503599627370496 0x100000000 Initial value = 4503599627370496 (0010000000000000). #values = 4294967296 (0000000100000000) Unfiltered : 22501638882 steps. 4294967296 tests. 5.239 steps/iter n*4+3 : 17132929762 steps. 1073741824 tests. 15.956 steps/iter n*2**5+m : 15924970210 steps. 939524096 tests. 16.950 steps/iter n*2**5+m * : 8408777442 steps. 939524096 tests. 8.950 steps/iter n*2**10+m : 11948770018 steps. 503316480 tests. 23.740 steps/iter n*2**10+m * : 7088288929 steps. 503316480 tests. 14.083 steps/iter n*2**15+m : 9003713250 steps. 289013760 tests. 31.153 steps/iter n*2**15+m * : 4818894105 steps. 289013760 tests. 16.674 steps/iter n*2**20+m : 6973854434 steps. 181379072 tests. 38.449 steps/iter n*2**20+m * : 3496505674 steps. 181379072 tests. 19.277 steps/iter n*2**25+m : 5574414306 steps. 125809408 tests. 44.308 steps/iter n*2**25+m * : 2543845543 steps. 125809408 tests. 20.220 steps/iter n*2**29+m : 4660796778 steps. 95710352 tests. 48.697 steps/iter n*2**29+m * : 2008917951 steps. 95710352 tests. 20.990 steps/iter n*2**30+m : 4493358006 steps. 89757100 tests. 50.061 steps/iter n*2**30+m * : 1915335033 steps. 89757100 tests. 21.339 steps/iter n*2**31+m : 4212222570 steps. 81419608 tests. 51.735 steps/iter n*2**31+m * : 1780554493 steps. 81419608 tests. 21.869 steps/iter $ ./nsteps2 0x20000000000000 0x100000000 Initial value = 9007199254740992 (0020000000000000). #values = 4294967296 (0000000100000000) Unfiltered : 22501540484 steps. 4294967296 tests. 5.239 steps/iter n*4+3 : 17132831364 steps. 1073741824 tests. 15.956 steps/iter n*2**5+m : 15924871812 steps. 939524096 tests. 16.950 steps/iter n*2**5+m * : 8408679044 steps. 939524096 tests. 8.950 steps/iter n*2**10+m : 11948671620 steps. 503316480 tests. 23.740 steps/iter n*2**10+m * : 7088271089 steps. 503316480 tests. 14.083 steps/iter n*2**15+m : 9003614852 steps. 289013760 tests. 31.153 steps/iter n*2**15+m * : 4818766141 steps. 289013760 tests. 16.673 steps/iter n*2**20+m : 6973756036 steps. 181379072 tests. 38.449 steps/iter n*2**20+m * : 3496390679 steps. 181379072 tests. 19.277 steps/iter n*2**25+m : 5574315908 steps. 125809408 tests. 44.308 steps/iter n*2**25+m * : 2543741535 steps. 125809408 tests. 20.219 steps/iter n*2**29+m : 4660698380 steps. 95710352 tests. 48.696 steps/iter n*2**29+m * : 2008935493 steps. 95710352 tests. 20.990 steps/iter n*2**30+m : 4493259608 steps. 89757100 tests. 50.060 steps/iter n*2**30+m * : 1915241920 steps. 89757100 tests. 21.338 steps/iter n*2**31+m : 4212124172 steps. 81419608 tests. 51.734 steps/iter n*2**31+m * : 1780581916 steps. 81419608 tests. 21.869 steps/iter 'Advanced' mode marked with '*' As you can see, 2**10 filter reduces # of processed points by factor of 8.5, but number of processing steps reduced only 1.9x in simple mode and 3.2x in advanced mode. For 2**20 filter ratios are, respectively, 23.7, 3.2 and 6.4 For 2**31 filter, 52.7, 5.3 and 12.6 Pay attention that for implementation in FPGA even filtering with modulo 2**29 (~12M entries) is impractical. Assuming advanced mode, 12M entries occupy 1000 Mbits. In non-expensive FPGA it simply does not fit. In very expensive FPGA it fits, but barely. Let's take, for example, Altera Stratix-10 GX 2800. By the standard of previous decade, it is a real behemoth with 2.75M Logic Elements (LEs). Assuming that each tester unit occupies 500 LEs there is room for 5500 units. Looking into my tables above, you want 1 filter per ~20 tester units, so 275 filters needed. And how many available? Just two! For modulo 2**20 it's a little better. Each filter consists of ~45,000 entries and occupies ~2.7 Mbits. So you can fit 80+ filters in GX 2800. But you want 5500 / 19 = 290, so still unbalanced. For modulo 2**15 (2205 entries, ~100 Kbits) there is enough memory to fit desired number of replica. But small filter like that is significantly less effective, reducing # of steps (in advanced mode) only by 4.7x. It is possible that there exist smarter algorithms for partition of work between parallel tester units that can greatly reduce need for access to filtering tables relatively to simple scheme that I assumed in above estimates. But I think that they point stands - this sort of filtering is not a silver bullet. It can improve throughput, but it's unlikely that it can improve it by more than one order of magnitude.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-17 18:17 +0200 |
| Message-ID | <20260217181710.00007342@yahoo.com> |
| In reply to | #115016 |
On Tue, 17 Feb 2026 15:41:23 +0200 Michael S <already5chosen@yahoo.com> wrote: > On Mon, 16 Feb 2026 12:31:31 +0200 > Michael S <already5chosen@yahoo.com> wrote: > > > On Sun, 15 Feb 2026 19:19:16 -0000 (UTC) > > Thomas Koenig <tkoenig@netcologne.de> wrote: > > > > > Michael S <already5chosen@yahoo.com> schrieb: > > > > > > > "Unknown" is a correct theoretically answer, but I can provide a > > > > very educated guess that on average it would take ~500 steps and > > > > the worst case would not exceed 2000 steps. > > > > So, bad news are that overall it will take more than a million > > > > years on your machine. And good news are that it'll take less > > > > than 2 million years. > > > > > > Not as bad, very many simplifications are possible. > > > > > > For one, you can discount all even numbers as starting values. > > > > O.k. > > > > > You can go further: If you look at residues modulo 2^n of your > > > starting numbers, there are only A076227(n) "interesting" > > > remaiders that you need to look at. > > > > Can not say that I understood. > > > > > If you take n=39, only 3611535862 cases > > > need to be looked at, so only 3611535862/2^39 of all cases need to > > > be looked at, a reduction of a factor around 150. You can also > > > do some more reductions (only 2/3 of those numbers need checking > > > if you can calculate the remainder by 3 fast). There are > > > different tradeoffs for different n, of course (a factor of 16 > > > for n=10, for example). > > > > > > > Hopefully now I figured out what you are talking about, except of 2/3 > part and apart from the meaning of A076227(n). > I experimented a little with these ideas and it seems that reduction > in number of steps is nowhere near the reduction in number of tested > cases. My test vectors were 2**32 consecutive points starting at > 2**52 or 2**53. I tried different modulo filters in range from 2**5 > to 2**31. Two modes of filtering were used. > In "simple" mode I just filtered out "non-interesting" inputs, but > did nothing special about "interesting" inputs. > In "advanced" mode I accelerated processing of "interesting" inputs by > means of starting their testing from point x0= (x/M)*A[x%M]+B[x%M], > where tables A[] and B[] were pre-calculated during filter build step. > Obviously, "advanced" mode filtering requires ~3 times more memory > plus multiplier + plus some logic. > > Here are my results: > $ ./nsteps2 4503599627370496 0x100000000 > Initial value = 4503599627370496 (0010000000000000). > #values = 4294967296 (0000000100000000) > Unfiltered : 22501638882 steps. 4294967296 tests. 5.239 steps/iter > n*4+3 : 17132929762 steps. 1073741824 tests. 15.956 steps/iter > n*2**5+m : 15924970210 steps. 939524096 tests. 16.950 steps/iter > n*2**5+m * : 8408777442 steps. 939524096 tests. 8.950 steps/iter > n*2**10+m : 11948770018 steps. 503316480 tests. 23.740 steps/iter > n*2**10+m * : 7088288929 steps. 503316480 tests. 14.083 steps/iter > n*2**15+m : 9003713250 steps. 289013760 tests. 31.153 steps/iter > n*2**15+m * : 4818894105 steps. 289013760 tests. 16.674 steps/iter > n*2**20+m : 6973854434 steps. 181379072 tests. 38.449 steps/iter > n*2**20+m * : 3496505674 steps. 181379072 tests. 19.277 steps/iter > n*2**25+m : 5574414306 steps. 125809408 tests. 44.308 steps/iter > n*2**25+m * : 2543845543 steps. 125809408 tests. 20.220 steps/iter > n*2**29+m : 4660796778 steps. 95710352 tests. 48.697 steps/iter > n*2**29+m * : 2008917951 steps. 95710352 tests. 20.990 steps/iter > n*2**30+m : 4493358006 steps. 89757100 tests. 50.061 steps/iter > n*2**30+m * : 1915335033 steps. 89757100 tests. 21.339 steps/iter > n*2**31+m : 4212222570 steps. 81419608 tests. 51.735 steps/iter > n*2**31+m * : 1780554493 steps. 81419608 tests. 21.869 steps/iter > > $ ./nsteps2 0x20000000000000 0x100000000 > Initial value = 9007199254740992 (0020000000000000). > #values = 4294967296 (0000000100000000) > Unfiltered : 22501540484 steps. 4294967296 tests. 5.239 steps/iter > n*4+3 : 17132831364 steps. 1073741824 tests. 15.956 steps/iter > n*2**5+m : 15924871812 steps. 939524096 tests. 16.950 steps/iter > n*2**5+m * : 8408679044 steps. 939524096 tests. 8.950 steps/iter > n*2**10+m : 11948671620 steps. 503316480 tests. 23.740 steps/iter > n*2**10+m * : 7088271089 steps. 503316480 tests. 14.083 steps/iter > n*2**15+m : 9003614852 steps. 289013760 tests. 31.153 steps/iter > n*2**15+m * : 4818766141 steps. 289013760 tests. 16.673 steps/iter > n*2**20+m : 6973756036 steps. 181379072 tests. 38.449 steps/iter > n*2**20+m * : 3496390679 steps. 181379072 tests. 19.277 steps/iter > n*2**25+m : 5574315908 steps. 125809408 tests. 44.308 steps/iter > n*2**25+m * : 2543741535 steps. 125809408 tests. 20.219 steps/iter > n*2**29+m : 4660698380 steps. 95710352 tests. 48.696 steps/iter > n*2**29+m * : 2008935493 steps. 95710352 tests. 20.990 steps/iter > n*2**30+m : 4493259608 steps. 89757100 tests. 50.060 steps/iter > n*2**30+m * : 1915241920 steps. 89757100 tests. 21.338 steps/iter > n*2**31+m : 4212124172 steps. 81419608 tests. 51.734 steps/iter > n*2**31+m * : 1780581916 steps. 81419608 tests. 21.869 steps/iter > 'Advanced' mode marked with '*' > > As you can see, 2**10 filter reduces # of processed points by factor > of 8.5, but number of processing steps reduced only 1.9x in simple > mode and 3.2x in advanced mode. > For 2**20 filter ratios are, respectively, 23.7, 3.2 and 6.4 > For 2**31 filter, 52.7, 5.3 and 12.6 > There was bug in my testing code. In reality filtering is somewhat more efficient. Here are corrected results: Initial value = 4503599627370496 (0010000000000000). #values = 4294967296 (0000000100000000) Unfiltered : 22501638882 steps. 4294967296 tests. 5.239 steps/iter n*4+3 : 17132929762 steps. 1073741824 tests. 15.956 steps/iter n*2**5+m : 13374833378 steps. 536870912 tests. 24.913 steps/iter n*2**5+m * : 8408777442 steps. 536870912 tests. 15.663 steps/iter n*2**10+m : 9935504098 steps. 268435456 tests. 37.013 steps/iter n*2**10+m * : 5187551970 steps. 268435456 tests. 19.325 steps/iter n*2**15+m : 7857750754 steps. 169738240 tests. 46.293 steps/iter n*2**15+m * : 3449537250 steps. 169738240 tests. 20.323 steps/iter n*2**20+m : 6242468578 steps. 111935488 tests. 55.768 steps/iter n*2**20+m * : 2410906338 steps. 111935488 tests. 21.538 steps/iter n*2**25+m : 4838650338 steps. 73364736 tests. 65.953 steps/iter n*2**25+m * : 1716934242 steps. 73364736 tests. 23.403 steps/iter n*2**29+m : 3856366938 steps. 51085096 tests. 75.489 steps/iter n*2**29+m * : 1338119106 steps. 51085096 tests. 26.194 steps/iter n*2**30+m : 3856366938 steps. 51085096 tests. 75.489 steps/iter n*2**30+m * : 1261491462 steps. 51085096 tests. 24.694 steps/iter n*2**31+m : 3666319938 steps. 47284156 tests. 77.538 steps/iter n*2**31+m * : 1184863818 steps. 47284156 tests. 25.058 steps/iter Initial value = 9007199254740992 (0020000000000000). #values = 4294967296 (0000000100000000) Unfiltered : 22501540484 steps. 4294967296 tests. 5.239 steps/iter n*4+3 : 17132831364 steps. 1073741824 tests. 15.956 steps/iter n*2**5+m : 13374734980 steps. 536870912 tests. 24.912 steps/iter n*2**5+m * : 8408679044 steps. 536870912 tests. 15.662 steps/iter n*2**10+m : 9935405700 steps. 268435456 tests. 37.012 steps/iter n*2**10+m * : 5187453572 steps. 268435456 tests. 19.325 steps/iter n*2**15+m : 7857652356 steps. 169738240 tests. 46.293 steps/iter n*2**15+m * : 3449438852 steps. 169738240 tests. 20.322 steps/iter n*2**20+m : 6242370180 steps. 111935488 tests. 55.768 steps/iter n*2**20+m * : 2410807940 steps. 111935488 tests. 21.537 steps/iter n*2**25+m : 4838551940 steps. 73364736 tests. 65.952 steps/iter n*2**25+m * : 1716835844 steps. 73364736 tests. 23.401 steps/iter n*2**29+m : 3856268540 steps. 51085096 tests. 75.487 steps/iter n*2**29+m * : 1338020708 steps. 51085096 tests. 26.192 steps/iter n*2**30+m : 3856268540 steps. 51085096 tests. 75.487 steps/iter n*2**30+m * : 1261393064 steps. 51085096 tests. 24.692 steps/iter n*2**31+m : 3666221540 steps. 47284156 tests. 77.536 steps/iter n*2**31+m * : 1184765420 steps. 47284156 tests. 25.056 steps/iter 2**10 filter reduces # of processed points by factor of 16.0, but number of processing steps reduced only 2.7x in simple mode and 4.3x in advanced mode. For 2**20 filter ratios are, respectively, 38.4, 3.6 and 9.3 For 2**31 filter: 90.8, 6.1 and 19.0 > Pay attention that for implementation in FPGA even filtering with > modulo 2**29 (~12M entries) is impractical. Assuming advanced mode, > 12M entries occupy 1000 Mbits. In non-expensive FPGA it simply does > not fit. In very expensive FPGA it fits, but barely. > > Let's take, for example, Altera Stratix-10 GX 2800. By the standard of > previous decade, it is a real behemoth with 2.75M Logic Elements > (LEs). Assuming that each tester unit occupies 500 LEs there is room > for 5500 units. Looking into my tables above, you want 1 filter per > ~20 tester units, so 275 filters needed. And how many available? Just > two! > And here I made even bigger mistake. The size of internal memory of GX 2800 is 229 Mbit. So 2**29 filter does not fit at all. > For modulo 2**20 it's a little better. Each filter consists of ~45,000 > entries and occupies ~2.7 Mbits. So you can fit 80+ filters in GX > 2800. But you want 5500 / 19 = 290, so still unbalanced. > Here correct numbers are as folowing: Each filter consists of 27,328 entries and occupies ~1.7 Mbits. So you can fit 130+ filters in GX 2800. But you want 5500 / 19 = 290, so still unbalanced. > For modulo 2**15 (2205 entries, ~100 Kbits) there is enough memory to > fit desired number of replica. But small filter like that is > significantly less effective, reducing # of steps (in advanced mode) > only by 4.7x. > Here it should be: For modulo 2**15 (1295 entries, ~60 Kbits) there is enough memory to fit desired number of replica. But small filter like that is significantly less effective, reducing # of steps (in advanced mode) only by 6.5x. > It is possible that there exist smarter algorithms for partition of > work between parallel tester units that can greatly reduce need for > access to filtering tables relatively to simple scheme that I assumed > in above estimates. But I think that they point stands - this sort of > filtering is not a silver bullet. It can improve throughput, but it's > unlikely that it can improve it by more than one order of magnitude. > Despite corrected mistakes my conclusions are the same.
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2026-02-15 22:41 +0000 |
| Message-ID | <2026Feb15.234140@mips.complang.tuwien.ac.at> |
| In reply to | #114982 |
Robert Finch <robfi680@gmail.com> writes: >I wonder how difficult it would be to implement a parallel tester in an >FPGA. It looks simple enough and there are hundreds of DSP blocks >available to use. Could test in blocks of hundreds of numbers at a time. >Running at 100 MHz * 200 tests at a time = how long to get to 2^71? OTOH, with a PC with 8 cores running at 4GHZ, even with scalar operations you might outperform that. And with SIMD probably even more. The main problem is that support for 128-bit integers is not so great for scalar instructions and worse for SIMD instructions. - anton -- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <terje.mathisen@tmsw.no> |
|---|---|
| Date | 2026-02-15 14:52 +0100 |
| Message-ID | <10msj2v$1va9$1@dont-email.me> |
| In reply to | #114964 |
Thomas Koenig wrote: > Some people are throwing large amounts of computing power at trying > to disprove the Collatz conjecture. Graphics cards have sped this > up enormously. > > (A reminder, if any is needed: From a starting number n, the > transformation, recursively applied, > > > f(n) = 3*n+1, if n is odd; =n/2 if n is even. > > is conjectured to always reach 1 for any positive integer 1). > > All the work is done on general-purpose hardware, which has many > capabilities that are not needed, and consume area and power that > special-purpose hardware would not need. Also, the hardware > is limited to 64-bit integers, and the range of tested numbers > is now up to 2^71, so > > What could specialized hardware look like? > > The central part could be a unit which is handed a block of numbers > to check, for example a range of 2^32 numbers. The current records > for number of iterations and maximum number reached would also be > stored in that unit. > > In that unit, the calculaiton would be performed by a combination > of adder and shifter. The input would always be odd (so the last > bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1). > > The adder would compute (3*n+1)/2, and would be specialzed 128-bit > adder. Why specialized? The logic is far less complex than a > general-purpose adder. For example, a two-bit adder has the formula > > y1 = (!a1&a0) | (!c_in&a1&!a0); > y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0); > > and its propagate and generate bits are > > p = (a1); > g = (a1&a0); > > The maximum number of trailing zeros after the (3*n+1)/2 step is > three and can be determined cheaply from bits <3:1> of the input, > so the input to the shifter (two levels only) can be provided > in parallel. The machine would be designed that adding and > shifting can be done in a single cycle. > > The number of iterations would be tracked; if larger than a prevous > record, the calculation would terminate with a corresponding signal > to the outside world. > > Also, each cycle, the result from the previous round would be > compared against the input value. If it is lower, the next number > would be chosen. Finally, the number reached would be checked against > the previous, pre-set maximum, with would also signal to the outside > world. > > Which numbers to chose? It is well known that only certain > numbers with remainders modulo 2^n are interesting. The number > can be seen on OEIS A076227. For example, modulo 2^10, only 64 > numbers are "interesting", or 6.25% (but only eight bits would > be needed to store them because the last two are always one). > Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the > amount of numbers to be checked by half. One would use a small ROM, > I presume. The number of bits would be a tradeoff between area and > amount of compuation. My current guess would be that 2^10 could > be a sweet spot. One third of the numbers would further be > excluded by checking for divisiblity by three. > > The problem is embarassingly parallel. One could chose a size so > each parcel is calculated in a few seconds or even minutes. to avoid > having to avoid a large amount of communication with the outside. > > Because all units would be working all the time, and if one has > a smaller unit one could simply build more of them, I suspect that > power would be the main limiting factor. An adder design that > might not be very fast, but uses lower power, low voltage and low > frequency would be beneficial. > > Comments? > I wrote a 3np1.c program a long time ago, with manual handling of the 64-128 bit boundary. It used lookup tables with multiplication, addition and shift values to be applied, since the bottom N bits directly control what will happen on the next iterations. Worst case, if the bottom 8 bits are zero, then we'd do 8 times SHR 1, while at the other extreme we end up with 8 iterations of n+(n+1)/2, so approximately 1.5^8 which is around a factor of 20 growth over 16 iterations, right? I found my old code because I recently wrote a much simpler rust version of the same algorithm, this one was not branchless and stopped as soon as it found a number needing at least 64 bits. As I worked on it I had this very strong deja vue feeling. :-) Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2026-02-15 16:45 +0000 |
| Message-ID | <10mst7j$5pef$1@dont-email.me> |
| In reply to | #114988 |
Terje Mathisen <terje.mathisen@tmsw.no> schrieb: > I wrote a 3np1.c program a long time ago, with manual handling of the > 64-128 bit boundary. I believe a very large number of people have tried their hand at this :-) > It used lookup tables with multiplication, addition and shift values to > be applied, since the bottom N bits directly control what will happen on > the next iterations. > > Worst case, if the bottom 8 bits are zero, There was an error in my previous post; I thought that the maximum numbre of shifts was limited. This is not true, because 3*n+1 can be 2^n (obviously). So, the number is not necessarily limited, not be 3 (as I thought) but also not by 8. However, these numbers would occur quite rarely, so it could make sense to still limit the shifts to three, but just bypass the adder if the lowst number is still zero. > then we'd do 8 times SHR 1, > while at the other extreme we end up with 8 iterations of n+(n+1)/2, so > approximately 1.5^8 which is around a factor of 20 growth over 16 > iterations, right? But that number is equal to the number of trailing ones in binary, which can be larger than eight. [...] -- This USENET posting was made without artificial intelligence, artificial impertinence, artificial arrogance, artificial stupidity, artificial flavorings or artificial colorants.
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <terje.mathisen@tmsw.no> |
|---|---|
| Date | 2026-02-15 18:01 +0100 |
| Message-ID | <10msu5t$65om$1@dont-email.me> |
| In reply to | #114995 |
Thomas Koenig wrote: > Terje Mathisen <terje.mathisen@tmsw.no> schrieb: > >> I wrote a 3np1.c program a long time ago, with manual handling of the >> 64-128 bit boundary. > > I believe a very large number of people have tried their hand at this > :-) > >> It used lookup tables with multiplication, addition and shift values to >> be applied, since the bottom N bits directly control what will happen on >> the next iterations. >> >> Worst case, if the bottom 8 bits are zero, > > There was an error in my previous post; I thought that the maximum > numbre of shifts was limited. This is not true, because 3*n+1 can > be 2^n (obviously). So, the number is not necessarily limited, not be 3 > (as I thought) but also not by 8. > > However, these numbers would occur quite rarely, so it could make sense > to still limit the shifts to three, but just bypass the adder if > the lowst number is still zero. > >> then we'd do 8 times SHR 1, >> while at the other extreme we end up with 8 iterations of n+(n+1)/2, so >> approximately 1.5^8 which is around a factor of 20 growth over 16 >> iterations, right? > > But that number is equal to the number of trailing ones in binary, > which can be larger than eight. Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs and a SHRD/SHR, so maybe 8-10 cycles? Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2026-02-15 19:42 +0000 |
| Message-ID | <10mt7jn$996a$2@dont-email.me> |
| In reply to | #114997 |
Terje Mathisen <terje.mathisen@tmsw.no> schrieb: > Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR > > Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs > and a SHRD/SHR, so maybe 8-10 cycles? https://link.springer.com/article/10.1007/s11227-025-07337-0 gives an improved algorithm, which they are using for the supercomputer version. They are doing: n = n0 repeat n = n + 1 alpha = ctz(n) n = n * 3^alpha / 2^alpha n = n - 1 beta = ctz(n) n = n / 2^beta until n < n_0 They are also playing some strange tricks with parallel solving that are not described clearly, or I am not clever enough to understand what they are doing. I would have to look at their code, I guess. -- This USENET posting was made without artificial intelligence, artificial impertinence, artificial arrogance, artificial stupidity, artificial flavorings or artificial colorants.
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <terje.mathisen@tmsw.no> |
|---|---|
| Date | 2026-02-15 22:59 +0100 |
| Message-ID | <10mtfjo$ck6h$1@dont-email.me> |
| In reply to | #115000 |
Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>
>> Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR
>>
>> Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs
>> and a SHRD/SHR, so maybe 8-10 cycles?
>
> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
> an improved algorithm, which they are using for the supercomputer
> version. They are doing:
>
> n = n0
> repeat
> n = n + 1
> alpha = ctz(n)
> n = n * 3^alpha / 2^alpha
Strange syntax...
ctz(n) must count trailing zeroes, right?
If so, then ctz(n+1) will return 1 or more for odd n, so that gives
3n+3, which after the division by 2 returns a value which is one too high.
> n = n - 1
but this is corrected here!
The key must be that the multiplication by a power of three also works
for n with multiple trailing 1 bits.
The "x/2^alpha" part is of course just x>>alpha.
For an even n they still do a multiplication by 1 and a SHR 0, but the
code is branchless.
> beta = ctz(n)
> n = n / 2^beta
> until n < n_0
I am guessing that the best way to do the n*3^alpha parts is a table of
powers of 3, so
let mut n = n0;
loop {
n += 1;
let alpha = ctz(n);
n = (n * power_of_three[alpha]) >> alpha;
n -= 1;
beta = ctz(n);
n >>= beta;
if n <= n_0 {break};
}
>
> They are also playing some strange tricks with parallel solving that
> are not described clearly, or I am not clever enough to understand
> what they are doing. I would have to look at their code, I guess.
>
:-)
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | Robert Finch <robfi680@gmail.com> |
|---|---|
| Date | 2026-02-15 21:53 -0500 |
| Message-ID | <10mu0qp$hqfc$1@dont-email.me> |
| In reply to | #115002 |
On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
> Thomas Koenig wrote:
>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>
>>> Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR
>>>
>>> Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs
>>> and a SHRD/SHR, so maybe 8-10 cycles?
>>
>> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
>> an improved algorithm, which they are using for the supercomputer
>> version. They are doing:
>>
>> n = n0
>> repeat
>> n = n + 1
>> alpha = ctz(n)
>> n = n * 3^alpha / 2^alpha
>
> Strange syntax...
>
> ctz(n) must count trailing zeroes, right?
> If so, then ctz(n+1) will return 1 or more for odd n, so that gives
> 3n+3, which after the division by 2 returns a value which is one too high.
>
>> n = n - 1
>
> but this is corrected here!
>
> The key must be that the multiplication by a power of three also works
> for n with multiple trailing 1 bits.
>
> The "x/2^alpha" part is of course just x>>alpha.
>
> For an even n they still do a multiplication by 1 and a SHR 0, but the
> code is branchless.
>
>> beta = ctz(n)
>> n = n / 2^beta
>> until n < n_0
>
> I am guessing that the best way to do the n*3^alpha parts is a table of
> powers of 3, so
>
> let mut n = n0;
> loop {
> n += 1;
> let alpha = ctz(n);
> n = (n * power_of_three[alpha]) >> alpha;
> n -= 1;
> beta = ctz(n);
> n >>= beta;
> if n <= n_0 {break};
> }
>
>>
>> They are also playing some strange tricks with parallel solving that
>> are not described clearly, or I am not clever enough to understand
>> what they are doing. I would have to look at their code, I guess.
>>
> :-)
>
> Terje
>
How many iterations are done in one loop? I think it would need to be at
least 10 on average to beat the simple approach. The simple approach
only taking one clock per iteration. The 128-bit multiply and 128-bit
count trailing zeros are slow in an FPGA.
And, can the approach consume a more limited number of iterations? For
example counting only the 6 LSB zeros?
I found a mistake in my code handling groups. It can only do about a
24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4
million years?
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-16 12:40 +0200 |
| Message-ID | <20260216124054.000063b6@yahoo.com> |
| In reply to | #115004 |
On Sun, 15 Feb 2026 21:53:10 -0500
Robert Finch <robfi680@gmail.com> wrote:
> On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
> > Thomas Koenig wrote:
> >> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> >>
> >>> Just thinking about a way to replace ~N iterations by a MUL, ADD
> >>> and SHR
> >>>
> >>> Even with 128-bit numbers, I would only need two MULs, two
> >>> ADD/ADC pairs and a SHRD/SHR, so maybe 8-10 cycles?
> >>
> >> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
> >> an improved algorithm, which they are using for the supercomputer
> >> version. They are doing:
> >>
> >> n = n0
> >> repeat
> >> n = n + 1
> >> alpha = ctz(n)
> >> n = n * 3^alpha / 2^alpha
> >
> > Strange syntax...
> >
> > ctz(n) must count trailing zeroes, right?
> > If so, then ctz(n+1) will return 1 or more for odd n, so that gives
> > 3n+3, which after the division by 2 returns a value which is one
> > too high.
> >> n = n - 1
> >
> > but this is corrected here!
> >
> > The key must be that the multiplication by a power of three also
> > works for n with multiple trailing 1 bits.
> >
> > The "x/2^alpha" part is of course just x>>alpha.
> >
> > For an even n they still do a multiplication by 1 and a SHR 0, but
> > the code is branchless.
> >
> >> beta = ctz(n)
> >> n = n / 2^beta
> >> until n < n_0
> >
> > I am guessing that the best way to do the n*3^alpha parts is a
> > table of powers of 3, so
> >
> > let mut n = n0;
> > loop {
> > n += 1;
> > let alpha = ctz(n);
> > n = (n * power_of_three[alpha]) >> alpha;
> > n -= 1;
> > beta = ctz(n);
> > n >>= beta;
> > if n <= n_0 {break};
> > }
> >
> >>
> >> They are also playing some strange tricks with parallel solving
> >> that are not described clearly, or I am not clever enough to
> >> understand what they are doing. I would have to look at their
> >> code, I guess.
> > :-)
> >
> > Terje
> >
> How many iterations are done in one loop?
Worst case - few thousands, or less. I didn't yet encounter 1000.
Mean - 6 or 7. May be, 8, but that's unlikely.
> I think it would need to be
> at least 10 on average to beat the simple approach. The simple
> approach only taking one clock per iteration. The 128-bit multiply
> and 128-bit count trailing zeros are slow in an FPGA.
>
> And, can the approach consume a more limited number of iterations?
> For example counting only the 6 LSB zeros?
>
> I found a mistake in my code handling groups. It can only do about a
> 24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4
> million years?
>
Much less.
My 1st estimate was for more stupid loop that always counts down to 1.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-16 20:34 +0200 |
| Message-ID | <20260216203444.00005b54@yahoo.com> |
| In reply to | #115006 |
On Mon, 16 Feb 2026 12:40:54 +0200
Michael S <already5chosen@yahoo.com> wrote:
> On Sun, 15 Feb 2026 21:53:10 -0500
> Robert Finch <robfi680@gmail.com> wrote:
>
> > On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
> > > Thomas Koenig wrote:
> > >> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> > >>
> > >>> Just thinking about a way to replace ~N iterations by a MUL, ADD
> > >>> and SHR
> > >>>
> > >>> Even with 128-bit numbers, I would only need two MULs, two
> > >>> ADD/ADC pairs and a SHRD/SHR, so maybe 8-10 cycles?
> > >>
> > >> https://link.springer.com/article/10.1007/s11227-025-07337-0
> > >> gives an improved algorithm, which they are using for the
> > >> supercomputer version. They are doing:
> > >>
> > >> n = n0
> > >> repeat
> > >> n = n + 1
> > >> alpha = ctz(n)
> > >> n = n * 3^alpha / 2^alpha
> > >
> > > Strange syntax...
> > >
> > > ctz(n) must count trailing zeroes, right?
> > > If so, then ctz(n+1) will return 1 or more for odd n, so that
> > > gives 3n+3, which after the division by 2 returns a value which
> > > is one too high.
> > >> n = n - 1
> > >
> > > but this is corrected here!
> > >
> > > The key must be that the multiplication by a power of three also
> > > works for n with multiple trailing 1 bits.
> > >
> > > The "x/2^alpha" part is of course just x>>alpha.
> > >
> > > For an even n they still do a multiplication by 1 and a SHR 0, but
> > > the code is branchless.
> > >
> > >> beta = ctz(n)
> > >> n = n / 2^beta
> > >> until n < n_0
> > >
> > > I am guessing that the best way to do the n*3^alpha parts is a
> > > table of powers of 3, so
> > >
> > > let mut n = n0;
> > > loop {
> > > n += 1;
> > > let alpha = ctz(n);
> > > n = (n * power_of_three[alpha]) >> alpha;
> > > n -= 1;
> > > beta = ctz(n);
> > > n >>= beta;
> > > if n <= n_0 {break};
> > > }
> > >
> > >>
> > >> They are also playing some strange tricks with parallel solving
> > >> that are not described clearly, or I am not clever enough to
> > >> understand what they are doing. I would have to look at their
> > >> code, I guess.
> > > :-)
> > >
> > > Terje
> > >
> > How many iterations are done in one loop?
>
> Worst case - few thousands, or less. I didn't yet encounter 1000.
> Mean - 6 or 7. May be, 8, but that's unlikely.
>
The above mean value is for completely non-filtered inputs.
If we apply a very basic filtering, only looking at inputs in form 4*x+3
then an average # of steps is bigger - somewhere in range [15:20].
So, for filtered inputs higher-order steps start to make sense.
Or not. It strongly depends on specific FPGA model.
> > I think it would need to be
> > at least 10 on average to beat the simple approach. The simple
> > approach only taking one clock per iteration. The 128-bit multiply
> > and 128-bit count trailing zeros are slow in an FPGA.
> >
> > And, can the approach consume a more limited number of iterations?
> > For example counting only the 6 LSB zeros?
> >
> > I found a mistake in my code handling groups. It can only do about
> > a 24x24=576 matrix of calculations in parallel @ 200MHz. So that's
> > 1/4 million years?
> >
>
> Much less.
> My 1st estimate was for more stupid loop that always counts down to 1.
>
>
>
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <terje.mathisen@tmsw.no> |
|---|---|
| Date | 2026-02-16 16:08 +0100 |
| Message-ID | <10mvbtk$vipb$1@dont-email.me> |
| In reply to | #115004 |
Robert Finch wrote:
> On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
>> Thomas Koenig wrote:
>>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>>
>>>> Just thinking about a way to replace ~N iterations by a MUL, ADD and
>>>> SHR
>>>>
>>>> Even with 128-bit numbers, I would only need two MULs, two ADD/ADC
>>>> pairs
>>>> and a SHRD/SHR, so maybe 8-10 cycles?
>>>
>>> https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
>>> an improved algorithm, which they are using for the supercomputer
>>> version. They are doing:
>>>
>>> n = n0
>>> repeat
>>> Â Â n = n + 1
>>> Â Â alpha = ctz(n)
>>> Â Â n = n * 3^alpha / 2^alpha
>>
>> Strange syntax...
>>
>> ctz(n) must count trailing zeroes, right?
>> If so, then ctz(n+1) will return 1 or more for odd n, so that gives
>> 3n+3, which after the division by 2 returns a value which is one too
>> high.
>>
>>> Â Â n = n - 1
>>
>> but this is corrected here!
>>
>> The key must be that the multiplication by a power of three also works
>> for n with multiple trailing 1 bits.
>>
>> The "x/2^alpha" part is of course just x>>alpha.
>>
>> For an even n they still do a multiplication by 1 and a SHR 0, but the
>> code is branchless.
>>
>>> Â Â beta = ctz(n)
>>> Â Â n = n / 2^beta
>>> until n < n_0
>>
>> I am guessing that the best way to do the n*3^alpha parts is a table
>> of powers of 3, so
>>
>> let mut n = n0;
>> loop {
>> Â n += 1;
>> Â let alpha = ctz(n);
>> Â n = (n * power_of_three[alpha]) >> alpha;
>> Â n -= 1;
>> Â beta = ctz(n);
>> Â n >>= beta;
>> Â if n <= n_0 {break};
>> }
>>
>>>
>>> They are also playing some strange tricks with parallel solving that
>>> are not described clearly, or I am not clever enough to understand
>>> what they are doing. I would have to look at their code, I guess.
>>>
>> :-)
>>
>> Terje
>>
> How many iterations are done in one loop? I think it would need to be at
> least 10 on average to beat the simple approach. The simple approach
> only taking one clock per iteration. The 128-bit multiply and 128-bit
> count trailing zeros are slow in an FPGA.
Right.
The first part will handle trailing strings of 1's, while the second
part does the same for all trailing zeroes after that initial
mul_by_power_of_three.
>
> And, can the approach consume a more limited number of iterations? For
> example counting only the 6 LSB zeros?
It should work directly with a max(ctz,N) wrapper.
>
> I found a mistake in my code handling groups. It can only do about a
> 24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4
> million years?
FPGAs typically run at much, much lower frequencies than modern CPUs, so
if the code above can average 3-5 iterations per loop, and do it in
around 10 clock cycles, then the FPGA will be hard pressed to keep up?
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | Stephen Fuld <sfuld@alumni.cmu.edu.invalid> |
|---|---|
| Date | 2026-02-16 07:50 -0800 |
| Message-ID | <10mvecq$10h03$1@dont-email.me> |
| In reply to | #114964 |
On 2/14/2026 1:57 AM, Thomas Koenig wrote: > Some people are throwing large amounts of computing power at trying > to disprove the Collatz conjecture. Graphics cards have sped this > up enormously. big snip > > The number of iterations would be tracked; if larger than a prevous > record, the calculation would terminate with a corresponding signal > to the outside world. > > Also, each cycle, the result from the previous round would be > compared against the input value. If it is lower, the next number > would be chosen. Finally, the number reached would be checked against > the previous, pre-set maximum, with would also signal to the outside > world. Hold these thoughts snip > The problem is embarassingly parallel. Sort of. You could do it that way, but you would be giving up some performance. Specifically, as you say, you want to check the number against the previous input value and the preset maximum. But in a parallel implementation, you want to check against the largest value checked *across all the parallel threads*. In order to do that, the threads need to communicate, thus violating the "embarrassingly parallel" rule. I don't know how much performance you would give up so it may not make much difference, nor do I know the cost of keeping some "global" largest value checked so far. There is a tradeoff here. -- - Stephen Fuld (e-mail address disguised to prevent spam)
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-16 18:31 +0200 |
| Message-ID | <20260216183146.00007902@yahoo.com> |
| In reply to | #115008 |
On Mon, 16 Feb 2026 07:50:49 -0800 Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote: > On 2/14/2026 1:57 AM, Thomas Koenig wrote: > > Some people are throwing large amounts of computing power at trying > > to disprove the Collatz conjecture. Graphics cards have sped this > > up enormously. > > big snip > > > > > The number of iterations would be tracked; if larger than a prevous > > record, the calculation would terminate with a corresponding signal > > to the outside world. > > > > Also, each cycle, the result from the previous round would be > > compared against the input value. If it is lower, the next number > > would be chosen. Finally, the number reached would be checked > > against the previous, pre-set maximum, with would also signal to > > the outside world. > > Hold these thoughts > > snip > > > The problem is embarassingly parallel. > > > Sort of. You could do it that way, but you would be giving up some > performance. Specifically, as you say, you want to check the number > against the previous input value and the preset maximum. But in a > parallel implementation, you want to check against the largest value > checked *across all the parallel threads*. In order to do that, the > threads need to communicate, thus violating the "embarrassingly > parallel" rule. I don't know how much performance you would give up > so it may not make much difference, If initial values N0 are above 2**64 and the number of tests NP that run in parallel is few millions or even few hundreds of millions then performance gain achieved by comparison vs N0+NP instead of comparison vs N0 is negligible. > nor do I know the cost of keeping > some "global" largest value checked so far. There is a tradeoff here. >
[toc] | [prev] | [next] | [standalone]
Page 4 of 5 — ← Prev page 1 2 3 [4] 5 Next page →
Back to top | Article view | comp.arch
csiph-web