Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.arch > #114964 > unrolled thread
| Started by | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| First post | 2026-02-14 09:57 +0000 |
| Last post | 2026-02-18 21:56 +0100 |
| Articles | 20 on this page of 81 — 12 participants |
Back to article view | Back to comp.arch
A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-14 09:57 +0000
Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-14 18:48 +0000
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-14 20:18 +0100
Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-14 21:22 +0000
Re: A useless machine Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-02-14 13:30 -0800
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-15 12:50 +0100
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 13:48 -0600
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-21 21:40 +0100
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 15:17 -0600
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 04:50 -0600
Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:01 -0700
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-22 14:43 +0100
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 12:38 -0600
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-23 10:51 +0100
Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-26 01:27 +0000
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-26 09:21 +0100
Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:11 -0700
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 14:58 +0100
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-21 12:27 -0600
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-22 15:14 +0100
Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-22 11:41 -0500
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-23 10:53 +0100
Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-23 10:52 -0500
Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-02-14 21:19 -0800
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 03:58 -0500
Re: A useless machine David Brown <david.brown@hesbynett.no> - 2026-02-15 13:14 +0100
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 15:46 +0000
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 10:56 -0500
Re: A useless machine MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-02-15 19:16 +0000
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 22:40 +0100
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-26 19:31 +0000
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-18 18:22 +0200
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-18 22:56 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-19 12:10 +0200
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-21 01:56 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-21 19:23 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-21 20:36 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 10:46 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-22 12:45 +0100
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 16:35 +0200
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-22 19:12 -0500
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-22 19:47 -0500
Re: A useless machine EricP <ThatWouldBeTelling@thevillage.com> - 2026-02-21 15:39 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 03:16 +0200
Re: A useless machine EricP <ThatWouldBeTelling@thevillage.com> - 2026-02-22 14:48 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-22 23:25 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-19 11:51 +0100
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-19 23:12 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-20 15:11 +0100
Re: A useless machine Stefan Monnier <monnier@iro.umontreal.ca> - 2026-02-19 15:27 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-20 11:41 +0200
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-22 18:41 -0600
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-23 11:10 +0000
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-23 17:24 -0600
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-24 06:44 +0000
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 01:38 -0600
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 03:16 -0600
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 04:52 -0600
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-24 11:16 +0000
Re: A useless machine BGB <cr88192@gmail.com> - 2026-02-24 13:19 -0600
Re: A useless machine Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-03-12 21:14 -0700
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-20 08:01 +0000
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-15 15:35 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-15 18:33 +0200
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 19:19 +0000
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 12:31 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-17 15:41 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-17 18:17 +0200
Re: A useless machine anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-02-15 22:41 +0000
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 14:52 +0100
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 16:45 +0000
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 18:01 +0100
Re: A useless machine Thomas Koenig <tkoenig@netcologne.de> - 2026-02-15 19:42 +0000
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-15 22:59 +0100
Re: A useless machine Robert Finch <robfi680@gmail.com> - 2026-02-15 21:53 -0500
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 12:40 +0200
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 20:34 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-16 16:08 +0100
Re: A useless machine Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-02-16 07:50 -0800
Re: A useless machine Michael S <already5chosen@yahoo.com> - 2026-02-16 18:31 +0200
Re: A useless machine Terje Mathisen <terje.mathisen@tmsw.no> - 2026-02-18 21:56 +0100
Page 2 of 5 — ← Prev page 1 [2] 3 4 5 Next page →
| From | Stefan Monnier <monnier@iro.umontreal.ca> |
|---|---|
| Date | 2026-02-22 11:41 -0500 |
| Message-ID | <jwvqzqcd98o.fsf-monnier+comp.arch@gnu.org> |
| In reply to | #114970 |
> It is a mathematically very interesting problem. Whether there is anything > to be gained by continuing the search for exceptions by brute force, is > a different matter - but /if/ it is worth doing, then doing so more > efficiently is a good idea. AFAIK, in such games "more efficiently" tends to me largely irrelevant if it's more efficient by a (non-astronomical) constant factor. Instead, you need it to be algorithmically more efficient. === Stefan
[toc] | [prev] | [next] | [standalone]
| From | David Brown <david.brown@hesbynett.no> |
|---|---|
| Date | 2026-02-23 10:53 +0100 |
| Message-ID | <10nh82q$2sbtq$2@dont-email.me> |
| In reply to | #115094 |
On 22/02/2026 17:41, Stefan Monnier wrote: >> It is a mathematically very interesting problem. Whether there is anything >> to be gained by continuing the search for exceptions by brute force, is >> a different matter - but /if/ it is worth doing, then doing so more >> efficiently is a good idea. > > AFAIK, in such games "more efficiently" tends to me largely irrelevant > if it's more efficient by a (non-astronomical) constant factor. > Instead, you need it to be algorithmically more efficient. > Sure - algorithmic improvements usually (but not always in practice) lead to the biggest efficiency gains. But if you are doing a "test every number up to N" check of a conjecture, /any/ efficiency improvements are helpful.
[toc] | [prev] | [next] | [standalone]
| From | Stefan Monnier <monnier@iro.umontreal.ca> |
|---|---|
| Date | 2026-02-23 10:52 -0500 |
| Message-ID | <jwv4in7a27d.fsf-monnier+comp.arch@gnu.org> |
| In reply to | #115110 |
David Brown [2026-02-23 10:53:30] wrote: > Sure - algorithmic improvements usually (but not always in practice) > lead to the biggest efficiency gains. But if you are doing a "test > every number up to N" check of a conjecture, /any/ efficiency > improvements are helpful. But once you get to 2^70, the exponential explosion means that constant-factor improvements are vanishingly helpful. === Stefan
[toc] | [prev] | [next] | [standalone]
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2026-02-14 21:19 -0800 |
| Message-ID | <867bseh9c6.fsf@linuxsc.com> |
| In reply to | #114964 |
Thomas Koenig <tkoenig@netcologne.de> writes: > Some people are throwing large amounts of computing power at trying > to disprove the Collatz conjecture. Graphics cards have sped this > up enormously. > > (A reminder, if any is needed: From a starting number n, the > transformation, recursively applied, > > > f(n) = 3*n+1, if n is odd; =n/2 if n is even. > > is conjectured to always reach 1 for any positive integer 1). > > All the work is done on general-purpose hardware, which has many > capabilities that are not needed, and consume area and power that > special-purpose hardware would not need. Also, the hardware > is limited to 64-bit integers, and the range of tested numbers > is now up to 2^71, so > > What could specialized hardware look like? > > The central part could be a unit which is handed a block of numbers > to check, for example a range of 2^32 numbers. The current records > for number of iterations and maximum number reached would also be > stored in that unit. > > In that unit, the calculaiton would be performed by a combination > of adder and shifter. The input would always be odd (so the last > bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1). > > The adder would compute (3*n+1)/2, and would be specialzed 128-bit > adder. Why specialized? The logic is far less complex than a > general-purpose adder. For example, a two-bit adder has the formula > > y1 = (!a1&a0) | (!c_in&a1&!a0); > y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0); > > and its propagate and generate bits are > > p = (a1); > g = (a1&a0); > > The maximum number of trailing zeros after the (3*n+1)/2 step is > three and can be determined cheaply from bits <3:1> of the input, > so the input to the shifter (two levels only) can be provided > in parallel. The machine would be designed that adding and > shifting can be done in a single cycle. > > The number of iterations would be tracked; if larger than a prevous > record, the calculation would terminate with a corresponding signal > to the outside world. > > Also, each cycle, the result from the previous round would be > compared against the input value. If it is lower, the next number > would be chosen. Finally, the number reached would be checked against > the previous, pre-set maximum, with would also signal to the outside > world. > > Which numbers to chose? It is well known that only certain > numbers with remainders modulo 2^n are interesting. The number > can be seen on OEIS A076227. For example, modulo 2^10, only 64 > numbers are "interesting", or 6.25% (but only eight bits would > be needed to store them because the last two are always one). > Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the > amount of numbers to be checked by half. One would use a small ROM, > I presume. The number of bits would be a tradeoff between area and > amount of compuation. My current guess would be that 2^10 could > be a sweet spot. One third of the numbers would further be > excluded by checking for divisiblity by three. > > The problem is embarassingly parallel. One could chose a size so > each parcel is calculated in a few seconds or even minutes. to avoid > having to avoid a large amount of communication with the outside. > > Because all units would be working all the time, and if one has > a smaller unit one could simply build more of them, I suspect that > power would be the main limiting factor. An adder design that > might not be very fast, but uses lower power, low voltage and low > frequency would be beneficial. > > Comments? Amusing. I predict that ultimately it will be a waste of effort because the Collatz conjecture is true even though currently unproven. Also it seems likely that no mathematical insights will result from any sort of brute force approach to disprove it, even if the "brute force" scheme is selective. If anyone is interested in looking into questions regarding the Collatz conjecture, I suggest considering a generalization f[k](n) = n/2 if n is even, 3n+k if n is odd, for an fixed odd k Empirically you should quickly find There is always a cycle starting at k. (Duh!) There is only one cycle if k is a power of 3. (This result follows from the Collatz conjecture.) If k is not a power of 3, there are always at least two cycles, and can be more.
[toc] | [prev] | [next] | [standalone]
| From | Robert Finch <robfi680@gmail.com> |
|---|---|
| Date | 2026-02-15 03:58 -0500 |
| Message-ID | <10ms1rm$3s9f4$1@dont-email.me> |
| In reply to | #114964 |
On 2026-02-14 4:57 a.m., Thomas Koenig wrote: > Some people are throwing large amounts of computing power at trying > to disprove the Collatz conjecture. Graphics cards have sped this > up enormously. > > (A reminder, if any is needed: From a starting number n, the > transformation, recursively applied, > > > f(n) = 3*n+1, if n is odd; =n/2 if n is even. > > is conjectured to always reach 1 for any positive integer 1). > > All the work is done on general-purpose hardware, which has many > capabilities that are not needed, and consume area and power that > special-purpose hardware would not need. Also, the hardware > is limited to 64-bit integers, and the range of tested numbers > is now up to 2^71, so > > What could specialized hardware look like? > > The central part could be a unit which is handed a block of numbers > to check, for example a range of 2^32 numbers. The current records > for number of iterations and maximum number reached would also be > stored in that unit. > > In that unit, the calculaiton would be performed by a combination > of adder and shifter. The input would always be odd (so the last > bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1). > > The adder would compute (3*n+1)/2, and would be specialzed 128-bit > adder. Why specialized? The logic is far less complex than a > general-purpose adder. For example, a two-bit adder has the formula > > y1 = (!a1&a0) | (!c_in&a1&!a0); > y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0); > > and its propagate and generate bits are > > p = (a1); > g = (a1&a0); > > The maximum number of trailing zeros after the (3*n+1)/2 step is > three and can be determined cheaply from bits <3:1> of the input, > so the input to the shifter (two levels only) can be provided > in parallel. The machine would be designed that adding and > shifting can be done in a single cycle. > > The number of iterations would be tracked; if larger than a prevous > record, the calculation would terminate with a corresponding signal > to the outside world. > > Also, each cycle, the result from the previous round would be > compared against the input value. If it is lower, the next number > would be chosen. Finally, the number reached would be checked against > the previous, pre-set maximum, with would also signal to the outside > world. > > Which numbers to chose? It is well known that only certain > numbers with remainders modulo 2^n are interesting. The number > can be seen on OEIS A076227. For example, modulo 2^10, only 64 > numbers are "interesting", or 6.25% (but only eight bits would > be needed to store them because the last two are always one). > Modulo 2^16 would be 734 numbers, or 3.22%, so this would cut the > amount of numbers to be checked by half. One would use a small ROM, > I presume. The number of bits would be a tradeoff between area and > amount of compuation. My current guess would be that 2^10 could > be a sweet spot. One third of the numbers would further be > excluded by checking for divisiblity by three. > > The problem is embarassingly parallel. One could chose a size so > each parcel is calculated in a few seconds or even minutes. to avoid > having to avoid a large amount of communication with the outside. > > Because all units would be working all the time, and if one has > a smaller unit one could simply build more of them, I suspect that > power would be the main limiting factor. An adder design that > might not be very fast, but uses lower power, low voltage and low > frequency would be beneficial. > > Comments? > I wonder how difficult it would be to implement a parallel tester in an FPGA. It looks simple enough and there are hundreds of DSP blocks available to use. Could test in blocks of hundreds of numbers at a time. Running at 100 MHz * 200 tests at a time = how long to get to 2^71? I wonder if the calculation can be broken down further. Numbers are all just defined by rules. I wonder if some calculus may apply. Limit of function approaches one for a finite number of steps. There may be a whole range of algorithms, is there one for the number two? Then the number three? Occasionally I have made useless machinery. Great puzzle work. Made the game of life in hardware.
[toc] | [prev] | [next] | [standalone]
| From | David Brown <david.brown@hesbynett.no> |
|---|---|
| Date | 2026-02-15 13:14 +0100 |
| Message-ID | <10msdau$3vgrg$2@dont-email.me> |
| In reply to | #114982 |
On 15/02/2026 09:58, Robert Finch wrote: > I wonder how difficult it would be to implement a parallel tester in an > FPGA. It looks simple enough and there are hundreds of DSP blocks > available to use. Could test in blocks of hundreds of numbers at a time. > Running at 100 MHz * 200 tests at a time = how long to get to 2^71? > One of the nice things about this is that you don't need DSP blocks - it all works well with very simple operations well-suited to an FPGA. Your calculation blocks would hold an n-bit number (wider than the number you are testing for, since rounds of the Collatz function can grow a lot) and a counter. Initialise the block to the number you want to test. For each round, check the LSB. If it is 0, right-shift your register. If it is even, replace "x" with "(x << 1) + x". That's just some simple logic - no DSP blocks involved. If your register or your counter overflow, you've found an "interesting" number and can pass that out to a PC to test it fully. There's lots of scope for both high-level work (the housekeeping and tracking of operations for each of your worker elements) and low-level work (getting the carry operations to work at high speed). > I wonder if the calculation can be broken down further. Numbers are all > just defined by rules. I wonder if some calculus may apply. Limit of > function approaches one for a finite number of steps. There may be a > whole range of algorithms, is there one for the number two? Then the > number three? > > Occasionally I have made useless machinery. Great puzzle work. Made the > game of life in hardware. >
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2026-02-15 15:46 +0000 |
| Message-ID | <10mspoj$4cqn$1@dont-email.me> |
| In reply to | #114985 |
David Brown <david.brown@hesbynett.no> schrieb: > On 15/02/2026 09:58, Robert Finch wrote: > >> I wonder how difficult it would be to implement a parallel tester in an >> FPGA. It looks simple enough and there are hundreds of DSP blocks >> available to use. Could test in blocks of hundreds of numbers at a time. >> Running at 100 MHz * 200 tests at a time = how long to get to 2^71? >> > > One of the nice things about this is that you don't need DSP blocks - it > all works well with very simple operations well-suited to an FPGA. Your > calculation blocks would hold an n-bit number (wider than the number you > are testing for, since rounds of the Collatz function can grow a lot) > and a counter. Initialise the block to the number you want to test. > For each round, check the LSB. If it is 0, right-shift your register. > If it is even, replace "x" with "(x << 1) + x". That's just some simple > logic - no DSP blocks involved. If your register or your counter > overflow, you've found an "interesting" number and can pass that out to > a PC to test it fully. It is even simpler than that - using the optimization of always calculating (3*n+1)/2 is ideal for look-up tables. If you have six-bit LUTs, you divide your number into five-bit blocks and set up your LUTs for generating (3*n+1+carry_in)/2 - five LUTs for five result bits. You then also need two LUTs for generating the propagate and generate bits for the five-bit groups. You can then generate group generate and propagate bits for three five-bit groups with two LUTs (15 bits) and so on. Three levels of P and G would give you 135 bits, more than enough. The comparators could be more expensive; if this is done with LUTs, you could only compare three bits each. Maybe this is better with the DSP blocks, I don't know. > There's lots of scope for both high-level work (the housekeeping and > tracking of operations for each of your worker elements) and low-level > work (getting the carry operations to work at high speed). Yep. -- This USENET posting was made without artificial intelligence, artificial impertinence, artificial arrogance, artificial stupidity, artificial flavorings or artificial colorants.
[toc] | [prev] | [next] | [standalone]
| From | Robert Finch <robfi680@gmail.com> |
|---|---|
| Date | 2026-02-15 10:56 -0500 |
| Message-ID | <10msqc3$4011$1@dont-email.me> |
| In reply to | #114985 |
On 2026-02-15 7:14 a.m., David Brown wrote: > On 15/02/2026 09:58, Robert Finch wrote: > >> I wonder how difficult it would be to implement a parallel tester in >> an FPGA. It looks simple enough and there are hundreds of DSP blocks >> available to use. Could test in blocks of hundreds of numbers at a time. >> Running at 100 MHz * 200 tests at a time = how long to get to 2^71? >> > > One of the nice things about this is that you don't need DSP blocks - it Yeah, I figured that out after coding. 3n+1 = n+n+n+1 > all works well with very simple operations well-suited to an FPGA. Your > calculation blocks would hold an n-bit number (wider than the number you > are testing for, since rounds of the Collatz function can grow a lot) > and a counter. Initialise the block to the number you want to test. For > each round, check the LSB. If it is 0, right-shift your register. If it > is even, replace "x" with "(x << 1) + x". That's just some simple logic Oops. I thought that was x = (x >> 1) if it is even. I coded it stupidly. It may not matter in FPGA logic. All the bits goo into LUTs and it get figured out. > - no DSP blocks involved. If your register or your counter overflow, > you've found an "interesting" number and can pass that out to a PC to > test it fully. > I have coded a version that tests up to 8192 number simultaneously. It is supposed to be able to run at over 200 MHz. Some of the number would take a large number of iterations though. I found something strange, the code to count for more numbers does not increase the size of the core very much. (But it takes longer to build.) I am not sure if this occurs because there is a mitsake somewhere, or if the tools are able to figure out how to share all the logic. I am thinking the logic for adders is shared somehow. 128-bit arithmetic is being used. > There's lots of scope for both high-level work (the housekeeping and > tracking of operations for each of your worker elements) and low-level > work (getting the carry operations to work at high speed). > Yeah, I am wondering how to display results without using a CPU. I was thinking a running indicator of the current number group under test. I was going to have it stop after the number exceeds 128-bits. >> I wonder if the calculation can be broken down further. Numbers are >> all just defined by rules. I wonder if some calculus may apply. Limit >> of function approaches one for a finite number of steps. There may be >> a whole range of algorithms, is there one for the number two? Then the >> number three? >> >> Occasionally I have made useless machinery. Great puzzle work. Made >> the game of life in hardware. >> >
[toc] | [prev] | [next] | [standalone]
| From | MitchAlsup <user5857@newsgrouper.org.invalid> |
|---|---|
| Date | 2026-02-15 19:16 +0000 |
| Message-ID | <1771182989-5857@newsgrouper.org> |
| In reply to | #114993 |
Robert Finch <robfi680@gmail.com> posted: > On 2026-02-15 7:14 a.m., David Brown wrote: > > On 15/02/2026 09:58, Robert Finch wrote: > > > >> I wonder how difficult it would be to implement a parallel tester in > >> an FPGA. It looks simple enough and there are hundreds of DSP blocks > >> available to use. Could test in blocks of hundreds of numbers at a time. > >> Running at 100 MHz * 200 tests at a time = how long to get to 2^71? > >> > > > > One of the nice things about this is that you don't need DSP blocks - it > Yeah, I figured that out after coding. 3n+1 = n+n+n+1 3n+1 = (n<<1)+n+1 one add with carry in and n gets wired in directly and also shifted up by 1 (no logic shifter).
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <terje.mathisen@tmsw.no> |
|---|---|
| Date | 2026-02-15 22:40 +0100 |
| Message-ID | <10mteg5$c68b$1@dont-email.me> |
| In reply to | #114998 |
MitchAlsup wrote: > > Robert Finch <robfi680@gmail.com> posted: > >> On 2026-02-15 7:14 a.m., David Brown wrote: >>> On 15/02/2026 09:58, Robert Finch wrote: >>> >>>> I wonder how difficult it would be to implement a parallel tester in >>>> an FPGA. It looks simple enough and there are hundreds of DSP blocks >>>> available to use. Could test in blocks of hundreds of numbers at a time. >>>> Running at 100 MHz * 200 tests at a time = how long to get to 2^71? >>>> >>> >>> One of the nice things about this is that you don't need DSP blocks - it >> Yeah, I figured that out after coding. 3n+1 = n+n+n+1 > > 3n+1 = (n<<1)+n+1 > > one add with carry in and n gets wired in directly and also shifted > up by 1 (no logic shifter). Even better: (3n+1)/2 = n + (n>>1) + 1 I.e two iterations for the cost of one. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2026-02-26 19:31 +0000 |
| Message-ID | <10nq728$1tcqi$1@dont-email.me> |
| In reply to | #115001 |
Terje Mathisen <terje.mathisen@tmsw.no> schrieb: > MitchAlsup wrote: >> >> Robert Finch <robfi680@gmail.com> posted: >> >>> On 2026-02-15 7:14 a.m., David Brown wrote: >>>> On 15/02/2026 09:58, Robert Finch wrote: >>>> >>>>> I wonder how difficult it would be to implement a parallel tester in >>>>> an FPGA. It looks simple enough and there are hundreds of DSP blocks >>>>> available to use. Could test in blocks of hundreds of numbers at a time. >>>>> Running at 100 MHz * 200 tests at a time = how long to get to 2^71? >>>>> >>>> >>>> One of the nice things about this is that you don't need DSP blocks - it >>> Yeah, I figured that out after coding. 3n+1 = n+n+n+1 >> >> 3n+1 = (n<<1)+n+1 >> >> one add with carry in and n gets wired in directly and also shifted >> up by 1 (no logic shifter). > > Even better: > > (3n+1)/2 = n + (n>>1) + 1 > > I.e two iterations for the cost of one. For b = n + (n>>1) + 1, for odd n, a bit of a micro-optimization is possible because lowest-order bit of the result equals bit number 1 from n. -- This USENET posting was made without artificial intelligence, artificial impertinence, artificial arrogance, artificial stupidity, artificial flavorings or artificial colorants.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-18 18:22 +0200 |
| Message-ID | <20260218182251.00004e44@yahoo.com> |
| In reply to | #114993 |
On Sun, 15 Feb 2026 10:56:51 -0500 Robert Finch <robfi680@gmail.com> wrote: > I have coded a version that tests up to 8192 number simultaneously. > It is supposed to be able to run at over 200 MHz. Some of the number > would take a large number of iterations though. > > I found something strange, the code to count for more numbers does > not increase the size of the core very much. (But it takes longer to > build.) I am not sure if this occurs because there is a mitsake > somewhere, or if the tools are able to figure out how to share all > the logic. I am thinking the logic for adders is shared somehow. > 128-bit arithmetic is being used. > There is a mistake in your code. No other explanation is possible. You should look for 3 sorts of mistakes 1. Core itself. The most likely mistake is failing to look for one of the ending conditions. You have to look for 3 conditions: a) - value produced at current step is smaller than initial value. That is successful ending. Here you signal to the scheduler that you are ready to accept the next value. b) - overflow. Next value exceeds the width of register. c) - timeout. You were running for more than N steps. N=8192 looks like the reasonable value. a) and b) are suspected failures. The handling for b) and c) is similar - you raise alert signal and stop. Your test unit has to provide supervisor circuity with the mean to fetch the initial value that caused suspected failure and to clear alert condition. After alert is cleared your unit returns to the same state as after success - waiting for next initial value. 2. Scheduler. In proof-of-concept design the scheduler does not have to be very realistic. For example, it does not have to be efficiently pipelined. But it should be able to monitor ready signals of all test units and to supply different initial data to different units. Supplying the same data to all units simultaneously is a sort of mistake that explains your observed behavior. 3. Supervisor Again, in proof-of-concept design the supervisor does not have to be very realistic. But it should be able to selectively exercise all core feature related to alert, including fetching of initial data and clearing of alert. After you coded everything correctly, do not even try to compile test with 8000 units. That is totally unrealistic. Even if you find huge FPGA in which such number of testers could fit (and I am not sure that such FPGA exists) the compilation will take many hours, possibly days.
[toc] | [prev] | [next] | [standalone]
| From | Robert Finch <robfi680@gmail.com> |
|---|---|
| Date | 2026-02-18 22:56 -0500 |
| Message-ID | <10n61l5$36tk6$1@dont-email.me> |
| In reply to | #115034 |
On 2026-02-18 11:22 a.m., Michael S wrote: > On Sun, 15 Feb 2026 10:56:51 -0500 > Robert Finch <robfi680@gmail.com> wrote: > >> I have coded a version that tests up to 8192 number simultaneously. >> It is supposed to be able to run at over 200 MHz. Some of the number >> would take a large number of iterations though. >> >> I found something strange, the code to count for more numbers does >> not increase the size of the core very much. (But it takes longer to >> build.) I am not sure if this occurs because there is a mitsake >> somewhere, or if the tools are able to figure out how to share all >> the logic. I am thinking the logic for adders is shared somehow. >> 128-bit arithmetic is being used. >> > > There is a mistake in your code. No other explanation is possible. > > You should look for 3 sorts of mistakes > 1. Core itself. > The most likely mistake is failing to look for one of the ending > conditions. You have to look for 3 conditions: > a) - value produced at current step is smaller than initial value. > That is successful ending. Here you signal to the scheduler that you > are ready to accept the next value. > b) - overflow. Next value exceeds the width of register. > c) - timeout. You were running for more than N steps. N=8192 looks > like the reasonable value. > a) and b) are suspected failures. > The handling for b) and c) is similar - you raise alert signal and stop. > Your test unit has to provide supervisor circuity with the mean to > fetch the initial value that caused suspected failure and to clear > alert condition. > After alert is cleared your unit returns to the same state as after > success - waiting for next initial value. I found the mistake. Core was working, but groups of cores were not. Done signals were missing. > > 2. Scheduler. > In proof-of-concept design the scheduler does not have to be very > realistic. For example, it does not have to be efficiently pipelined. > But it should be able to monitor ready signals of all test units and to > supply different initial data to different units. > Supplying the same data to all units simultaneously is a sort of > mistake that explains your observed behavior. > There is really simple scheduling ATM. There is just a bunch of done signals that are gated together. Meaning the slowest calculation of the group will slow everything else down. Once all the dones are true the next set of numbers are loaded. > 3. Supervisor > Again, in proof-of-concept design the supervisor does not have to be > very realistic. But it should be able to selectively exercise all core > feature related to alert, including fetching of initial data and > clearing of alert. > > After you coded everything correctly, do not even try to compile test > with 8000 units. That is totally unrealistic. Even if you find huge > FPGA in which such number of testers could fit (and I am not sure that > such FPGA exists) the compilation will take many hours, possibly days. > > > Yeah 8,000 was definitely out of the picture. I had since coded 24x24 matrix (576) and although the logic would fit that turned out not to be able to route. So, it is down to 20x20. 20x20 with no real scheduling fits. I tried displaying progress as an on-screen bitmap, but that just appeared noisy. I have been working on an SCP so may be able to use that to manage the scheduling with some custom logic.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-19 12:10 +0200 |
| Message-ID | <20260219121041.0000505b@yahoo.com> |
| In reply to | #115038 |
On Wed, 18 Feb 2026 22:56:19 -0500 Robert Finch <robfi680@gmail.com> wrote: > On 2026-02-18 11:22 a.m., Michael S wrote: > > On Sun, 15 Feb 2026 10:56:51 -0500 > > Robert Finch <robfi680@gmail.com> wrote: > > > >> I have coded a version that tests up to 8192 number simultaneously. > >> It is supposed to be able to run at over 200 MHz. Some of the > >> number would take a large number of iterations though. > >> > >> I found something strange, the code to count for more numbers does > >> not increase the size of the core very much. (But it takes longer > >> to build.) I am not sure if this occurs because there is a mitsake > >> somewhere, or if the tools are able to figure out how to share all > >> the logic. I am thinking the logic for adders is shared somehow. > >> 128-bit arithmetic is being used. > >> > > > > There is a mistake in your code. No other explanation is possible. > > > > You should look for 3 sorts of mistakes > > 1. Core itself. > > The most likely mistake is failing to look for one of the ending > > conditions. You have to look for 3 conditions: > > a) - value produced at current step is smaller than initial value. > > That is successful ending. Here you signal to the scheduler that > > you are ready to accept the next value. > > b) - overflow. Next value exceeds the width of register. > > c) - timeout. You were running for more than N steps. N=8192 looks > > like the reasonable value. > > a) and b) are suspected failures. > > The handling for b) and c) is similar - you raise alert signal and > > stop. Your test unit has to provide supervisor circuity with the > > mean to fetch the initial value that caused suspected failure and > > to clear alert condition. > > After alert is cleared your unit returns to the same state as after > > success - waiting for next initial value. > > I found the mistake. Core was working, but groups of cores were not. > Done signals were missing. > > > > > 2. Scheduler. > > In proof-of-concept design the scheduler does not have to be very > > realistic. For example, it does not have to be efficiently > > pipelined. But it should be able to monitor ready signals of all > > test units and to supply different initial data to different units. > > Supplying the same data to all units simultaneously is a sort of > > mistake that explains your observed behavior. > > > There is really simple scheduling ATM. There is just a bunch of done > signals that are gated together. Meaning the slowest calculation of > the group will slow everything else down. Once all the dones are true > the next set of numbers are loaded. > > > 3. Supervisor > > Again, in proof-of-concept design the supervisor does not have to be > > very realistic. But it should be able to selectively exercise all > > core feature related to alert, including fetching of initial data > > and clearing of alert. > > > > After you coded everything correctly, do not even try to compile > > test with 8000 units. That is totally unrealistic. Even if you find > > huge FPGA in which such number of testers could fit (and I am not > > sure that such FPGA exists) the compilation will take many hours, > > possibly days. > > > > > > > Yeah 8,000 was definitely out of the picture. I had since coded 24x24 > matrix (576) and although the logic would fit that turned out not to > be able to route. So, it is down to 20x20. 20x20 with no real > scheduling fits. > > I tried displaying progress as an on-screen bitmap, but that just > appeared noisy. I have been working on an SCP so may be able to use > that to manage the scheduling with some custom logic. > > > Does 20x20 means 400 cores with something like 80-bit initial values and 128-bit processing? If you didn't make further mistakes then it should be rather big, order of 300K LEs. It would not fit in the biggest low-end Altera device (Cyclone 10GX 10CX220). On the Xilinx side, it would not fit in any Spartan and in most Artix devices. May be, it would fit in the biggest Artix UltraScale+ (AU25P). But more realistically for 400 test units one would want Arria 10 on Altera side or Kintex on Xilinx side. Both are not prohibitively expensive, but not cheap. What device are you using? If it is smaller than Kintex KU5P then I'd strongly suspect that you didn't clean out all mistakes.
[toc] | [prev] | [next] | [standalone]
| From | Robert Finch <robfi680@gmail.com> |
|---|---|
| Date | 2026-02-21 01:56 -0500 |
| Message-ID | <10nbku5$11ko8$1@dont-email.me> |
| In reply to | #115041 |
On 2026-02-19 5:10 a.m., Michael S wrote: > On Wed, 18 Feb 2026 22:56:19 -0500 > Robert Finch <robfi680@gmail.com> wrote: > >> On 2026-02-18 11:22 a.m., Michael S wrote: >>> On Sun, 15 Feb 2026 10:56:51 -0500 >>> Robert Finch <robfi680@gmail.com> wrote: >>> >>>> I have coded a version that tests up to 8192 number simultaneously. >>>> It is supposed to be able to run at over 200 MHz. Some of the >>>> number would take a large number of iterations though. >>>> >>>> I found something strange, the code to count for more numbers does >>>> not increase the size of the core very much. (But it takes longer >>>> to build.) I am not sure if this occurs because there is a mitsake >>>> somewhere, or if the tools are able to figure out how to share all >>>> the logic. I am thinking the logic for adders is shared somehow. >>>> 128-bit arithmetic is being used. >>>> >>> >>> There is a mistake in your code. No other explanation is possible. >>> >>> You should look for 3 sorts of mistakes >>> 1. Core itself. >>> The most likely mistake is failing to look for one of the ending >>> conditions. You have to look for 3 conditions: >>> a) - value produced at current step is smaller than initial value. >>> That is successful ending. Here you signal to the scheduler that >>> you are ready to accept the next value. >>> b) - overflow. Next value exceeds the width of register. >>> c) - timeout. You were running for more than N steps. N=8192 looks >>> like the reasonable value. >>> a) and b) are suspected failures. >>> The handling for b) and c) is similar - you raise alert signal and >>> stop. Your test unit has to provide supervisor circuity with the >>> mean to fetch the initial value that caused suspected failure and >>> to clear alert condition. >>> After alert is cleared your unit returns to the same state as after >>> success - waiting for next initial value. >> >> I found the mistake. Core was working, but groups of cores were not. >> Done signals were missing. >> >>> >>> 2. Scheduler. >>> In proof-of-concept design the scheduler does not have to be very >>> realistic. For example, it does not have to be efficiently >>> pipelined. But it should be able to monitor ready signals of all >>> test units and to supply different initial data to different units. >>> Supplying the same data to all units simultaneously is a sort of >>> mistake that explains your observed behavior. >>> >> There is really simple scheduling ATM. There is just a bunch of done >> signals that are gated together. Meaning the slowest calculation of >> the group will slow everything else down. Once all the dones are true >> the next set of numbers are loaded. >> >>> 3. Supervisor >>> Again, in proof-of-concept design the supervisor does not have to be >>> very realistic. But it should be able to selectively exercise all >>> core feature related to alert, including fetching of initial data >>> and clearing of alert. >>> >>> After you coded everything correctly, do not even try to compile >>> test with 8000 units. That is totally unrealistic. Even if you find >>> huge FPGA in which such number of testers could fit (and I am not >>> sure that such FPGA exists) the compilation will take many hours, >>> possibly days. >>> >>> >>> >> Yeah 8,000 was definitely out of the picture. I had since coded 24x24 >> matrix (576) and although the logic would fit that turned out not to >> be able to route. So, it is down to 20x20. 20x20 with no real >> scheduling fits. >> >> I tried displaying progress as an on-screen bitmap, but that just >> appeared noisy. I have been working on an SCP so may be able to use >> that to manage the scheduling with some custom logic. >> >> >> > > Does 20x20 means 400 cores with something like 80-bit initial values > and 128-bit processing? Yes. 128-bit processing. > If you didn't make further mistakes then it should be rather big, order > of 300K LEs. It would not fit in the biggest low-end Altera device > (Cyclone 10GX 10CX220). IIRC it is about 150k LUTs. It uses the simplest approach, which is not very fast iteration wise, but it allows a lot of cores and a fast clock cycles time. One core is very small and has a fast iteration (1 clock cycle). But it takes a lot more iterations than can be done with a more brainiac approach. Using a more brainiac approach would likely cut the performance in half and use a lot more LUTs. > On the Xilinx side, it would not fit in any Spartan and in most Artix > devices. May be, it would fit in the biggest Artix UltraScale+ (AU25P). > But more realistically for 400 test units one would want Arria 10 on > Altera side or Kintex on Xilinx side. Both are not prohibitively > expensive, but not cheap. > > What device are you using? I am not using an too inexpensive device. I have a Kintex-325 with about 200k LUTs. > If it is smaller than Kintex KU5P then I'd strongly suspect that you > didn't clean out all mistakes. > > > There could be more mistakes for sure. But I am sure the lowest level core works. It seemed to in simulation. I made a slight mod to it since I tested it last, checking n < startn. Counts are not used so the logic would be stripped out, but its good for verification. 1 Core is: module Collatz_conjecture(rst, clk, startn, count, done); input rst; input clk; input [127:0] startn; output reg [127:0] count; output reg done; reg [127:0] n; always_ff @(posedge clk) if (rst) begin n <= startn; count <= 0; done <= FALSE; end else begin if (!done) begin if (~n[0]) n <= n >> 2'd1; else n <= n+n+n+1; if (n < startn) done <= TRUE; count <= count + 1; end end endmodule
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-21 19:23 +0200 |
| Message-ID | <20260221192302.000030d9@yahoo.com> |
| In reply to | #115066 |
On Sat, 21 Feb 2026 01:56:03 -0500 Robert Finch <robfi680@gmail.com> wrote: > On 2026-02-19 5:10 a.m., Michael S wrote: > > On Wed, 18 Feb 2026 22:56:19 -0500 > > Robert Finch <robfi680@gmail.com> wrote: > > > >> On 2026-02-18 11:22 a.m., Michael S wrote: > >>> On Sun, 15 Feb 2026 10:56:51 -0500 > >>> Robert Finch <robfi680@gmail.com> wrote: > >>> > >>>> I have coded a version that tests up to 8192 number > >>>> simultaneously. It is supposed to be able to run at over 200 > >>>> MHz. Some of the number would take a large number of iterations > >>>> though. > >>>> > >>>> I found something strange, the code to count for more numbers > >>>> does not increase the size of the core very much. (But it takes > >>>> longer to build.) I am not sure if this occurs because there is > >>>> a mitsake somewhere, or if the tools are able to figure out how > >>>> to share all the logic. I am thinking the logic for adders is > >>>> shared somehow. 128-bit arithmetic is being used. > >>>> > >>> > >>> There is a mistake in your code. No other explanation is possible. > >>> > >>> You should look for 3 sorts of mistakes > >>> 1. Core itself. > >>> The most likely mistake is failing to look for one of the ending > >>> conditions. You have to look for 3 conditions: > >>> a) - value produced at current step is smaller than initial value. > >>> That is successful ending. Here you signal to the scheduler > >>> that you are ready to accept the next value. > >>> b) - overflow. Next value exceeds the width of register. > >>> c) - timeout. You were running for more than N steps. N=8192 looks > >>> like the reasonable value. > >>> a) and b) are suspected failures. > >>> The handling for b) and c) is similar - you raise alert signal and > >>> stop. Your test unit has to provide supervisor circuity with the > >>> mean to fetch the initial value that caused suspected failure and > >>> to clear alert condition. > >>> After alert is cleared your unit returns to the same state as > >>> after success - waiting for next initial value. > >> > >> I found the mistake. Core was working, but groups of cores were > >> not. Done signals were missing. > >> > >>> > >>> 2. Scheduler. > >>> In proof-of-concept design the scheduler does not have to be very > >>> realistic. For example, it does not have to be efficiently > >>> pipelined. But it should be able to monitor ready signals of all > >>> test units and to supply different initial data to different > >>> units. Supplying the same data to all units simultaneously is a > >>> sort of mistake that explains your observed behavior. > >>> > >> There is really simple scheduling ATM. There is just a bunch of > >> done signals that are gated together. Meaning the slowest > >> calculation of the group will slow everything else down. Once all > >> the dones are true the next set of numbers are loaded. > >> > >>> 3. Supervisor > >>> Again, in proof-of-concept design the supervisor does not have to > >>> be very realistic. But it should be able to selectively exercise > >>> all core feature related to alert, including fetching of initial > >>> data and clearing of alert. > >>> > >>> After you coded everything correctly, do not even try to compile > >>> test with 8000 units. That is totally unrealistic. Even if you > >>> find huge FPGA in which such number of testers could fit (and I > >>> am not sure that such FPGA exists) the compilation will take many > >>> hours, possibly days. > >>> > >>> > >>> > >> Yeah 8,000 was definitely out of the picture. I had since coded > >> 24x24 matrix (576) and although the logic would fit that turned > >> out not to be able to route. So, it is down to 20x20. 20x20 with > >> no real scheduling fits. > >> > >> I tried displaying progress as an on-screen bitmap, but that just > >> appeared noisy. I have been working on an SCP so may be able to use > >> that to manage the scheduling with some custom logic. > >> > >> > >> > > > > Does 20x20 means 400 cores with something like 80-bit initial values > > and 128-bit processing? > Yes. 128-bit processing. > > > If you didn't make further mistakes then it should be rather big, > > order of 300K LEs. It would not fit in the biggest low-end Altera > > device (Cyclone 10GX 10CX220). > > IIRC it is about 150k LUTs. It uses the simplest approach, which is > not very fast iteration wise, but it allows a lot of cores and a fast > clock cycles time. One core is very small and has a fast iteration (1 > clock cycle). But it takes a lot more iterations than can be done > with a more brainiac approach. > > Using a more brainiac approach would likely cut the performance in > half and use a lot more LUTs. > > > On the Xilinx side, it would not fit in any Spartan and in most > > Artix devices. May be, it would fit in the biggest Artix > > UltraScale+ (AU25P). But more realistically for 400 test units one > > would want Arria 10 on Altera side or Kintex on Xilinx side. Both > > are not prohibitively expensive, but not cheap. > > > > What device are you using? > I am not using an too inexpensive device. I have a Kintex-325 with > about 200k LUTs. > > > If it is smaller than Kintex KU5P then I'd strongly suspect that you > > didn't clean out all mistakes. > > > > > > > There could be more mistakes for sure. But I am sure the lowest level > core works. It seemed to in simulation. I made a slight mod to it > since I tested it last, checking n < startn. > Counts are not used so the logic would be stripped out, but its good > for verification. > > 1 Core is: > > module Collatz_conjecture(rst, clk, startn, count, done); > input rst; > input clk; > input [127:0] startn; > output reg [127:0] count; > output reg done; > reg [127:0] n; > > always_ff @(posedge clk) > if (rst) begin > n <= startn; > count <= 0; > done <= FALSE; > end > else begin > if (!done) begin > if (~n[0]) > n <= n >> 2'd1; > else > n <= n+n+n+1; > if (n < startn) > done <= TRUE; > count <= count + 1; > end > end > > endmodule > >
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-21 20:36 +0200 |
| Message-ID | <20260221203651.00000b57@yahoo.com> |
| In reply to | #115066 |
On Sat, 21 Feb 2026 01:56:03 -0500
Robert Finch <robfi680@gmail.com> wrote:
> On 2026-02-19 5:10 a.m., Michael S wrote:
> > On Wed, 18 Feb 2026 22:56:19 -0500
> > Robert Finch <robfi680@gmail.com> wrote:
> >
> >> On 2026-02-18 11:22 a.m., Michael S wrote:
> >>> On Sun, 15 Feb 2026 10:56:51 -0500
> >>> Robert Finch <robfi680@gmail.com> wrote:
> >>>
> >>>> I have coded a version that tests up to 8192 number
> >>>> simultaneously. It is supposed to be able to run at over 200
> >>>> MHz. Some of the number would take a large number of iterations
> >>>> though.
> >>>>
> >>>> I found something strange, the code to count for more numbers
> >>>> does not increase the size of the core very much. (But it takes
> >>>> longer to build.) I am not sure if this occurs because there is
> >>>> a mitsake somewhere, or if the tools are able to figure out how
> >>>> to share all the logic. I am thinking the logic for adders is
> >>>> shared somehow. 128-bit arithmetic is being used.
> >>>>
> >>>
> >>> There is a mistake in your code. No other explanation is possible.
> >>>
> >>> You should look for 3 sorts of mistakes
> >>> 1. Core itself.
> >>> The most likely mistake is failing to look for one of the ending
> >>> conditions. You have to look for 3 conditions:
> >>> a) - value produced at current step is smaller than initial value.
> >>> That is successful ending. Here you signal to the scheduler
> >>> that you are ready to accept the next value.
> >>> b) - overflow. Next value exceeds the width of register.
> >>> c) - timeout. You were running for more than N steps. N=8192 looks
> >>> like the reasonable value.
> >>> a) and b) are suspected failures.
> >>> The handling for b) and c) is similar - you raise alert signal and
> >>> stop. Your test unit has to provide supervisor circuity with the
> >>> mean to fetch the initial value that caused suspected failure and
> >>> to clear alert condition.
> >>> After alert is cleared your unit returns to the same state as
> >>> after success - waiting for next initial value.
> >>
> >> I found the mistake. Core was working, but groups of cores were
> >> not. Done signals were missing.
> >>
> >>>
> >>> 2. Scheduler.
> >>> In proof-of-concept design the scheduler does not have to be very
> >>> realistic. For example, it does not have to be efficiently
> >>> pipelined. But it should be able to monitor ready signals of all
> >>> test units and to supply different initial data to different
> >>> units. Supplying the same data to all units simultaneously is a
> >>> sort of mistake that explains your observed behavior.
> >>>
> >> There is really simple scheduling ATM. There is just a bunch of
> >> done signals that are gated together. Meaning the slowest
> >> calculation of the group will slow everything else down. Once all
> >> the dones are true the next set of numbers are loaded.
> >>
> >>> 3. Supervisor
> >>> Again, in proof-of-concept design the supervisor does not have to
> >>> be very realistic. But it should be able to selectively exercise
> >>> all core feature related to alert, including fetching of initial
> >>> data and clearing of alert.
> >>>
> >>> After you coded everything correctly, do not even try to compile
> >>> test with 8000 units. That is totally unrealistic. Even if you
> >>> find huge FPGA in which such number of testers could fit (and I
> >>> am not sure that such FPGA exists) the compilation will take many
> >>> hours, possibly days.
> >>>
> >>>
> >>>
> >> Yeah 8,000 was definitely out of the picture. I had since coded
> >> 24x24 matrix (576) and although the logic would fit that turned
> >> out not to be able to route. So, it is down to 20x20. 20x20 with
> >> no real scheduling fits.
> >>
> >> I tried displaying progress as an on-screen bitmap, but that just
> >> appeared noisy. I have been working on an SCP so may be able to use
> >> that to manage the scheduling with some custom logic.
> >>
> >>
> >>
> >
> > Does 20x20 means 400 cores with something like 80-bit initial values
> > and 128-bit processing?
> Yes. 128-bit processing.
>
> > If you didn't make further mistakes then it should be rather big,
> > order of 300K LEs. It would not fit in the biggest low-end Altera
> > device (Cyclone 10GX 10CX220).
>
> IIRC it is about 150k LUTs. It uses the simplest approach, which is
> not very fast iteration wise, but it allows a lot of cores and a fast
> clock cycles time. One core is very small and has a fast iteration (1
> clock cycle). But it takes a lot more iterations than can be done
> with a more brainiac approach.
>
> Using a more brainiac approach would likely cut the performance in
> half and use a lot more LUTs.
>
According to my experiments, combining 2 steps have very small negative
effect on achivable clock and increases area by ~1.8x. So, to me it
looks like a win.
That's the combined step I am talking about:
x = n/4;
switch (x % 4) {
case 0: n = x; break;
case 1: n = 3*x+1; break;
case 2: n = 2*x+2; break;
case 3: n = 9*x+8; break;
}
> > On the Xilinx side, it would not fit in any Spartan and in most
> > Artix devices. May be, it would fit in the biggest Artix
> > UltraScale+ (AU25P). But more realistically for 400 test units one
> > would want Arria 10 on Altera side or Kintex on Xilinx side. Both
> > are not prohibitively expensive, but not cheap.
> >
> > What device are you using?
> I am not using an too inexpensive device. I have a Kintex-325 with
> about 200k LUTs.
>
According to Xilinx marketing materials XC7K325T has 326080 equivalents
of traditional LEs.
That explains why your design fits.
Pay attention that while Kintex evaluation boards could be
non-expensive, right now buying devices for actual production work is
rather costly, especially in small quantities.
The lowest price I see on digikey site is 1147 USD. That's a bad value
relatively to many other FPGAs.
> > If it is smaller than Kintex KU5P then I'd strongly suspect that you
> > didn't clean out all mistakes.
> >
> >
> >
> There could be more mistakes for sure. But I am sure the lowest level
> core works. It seemed to in simulation. I made a slight mod to it
> since I tested it last, checking n < startn.
> Counts are not used so the logic would be stripped out, but its good
> for verification.
>
> 1 Core is:
>
> module Collatz_conjecture(rst, clk, startn, count, done);
> input rst;
> input clk;
> input [127:0] startn;
> output reg [127:0] count;
> output reg done;
> reg [127:0] n;
>
> always_ff @(posedge clk)
> if (rst) begin
> n <= startn;
> count <= 0;
> done <= FALSE;
> end
> else begin
> if (!done) begin
> if (~n[0])
> n <= n >> 2'd1;
> else
> n <= n+n+n+1;
> if (n < startn)
> done <= TRUE;
> count <= count + 1;
> end
> end
>
> endmodule
>
Ok.
1. Your core misses one critical part - overflow check.
Assuming 80-bit initial count, 128-bit register will overflow
eventually. It would happen very rarely, but it would happen.
The check for overflow does not have to be precise. Conservative
approach, like looking at 2 MS bits and one LS bit and signaling
overflow not when it really happened but when it *could* happen, i.e.
when MS bits > 0 and LS bit='1' is good enough. But the check can't be
omitted.
2. I see that instead of calculating n = (n*3+1)/2 as one step you are
doing it in two separate steps. I am not sure that it is really makes
your core smaller or faster. And it certainly increases # of iterations
by 1/3rd.
3.
128-bit count is pointless loss of resources. 12 or 13 bit are
sufficient. If the count exceeded that size then you already found
something of high interest and it is a job of supervisor to find out
what exactly had happened.
128-bit counter actually not just too big, it is also pointless,
because if we pretend to believe that our purpose is to disprove
Collatz's conjecture then supervisor would have to implement stop
condition with some sort of timeout that will effectively duplicate
functionality of the counter.
4.
Your module does not latch the value of startn. It means that caller
module has to do it on it's behalf. That is the most likely place where
one can make a mistake.
5.
Coding style comment. Normally, name rst is used for asynchronous
reset signal. In your design rst is a synchronous signal. For the
benefit of casual reader I suggest to find different name.
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-22 10:46 +0200 |
| Message-ID | <20260222104600.0000372b@yahoo.com> |
| In reply to | #115070 |
On Sat, 21 Feb 2026 20:36:51 +0200
Michael S <already5chosen@yahoo.com> wrote:
> >
> > Using a more brainiac approach would likely cut the performance in
> > half and use a lot more LUTs.
> >
>
> According to my experiments, combining 2 steps have very small
> negative effect on achivable clock and increases area by ~1.8x. So,
> to me it looks like a win.
> That's the combined step I am talking about:
> x = n/4;
> switch (x % 4) {
> case 0: n = x; break;
> case 1: n = 3*x+1; break;
> case 2: n = 2*x+2; break;
> case 3: n = 9*x+8; break;
> }
>
>
>
Mistake in C above. Should be
switch (n % 4) {
[toc] | [prev] | [next] | [standalone]
| From | Terje Mathisen <terje.mathisen@tmsw.no> |
|---|---|
| Date | 2026-02-22 12:45 +0100 |
| Message-ID | <10neq97$22qd5$1@dont-email.me> |
| In reply to | #115084 |
Michael S wrote:
> On Sat, 21 Feb 2026 20:36:51 +0200
> Michael S <already5chosen@yahoo.com> wrote:
>
>>>
>>> Using a more brainiac approach would likely cut the performance in
>>> half and use a lot more LUTs.
>>>
>>
>> According to my experiments, combining 2 steps have very small
>> negative effect on achivable clock and increases area by ~1.8x. So,
>> to me it looks like a win.
>> That's the combined step I am talking about:
>> x = n/4;
>> switch (x % 4) {
>> case 0: n = x; break;
>> case 1: n = 3*x+1; break;
>> case 2: n = 2*x+2; break;
>> case 3: n = 9*x+8; break;
>> }
>>
>>
>>
>
> Mistake in C above. Should be
> switch (n % 4) {
I did notice that. :-)
The one thing that worries me (sw on a 64-bit platform) about the code
is the 9* on 128-bit variables:
9*x+8 =>
Do we use SHLD + SHL here or something else?
How about MUL & LEA?
;; Input in r10:r9, output in rdx:rax
mov rax,r9
mul rax,rbx ;; RBX == 9
lea r10,[r10+r10*8]
add rdx,r10
That looks like 5-6 clock cycles, so the branch misses from the switch
statement would probably dominate unless you do as I suggested and use
lookup tables instead:
let bot2 = n & 3;
let x = n >> 2;
n = x*multab[bot2] + addtab[bot2];
but if we do that, then (at least for a sw implementation) it would be
better to pick a lot more of the LS bits, at least 8-12?
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
[toc] | [prev] | [next] | [standalone]
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2026-02-22 16:35 +0200 |
| Message-ID | <20260222163502.000026d4@yahoo.com> |
| In reply to | #115087 |
On Sun, 22 Feb 2026 12:45:44 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
> Michael S wrote:
> > On Sat, 21 Feb 2026 20:36:51 +0200
> > Michael S <already5chosen@yahoo.com> wrote:
> >
> >>>
> >>> Using a more brainiac approach would likely cut the performance in
> >>> half and use a lot more LUTs.
> >>>
> >>
> >> According to my experiments, combining 2 steps have very small
> >> negative effect on achivable clock and increases area by ~1.8x. So,
> >> to me it looks like a win.
> >> That's the combined step I am talking about:
> >> x = n/4;
> >> switch (x % 4) {
> >> case 0: n = x; break;
> >> case 1: n = 3*x+1; break;
> >> case 2: n = 2*x+2; break;
> >> case 3: n = 9*x+8; break;
> >> }
> >>
> >>
> >>
> >
> > Mistake in C above. Should be
> > switch (n % 4) {
>
> I did notice that. :-)
>
> The one thing that worries me (sw on a 64-bit platform) about the
> code is the 9* on 128-bit variables:
>
> 9*x+8 =>
>
> Do we use SHLD + SHL here or something else?
>
> How about MUL & LEA?
>
> ;; Input in r10:r9, output in rdx:rax
> mov rax,r9
> mul rax,rbx ;; RBX == 9
> lea r10,[r10+r10*8]
> add rdx,r10
>
> That looks like 5-6 clock cycles, so the branch misses from the
> switch statement would probably dominate unless you do as I suggested
> and use lookup tables instead:
>
> let bot2 = n & 3;
> let x = n >> 2;
> n = x*multab[bot2] + addtab[bot2];
>
> but if we do that, then (at least for a sw implementation) it would
> be better to pick a lot more of the LS bits, at least 8-12?
>
> Terje
>
This variant is not intended for software implementation.
For software it is obviously sub-optimal for the reasons you suggested.
OTOH, for FPGA it seems to work a little better than both simpler and
more complicated steps.
I presented the code in C only because C is lingua franca of
comp.arch.
[toc] | [prev] | [next] | [standalone]
Page 2 of 5 — ← Prev page 1 [2] 3 4 5 Next page →
Back to top | Article view | comp.arch
csiph-web