Linus Torvalds on bad architectural features

Started by: anton@mips.complang.tuwien.ac.at (Anton Ertl)
First post: 2025-10-03 08:58 +0000
Last post: 2025-12-28 21:34 +0000
Articles: 20 on this page of 215; 26 participants

  Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-03 08:58 +0000
    Re: Linus Torvalds on bad architectural features BGB <cr88192@gmail.com> - 2025-10-03 05:40 -0500
    Re: Linus Torvalds on bad architectural features Michael S <already5chosen@yahoo.com> - 2025-10-03 13:46 +0300
      SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) Stefan Monnier <monnier@iro.umontreal.ca> - 2025-10-03 11:26 -0400
        Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) MitchAlsup <user5857@newsgrouper.org.invalid> - 2025-10-03 15:42 +0000
          Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) kegs@provalid.com (Kent Dickey) - 2025-10-03 16:18 +0000
            Re: SASOS and virtually tagged caches Stefan Monnier <monnier@iro.umontreal.ca> - 2025-10-03 15:44 -0400
            Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) George Neuner <gneuner2@comcast.net> - 2025-10-06 06:54 -0400
              Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) kegs@provalid.com (Kent Dickey) - 2025-10-06 16:44 +0000
        Re: SASOS and virtually tagged caches BGB <cr88192@gmail.com> - 2025-10-03 17:42 -0500
        Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) kegs@provalid.com (Kent Dickey) - 2025-10-04 04:36 +0000
          Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) John Levine <johnl@taugh.com> - 2025-10-04 18:36 +0000
            Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) Thomas Koenig <tkoenig@netcologne.de> - 2025-10-04 19:00 +0000
            Re: SASOS and virtually tagged caches Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2025-10-04 12:31 -0700
            Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) Michael S <already5chosen@yahoo.com> - 2025-10-05 01:05 +0300
              Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) John Levine <johnl@taugh.com> - 2025-10-04 22:44 +0000
                Re: SASOS and virtually tagged caches BGB <cr88192@gmail.com> - 2025-10-04 17:57 -0500
                Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) Michael S <already5chosen@yahoo.com> - 2025-10-05 02:18 +0300
                Re: SASOS and virtually tagged caches EricP <ThatWouldBeTelling@thevillage.com> - 2025-10-05 13:02 -0400
            Re: SASOS and virtually tagged caches Lynn Wheeler <lynn@garlic.com> - 2025-10-04 14:17 -1000
            Re: SASOS and virtually tagged caches (was: Linus Torvalds on bad architectural features) kegs@provalid.com (Kent Dickey) - 2025-10-06 15:49 +0000
    Re: Linus Torvalds on bad architectural features MitchAlsup <user5857@newsgrouper.org.invalid> - 2025-10-03 15:41 +0000
      Re: Linus Torvalds on bad architectural features BGB <cr88192@gmail.com> - 2025-10-03 16:19 -0500
    Re: Linus Torvalds on bad architectural features John Savard <quadibloc@invalid.invalid> - 2025-10-09 13:57 +0000
    Re: Linus Torvalds on bad architectural features John Savard <quadibloc@invalid.invalid> - 2025-10-09 21:41 +0000
      Re: Linus Torvalds on bad architectural features MitchAlsup <user5857@newsgrouper.org.invalid> - 2025-10-09 22:10 +0000
      Re: Linus Torvalds on bad architectural features scott@slp53.sl.home (Scott Lurndal) - 2025-10-09 22:21 +0000
        Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-10 08:30 +0000
          Re: Linus Torvalds on bad architectural features scott@slp53.sl.home (Scott Lurndal) - 2025-10-10 15:02 +0000
      Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-11 07:18 +0000
        Re: Linus Torvalds on bad architectural features John Levine <johnl@taugh.com> - 2025-10-12 02:37 +0000
          Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-10-12 07:13 +0000
            Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-12 09:51 +0000
              Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-10-12 10:14 +0000
                Re: Linus Torvalds on bad architectural features Michael S <already5chosen@yahoo.com> - 2025-10-12 13:56 +0300
                  Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-10-12 11:38 +0000
                    Re: Linus Torvalds on bad architectural features Michael S <already5chosen@yahoo.com> - 2025-10-12 15:31 +0300
                      Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-12 13:36 +0000
                        Re: Linus Torvalds on bad architectural features Michael S <already5chosen@yahoo.com> - 2025-10-12 20:13 +0300
                          Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-12 17:47 +0000
                          Re: Linus Torvalds on bad architectural features MitchAlsup <user5857@newsgrouper.org.invalid> - 2025-10-12 19:31 +0000
                Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-12 13:31 +0000
                  Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-10-12 15:10 +0000
                    Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-12 15:48 +0000
                      Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-10-12 16:25 +0000
                        Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-12 17:25 +0000
                          Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-10-12 20:03 +0000
                            Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-10-12 21:07 +0000
                              Re: Linus Torvalds on bad architectural features Robert Swindells <rjs@fdy2.co.uk> - 2025-10-13 17:26 +0000
                    Re: Linus Torvalds on bad architectural features Michael S <already5chosen@yahoo.com> - 2025-10-12 19:56 +0300
                      Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-10-12 17:02 +0000
            Re: Linus Torvalds on bad architectural features John Levine <johnl@taugh.com> - 2025-10-12 21:07 +0000
          Re: Linus Torvalds on bad architectural features MitchAlsup <user5857@newsgrouper.org.invalid> - 2025-10-12 16:11 +0000
            Re: Linus Torvalds on bad architectural features BGB <cr88192@gmail.com> - 2025-10-12 13:04 -0500
            Re: Linus Torvalds on bad architectural features Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-28 00:10 +0000
              Re: Linus Torvalds on bad architectural features MitchAlsup <user5857@newsgrouper.org.invalid> - 2025-12-28 17:43 +0000
                Re: Linus Torvalds on bad architectural features EricP <ThatWouldBeTelling@thevillage.com> - 2025-12-28 13:34 -0500
                Re: Linus Torvalds on bad architectural features Stefan Monnier <monnier@iro.umontreal.ca> - 2025-12-28 13:55 -0500
                  Re: Linus Torvalds on bad architectural features BGB <cr88192@gmail.com> - 2025-12-28 16:09 -0600
                    Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-12-28 23:00 +0000
                    Re: Linus Torvalds on bad architectural features anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-12-29 06:59 +0000
                      Re: Linus Torvalds on bad architectural features Thomas Koenig <tkoenig@netcologne.de> - 2025-12-29 08:17 +0000
                        word order and byte order (was: Linus Torvalds ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2025-12-29 09:08 +0000
                          Re: word order and byte order (was: Linus Torvalds ...) Thomas Koenig <tkoenig@netcologne.de> - 2025-12-29 13:39 +0000
                            Re: word order and byte order (was: Linus Torvalds ...) John Levine <johnl@taugh.com> - 2025-12-31 02:54 +0000
                              Re: word order and byte order (was: Linus Torvalds ...) Thomas Koenig <tkoenig@netcologne.de> - 2025-12-31 09:43 +0000
                                Re: floating point history, word order and byte order (was: Linus Torvalds ...) John Levine <johnl@taugh.com> - 2026-01-01 17:46 +0000
                                  Re: floating point history, word order and byte order (was: Linus Torvalds ...) Thomas Koenig <tkoenig@netcologne.de> - 2026-01-04 00:21 +0000
                                    Re: floating point history, word order and byte order (was: Linus Torvalds ...) John Levine <johnl@taugh.com> - 2026-01-04 04:12 +0000
                                      Re: floating point history, word order and byte order (was: Linus Torvalds ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-04 08:06 +0000
                                        Re: floating point history, word order and byte order (was: Linus Torvalds ...) Michael S <already5chosen@yahoo.com> - 2026-01-04 12:20 +0200
                                          Re: floating point history, word order and byte order (was: Linus Torvalds ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-04 18:01 +0000
                                            Re: floating point history, word order and byte order Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-01-04 15:20 -0800
                                              Re: floating point history, word order and byte order anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-05 09:08 +0000
                                                Re: floating point history, word order and byte order jgd@cix.co.uk (John Dallman) - 2026-01-06 22:06 +0000
                                                  Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-07 13:34 +0200
                                                    Re: floating point history, word order and byte order jgd@cix.co.uk (John Dallman) - 2026-01-07 13:16 +0000
                                                      Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-07 17:55 +0200
                                                        Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-07 17:39 +0000
                                                      Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-07 18:58 +0100
                                                        Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-07 20:05 +0000
                                                          Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-07 23:47 +0200
                                                            Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-07 22:10 +0000
                                                              Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-11 00:59 +0000
                                                      Re: floating point history, word order and byte order BGB <cr88192@gmail.com> - 2026-01-07 14:23 -0600
                                                        Re: floating point history, word order and byte order Robert Finch <robfi680@gmail.com> - 2026-01-07 21:16 -0500
                                                      Re: floating point history, word order and byte order kegs@provalid.com (Kent Dickey) - 2026-01-28 13:25 +0000
                                                        Re: floating point history, word order and byte order BGB <cr88192@gmail.com> - 2026-01-28 15:26 -0600
                                                          Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-28 23:03 +0000
                                                            Re: floating point history, word order and byte order BGB <cr88192@gmail.com> - 2026-01-28 17:43 -0600
                                                              Re: floating point history, word order and byte order Robert Finch <robfi680@gmail.com> - 2026-01-29 03:47 -0500
                                                                Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-29 21:30 +0000
                                                                Re: floating point history, word order and byte order BGB <cr88192@gmail.com> - 2026-01-29 17:44 -0600
                                                    Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-07 18:56 +0100
                                                      Re: floating point history, word order and byte order BGB <cr88192@gmail.com> - 2026-01-07 14:38 -0600
                                                        Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-07 21:18 +0000
                                                          Re: floating point history, word order and byte order BGB <cr88192@gmail.com> - 2026-01-07 16:10 -0600
                                                            Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-08 00:05 +0000
                                                              Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-08 02:38 +0000
                                                                Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-08 10:52 +0200
                                                                  Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-08 18:50 +0000
                                                                Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-08 13:10 +0100
                                                                  Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-08 18:54 +0000
                                                                    Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-08 21:35 +0100
                                                                      Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-09 01:24 +0000
                                                                        Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-10 18:02 +0100
                                                                      Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-11 15:01 +0200
                                                                        Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-11 18:18 +0000
                                                                          Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-11 20:50 +0200
                                                                            Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-11 22:11 +0000
                                                                              Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-12 12:20 +0200
                                                                          Re: floating point history, word order and byte order Stefan Monnier <monnier@iro.umontreal.ca> - 2026-01-13 14:45 -0500
                                                                            Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-21 01:44 +0000
                                                                              Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-21 11:05 +0200
                                                                                Re: floating point history, word order and byte order Tim Rentsch <tr.17687@z991.linuxsc.com> - 2026-02-14 20:49 -0800
                                                                              Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-21 22:33 +0100
                                                                                Re: floating point history, word order and byte order Robert Finch <robfi680@gmail.com> - 2026-01-22 14:37 -0500
                                                                              Re: floating point history, word order and byte order George Neuner <gneuner2@comcast.net> - 2026-01-22 15:12 -0500
                                                                                Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-22 22:57 +0200
                                                                                  Re: floating point history, word order and byte order George Neuner <gneuner2@comcast.net> - 2026-01-23 13:47 -0500
                                                                                    [OT] Usenet (was: floating point history ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-24 15:58 +0000
                                                                                      Re: [OT] Usenet (was: floating point history ...) David LaRue <huey.dll@tampabay.rr.com> - 2026-01-25 01:09 +0000
                                                                                      Re: [OT] Usenet (was: floating point history ...) George Neuner <gneuner2@comcast.net> - 2026-01-25 18:12 -0500
                                                                                        Re: [OT] Usenet (was: floating point history ...) scott@slp53.sl.home (Scott Lurndal) - 2026-01-26 21:38 +0000
                                                                                          Re: [OT] Usenet Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> - 2026-01-27 12:19 +0200
                                                                                          Re: [OT] Usenet (was: floating point history ...) George Neuner <gneuner2@comcast.net> - 2026-01-27 06:33 -0500
                                                                              Re: floating point history, word order and byte order jgd@cix.co.uk (John Dallman) - 2026-01-24 09:11 +0000
                                                                              Re: floating point history, word order and byte order Brett <ggtgp@yahoo.com> - 2026-01-25 22:13 +0000
                                                                        Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-11 22:30 +0000
                                                                          Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-12 01:07 +0200
                                                                          Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-12 01:10 +0200
                                                                          Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-12 00:57 +0200
                                                                    Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-11 01:14 +0000
                                                                      Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-11 12:52 +0100
                                                                      Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-11 18:07 +0000
                                                                        Re: floating point history, word order and byte order "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2026-01-11 12:40 -0800
                                                                          Re: floating point history, word order and byte order "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2026-01-11 15:41 -0800
                                                                          Re: floating point history, word order and byte order "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2026-01-11 15:40 -0800
                                                          Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-08 13:05 +0100
                                                    Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-11 00:33 +0000
                                                      Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-11 14:31 +0200
                                                  Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-10 23:21 +0000
                                                    Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-11 00:03 +0000
                                                      Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-11 11:21 +0000
                                                        Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-11 18:08 +0000
                                                    Re: floating point history, word order and byte order Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-01-11 08:38 -0800
                                                      Re: floating point history, word order and byte order John Levine <johnl@taugh.com> - 2026-01-11 19:03 +0000
                                                        Re: floating point history, word order and byte order Thomas Koenig <tkoenig@netcologne.de> - 2026-01-12 22:22 +0000
                                                        Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-13 09:55 +0100
                                        Re: floating point history, word order and byte order (was: Linus Torvalds ...) MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-04 18:47 +0000
                                          Re: floating point history, word order and byte order Stefan Monnier <monnier@iro.umontreal.ca> - 2026-01-04 14:12 -0500
                                            Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-04 20:14 +0000
                                              Re: floating point history, word order and byte order Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-01-04 13:05 -0800
                                              Re: floating point history, word order and byte order anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-05 08:51 +0000
                                                Re: floating point history, word order and byte order Stefan Monnier <monnier@iro.umontreal.ca> - 2026-01-05 11:55 -0500
                                                  Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-05 19:26 +0200
                                                    Re: floating point history, word order and byte order Stefan Monnier <monnier@iro.umontreal.ca> - 2026-01-05 14:33 -0500
                                                      Re: floating point history, word order and byte order anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-06 17:26 +0000
                                                  Re: floating point history, word order and byte order anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-05 17:40 +0000
                                                  Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-07 17:57 +0000
                                                    Re: floating point history, word order and byte order Thomas Koenig <tkoenig@netcologne.de> - 2026-01-09 16:32 +0000
                                                      Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-11 11:49 +0000
                                                        Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-11 18:11 +0000
                                                          Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-12 00:37 +0000
                                                            Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-12 02:05 +0000
                                                          Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-12 16:28 +0100
                                          Re: floating point history, word order and byte order (was: Linus Torvalds ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-05 08:16 +0000
                                            Re: floating point history, word order and byte order (was: Linus Torvalds ...) anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-05 10:21 +0000
                                              Re: floating point history, word order and byte order EricP <ThatWouldBeTelling@thevillage.com> - 2026-01-05 11:05 -0500
                                                Re: floating point history, word order and byte order anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-05 18:03 +0000
                                                  Re: floating point history, word order and byte order EricP <ThatWouldBeTelling@thevillage.com> - 2026-01-05 13:51 -0500
                                                    Re: floating point history, word order and byte order EricP <ThatWouldBeTelling@thevillage.com> - 2026-01-05 14:21 -0500
                                                      Re: floating point history, word order and byte order anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-06 17:50 +0000
                                            Re: floating point history, word order and byte order EricP <ThatWouldBeTelling@thevillage.com> - 2026-01-05 10:31 -0500
                                              Re: floating point history, word order and byte order EricP <ThatWouldBeTelling@thevillage.com> - 2026-01-05 11:02 -0500
                                                Re: floating point history, word order and byte order anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-06 18:14 +0000
                                        Re: floating point history, word order and byte order (was: Linus Torvalds ...) scott@slp53.sl.home (Scott Lurndal) - 2026-01-05 15:40 +0000
                                      Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-04 13:22 +0100
                                    Re: floating point history, word order and byte order (was: Linus Torvalds ...) Michael S <already5chosen@yahoo.com> - 2026-01-04 12:36 +0200
                                      Re: floating point history, word order and byte order (was: Linus Torvalds ...) Thomas Koenig <tkoenig@netcologne.de> - 2026-01-04 16:45 +0000
                                        Re: floating point history, word order and byte order (was: Linus Torvalds ...) Michael S <already5chosen@yahoo.com> - 2026-01-04 19:35 +0200
                                        Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-06 12:35 +0100
                                          Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-06 15:26 +0200
                                            Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-06 17:06 +0100
                                              Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-06 17:59 +0000
                                                Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-06 20:15 +0200
                                                  Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-06 19:35 +0000
                                                    Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-06 23:09 +0200
                                      Re: floating point history, word order and byte order (was: Linus Torvalds ...) MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-06 17:56 +0000
                                        Re: floating point history, word order and byte order (was: Linus Torvalds ...) Michael S <already5chosen@yahoo.com> - 2026-01-06 20:12 +0200
                                          Re: floating point history, word order and byte order (was: Linus Torvalds ...) MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-06 19:29 +0000
                                            Re: floating point history, word order and byte order (was: Linus Torvalds ...) Michael S <already5chosen@yahoo.com> - 2026-01-07 15:06 +0200
                                              Re: floating point history, word order and byte order (was: Linus Torvalds ...) scott@slp53.sl.home (Scott Lurndal) - 2026-01-07 15:24 +0000
                                                Re: floating point history, word order and byte order (was: Linus Torvalds ...) Michael S <already5chosen@yahoo.com> - 2026-01-07 18:06 +0200
                                                  Re: floating point history, word order and byte order (was: Linus Torvalds ...) scott@slp53.sl.home (Scott Lurndal) - 2026-01-07 16:41 +0000
                                                    Re: floating point history, word order and byte order antispam@fricas.org (Waldek Hebisch) - 2026-01-07 17:32 +0000
                                                      Re: floating point history, word order and byte order scott@slp53.sl.home (Scott Lurndal) - 2026-01-07 19:14 +0000
                                                      Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-08 12:50 +0100
                                                        Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-08 15:41 +0200
                                                          Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-08 21:25 +0100
                                                            Re: floating point history, word order and byte order Michael S <already5chosen@yahoo.com> - 2026-01-08 22:50 +0200
                                                  Re: floating point history, word order and byte order (was: Linus Torvalds ...) MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-07 17:44 +0000
                                              Re: floating point history, word order and byte order Stephen Fuld <sfuld@alumni.cmu.edu.invalid> - 2026-01-07 10:22 -0800
                                            Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-07 18:47 +0100
                                        Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-07 18:38 +0100
                                          Re: floating point history, word order and byte order anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2026-01-07 18:38 +0000
                                            Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-07 20:11 +0000
                                              Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-08 13:01 +0100
                                                Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-08 18:52 +0000
                                          Re: floating point history, word order and byte order scott@slp53.sl.home (Scott Lurndal) - 2026-01-07 19:19 +0000
                                          Re: floating point history, word order and byte order MitchAlsup <user5857@newsgrouper.org.invalid> - 2026-01-07 19:56 +0000
                                    Re: floating point history, word order and byte order Terje Mathisen <terje.mathisen@tmsw.no> - 2026-01-04 13:17 +0100
                    Re: Linus Torvalds on bad architectural features Stefan Monnier <monnier@iro.umontreal.ca> - 2025-12-29 13:48 -0500
              Re: Linus Torvalds on bad architectural features Bill Findlay <findlaybill@blueyonder.co.uk> - 2025-12-28 19:21 +0000
                Re: Linus Torvalds on bad architectural features MitchAlsup <user5857@newsgrouper.org.invalid> - 2025-12-28 21:34 +0000

#114681 — Re: floating point history, word order and byte order

From	MitchAlsup <user5857@newsgrouper.org.invalid>
Date	2026-01-07 20:05 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<1767816317-5857@newsgrouper.org>
In reply to	#114675

Terje Mathisen <terje.mathisen@tmsw.no> posted:

> John Dallman wrote:
> > In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com
> > (Michael S) wrote:
> > 
> >> I already asked you couple of years ago how fast do want binary128
> >> in order to consider it fast enough.
> >> IIRC, you either avoided the answer completely or gave totally
> >> unrealistic answer like "the same as binary64".
> >> May be, nobody bites because with non-answers or answers like that
> >> nobody thinks that you are serious?
> > 
> > I don't know much about hardware design. What is realistic for hardware
> > binary128?
> 
> Sub-10 cycles fmul/fadd/fsub seems very doable?
> 
> Mitch?
> 
Assuming 128-bit operands are delivered in 1 cycle and 128-bit
results are delivered in 1 cycle::

128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

128-bit Fmul requires that the multiplier tree be 64×64 instead of
53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
cycles longer than 64-bit Fmul. If you wanted to be "really clever"
you could use a 59×59 tree and the FU is only 1.12× bigger; but here
you could not use the tree for Integer MUL.
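
As a rough check on the tree-size figures, here is a minimal Python sketch. It assumes tree area scales with the number of partial-product bits, i.e. with width × width; that scaling rule and the helper name `tree_ratio` are assumptions for illustration. The FU-level figures (1.22× and 1.12×) depend on how much of the unit is not multiplier tree, so they are not derived here.

```python
def tree_ratio(width, baseline=53):
    """Partial-product bit count of a width x width tree relative to baseline x baseline."""
    return (width * width) / (baseline * baseline)


for width in (59, 64):
    print(f"{width}x{width} tree: {tree_ratio(width):.2f}x the 53x53 tree")
```

Under that assumption the 64×64 tree comes out at about 1.46× the 53×53 tree, matching the figure quoted for the tree.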

> Terje
>

#114686 — Re: floating point history, word order and byte order

From	Michael S <already5chosen@yahoo.com>
Date	2026-01-07 23:47 +0200
Subject	Re: floating point history, word order and byte order
Message-ID	<20260107234706.000033a7@yahoo.com>
In reply to	#114681

On Wed, 07 Jan 2026 20:05:17 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

> Terje Mathisen <terje.mathisen@tmsw.no> posted:
> 
> > John Dallman wrote:  
> > > In article <20260107133424.00000e99@yahoo.com>,
> > > already5chosen@yahoo.com (Michael S) wrote:
> > >   
> > >> I already asked you couple of years ago how fast do want
> > >> binary128 in order to consider it fast enough.
> > >> IIRC, you either avoided the answer completely or gave totally
> > >> unrealistic answer like "the same as binary64".
> > >> May be, nobody bites because with non-answers or answers like
> > >> that nobody thinks that you are serious?  
> > > 
> > > I don't know much about hardware design. What is realistic for
> > > hardware binary128?  
> > 
> > Sub-10 cycles fmul/fadd/fsub seems very doable?
> > 
> > Mitch?
> >   
> Assuming 128-bit operands are delivered in 1 cycle and 128-bit
> results are delivered in 1 cycle::
> 

If we are talking about SIMD of the same width (measured in bits) as
SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
that fully pipelined binary128 operations are a non-starter, because they
would blow your power and thermal budget. One full-width result (i.e. 8
binary128 results) every 2 cycles sounds somewhat more realistic.
After all, in a general-purpose CPU, binary128, if implemented at all, is
a proverbial tail that can't be allowed to wag the dog.

OTOH, if we define our binary128 to use only the least-significant 128-bit
lane of our 512-bit register and only build b128 capabilities into one
of our pair of FPUs, then full pipelining (i.e. 1 result per cycle) looks
like a good choice, at least from a power/thermal perspective. That is,
as long as designers find a way to avoid a hot spot.

> 128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.
> 
> 128-bit Fmul requires that the multiplier tree be 64×64 instead of
> 53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
> cycles longer than 64-bit Fmul. If you wanted to be "really clever"
> you could use a 59×59 tree and the FU is only 1.12× bigger; but here
> you could not use the tree for Integer MUL.
> 
> > Terje
> >

#114688 — Re: floating point history, word order and byte order

From	MitchAlsup <user5857@newsgrouper.org.invalid>
Date	2026-01-07 22:10 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<1767823827-5857@newsgrouper.org>
In reply to	#114686

Michael S <already5chosen@yahoo.com> posted:

> On Wed, 07 Jan 2026 20:05:17 GMT
> MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
> 
> > Terje Mathisen <terje.mathisen@tmsw.no> posted:
> > 
> > > John Dallman wrote:  
> > > > In article <20260107133424.00000e99@yahoo.com>,
> > > > already5chosen@yahoo.com (Michael S) wrote:
> > > >   
> > > >> I already asked you couple of years ago how fast do want
> > > >> binary128 in order to consider it fast enough.
> > > >> IIRC, you either avoided the answer completely or gave totally
> > > >> unrealistic answer like "the same as binary64".
> > > >> May be, nobody bites because with non-answers or answers like
> > > >> that nobody thinks that you are serious?  
> > > > 
> > > > I don't know much about hardware design. What is realistic for
> > > > hardware binary128?  
> > > 
> > > Sub-10 cycles fmul/fadd/fsub seems very doable?
> > > 
> > > Mitch?
> > >   
> > Assuming 128-bit operands are delivered in 1 cycle and 128-bit
> > results are delivered in 1 cycle::
> > 
> 
> If we are talking about SIMD of the same width (measured in bits) as
> SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
> and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
> that fully pipelined binary128 operations are a non-starter, because they
> would blow your power and thermal budget.

I agree, however a single 128-bit FPU would fit inside a reasonable
power budget.

>                                           One full-width result (i.e. 8
> binary128 results) every 2 cycles sounds somewhat more realistic.

Likely still over a reasonable power budget.

> After all in general-purpose CPU binary128, if at all implemented, is
> a proverbial tail that can't be allowed to wag the dog.

We build (and call) our current machines 64-bits because that is the
size of the register files (not including SIMD/Vector) and because
we can run the scalar unit at rated clock frequency (non SIMD/Vector)
essentially continuously.

Once we step over the scalar width, power goes up 2×-4× and we get a
couple of hundred cycles before frequency throttling. Thus we cannot,
in general, run SIMD/Vector at rated frequency continuously. Nor can
we, at present, build a memory system that can properly feed a
SIMD/Vector RF so that one can use all of the available lanes of
calculation. {HBM is approaching this point; however, it becomes
more like B-memory from the CRAY-2 than main memory, for applications
that can use that much B-memory effectively.}
 
> OTOH, if we define our binary128 to use only least-significant 128 bit
> lane of our 512-bit register and only build b128 capabilities into one
> of our pair of FPUs then full pipelining (i.e. 1 result per cycle) looks
> like a good choice, at least from power/thermal perspective. That is,
> as long as designers found a way to avoid a hot spot.

We could, instead, treat them as pairs of GPRs--like we did in Mc 88120
and still not need SIMD/Vectors.
 
> > 128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.
> > 
> > 128-bit Fmul requires that the multiplier tree be 64×64 instead of
> > 53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
> > cycles longer than 64-bit Fmul. If you wanted to be "really clever"
> > you could use a 59×59 tree and the FU is only 1.12× bigger; but here
> > you could not use the tree for Integer MUL.
> > 
> > > Terje
> > >   
> 
>

#114717 — Re: floating point history, word order and byte order

From	antispam@fricas.org (Waldek Hebisch)
Date	2026-01-11 00:59 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<10jusm5$2q67j$2@paganini.bofh.team>
In reply to	#114688

MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
> 
> Michael S <already5chosen@yahoo.com> posted:
> 
>> On Wed, 07 Jan 2026 20:05:17 GMT
>> MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
>> 
>> > Terje Mathisen <terje.mathisen@tmsw.no> posted:
>> > 
>> > > John Dallman wrote:  
>> > > > In article <20260107133424.00000e99@yahoo.com>,
>> > > > already5chosen@yahoo.com (Michael S) wrote:
>> > > >   
>> > > >> I already asked you couple of years ago how fast do want
>> > > >> binary128 in order to consider it fast enough.
>> > > >> IIRC, you either avoided the answer completely or gave totally
>> > > >> unrealistic answer like "the same as binary64".
>> > > >> May be, nobody bites because with non-answers or answers like
>> > > >> that nobody thinks that you are serious?  
>> > > > 
>> > > > I don't know much about hardware design. What is realistic for
>> > > > hardware binary128?  
>> > > 
>> > > Sub-10 cycles fmul/fadd/fsub seems very doable?
>> > > 
>> > > Mitch?
>> > >   
>> > Assuming 128-bit operands are delivered in 1 cycle and 128-bit
>> > results are delivered in 1 cycle::
>> > 
>> 
>> If we are talking about SIMD of the same width (measured in bits) as
>> SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
>> and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
>> that fully pipelined binary128 operations are a non-starter, because they
>> would blow your power and thermal budget.
> 
> I agree, however a single 128-bit FPU would fit inside a reasonable
> power budget.
> 
>>                                           One full-width result (i.e. 8
>> binary128 results) every 2 cycles sounds somewhat more realistic.
> 
> Likely still over a reasonable power budget.
> 
>> After all in general-purpose CPU binary128, if at all implemented, is
>> a proverbial tail that can't be allowed to wag the dog.
> 
> We build (and call) our current machines 64-bits because that is the
> size of the register files (not including SIMD/Vector) and because
> we can run the scalar unit at rated clock frequency (non SIMD/Vector)
> essentially continuously.
> 
> Once we step over the scalar width, power goes up 2×-4× and we get a
> couple of hundred cycles before frequency throttling. Thus, we cannot
> in general, run SIMD/Vector at rated frequency continuously.

I understand that multipliers are big and power hungry.  I know
almost nothing about permute units, but they too look like big
and power hungry things.  But how bad is it when one is doing
simple operations, say mostly in registers?

> Nor can
> we, at present time, build a memory system that can properly feed a
> SIMD/Vector RF so that one can use all of the lanes of available
> calculations.

There is matrix multiply, which does n^3 multiplies on n^2
data.  I need polynomial multiplication, that is n^2 multiplies
on size n data.  There are real computations where one or
two pieces of data go through several steps.  So there are a
lot of compute-intensive problems where processing units can
do work on data in registers or from L1 cache.

So if compute units can do the work, it is still useful,
even if other problems are memory bound.
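
To put a number on the polynomial case, here is a minimal Python sketch of schoolbook polynomial multiplication that counts scalar multiplies. The function name `schoolbook_poly_mul` and the choice n = 64 are illustrative assumptions only; a real implementation would work on machine-width coefficients held in registers.

```python
def schoolbook_poly_mul(a, b):
    """Multiply two coefficient lists (lowest degree first), counting scalar multiplies."""
    result = [0] * (len(a) + len(b) - 1)
    multiplies = 0
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            result[i + j] += ai * bj
            multiplies += 1
    return result, multiplies


n = 64
product, multiplies = schoolbook_poly_mul(list(range(1, n + 1)), [1] * n)
inputs = 2 * n
print(f"{multiplies} multiplies on {inputs} input coefficients "
      f"({multiplies / inputs:.0f} per coefficient)")
```

The multiply count grows as n^2 while the input grows as n, so the work per loaded coefficient rises linearly with n.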

> {HBM is approaching this point, however--it becomes
> more like B-memory from CRAY-2; than main memory for applications
> that can use that much b-memory effectively.}
>  
>> OTOH, if we define our binary128 to use only least-significant 128 bit
>> lane of our 512-bit register and only build b128 capabilities into one
>> of our pair of FPUs then full pipelining (i.e. 1 result per cycle) looks
>> like a good choice, at least from power/thermal perspective. That is,
>> as long as designers found a way to avoid a hot spot.
> 
> We could, instead, treat them as pairs of GPRs--like we did in Mc 88120
> and still not need SIMD/Vectors.
>  
>> > 128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.
>> > 
>> > 128-bit Fmul requires that the multiplier tree be 64×64 instead of
>> > 53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
>> > cycles longer than 64-bit Fmul. If you wanted to be "really clever"
>> > you could use a 59×59 tree and the FU is only 1.12× bigger; but here
>> > you could not use the tree for Integer MUL.
>> > 
>> > > Terje
>> > >   
>> 
>> 

-- 
                              Waldek Hebisch

#114683 — Re: floating point history, word order and byte order

From	BGB <cr88192@gmail.com>
Date	2026-01-07 14:23 -0600
Subject	Re: floating point history, word order and byte order
Message-ID	<10jmfbj$vf5g$1@dont-email.me>
In reply to	#114663

On 1/7/2026 7:16 AM, John Dallman wrote:
> In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com
> (Michael S) wrote:
> 
>> I already asked you couple of years ago how fast do want binary128
>> in order to consider it fast enough.
>> IIRC, you either avoided the answer completely or gave totally
>> unrealistic answer like "the same as binary64".
>> May be, nobody bites because with non-answers or answers like that
>> nobody thinks that you are serious?
> 
> I don't know much about hardware design. What is realistic for hardware
> binary128?
> 

Likely estimate for FPGA:
   Around 28 DSP48's for a "triangular" multiplier;
     Would need to add several clock cycles for the adder tree;
     ...
   FADD/FSUB unit, also around 12 cycles,
     as most intermediate steps now take 2 clock cycles;

Estimate:
Probably around 5k LUTs for the FMUL, with a latency of around 10 or 12 
clock cycles.
Probably around 12k LUTs for FADD/FSUB unit;
Will need a few more kLUT for the glue logic.

So, will put the cost at:
   18-20 kLUT likely;
   ~ 28 DSP48s;
   Around 12 cycles of latency.

What about an FMA based implementation:
   Probably 49 DSP48's and around 24 cycles of latency.
     Where, 49 is needed for full-width multiplier results.
     Also add a big bump to the LUT cost vs separate units.
   An FMA unit roughly has the latency cost of both the FADD and FMUL.
But, some people really like the ability to quickly have single-rounded 
results.
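
The 28 and 49 DSP counts can be reproduced with a small Python sketch, under one assumption that is not stated above: that each DSP48 is used as an unsigned 17-bit × 17-bit multiplier, so the 113-bit significand is cut into 17-bit limbs. The constant and variable names are illustrative.

```python
import math

SIGNIFICAND_BITS = 113   # binary128: 112 stored bits plus the hidden bit
LIMB_BITS = 17           # assumption: unsigned 17-bit limbs, one DSP48 per limb pair

limbs = math.ceil(SIGNIFICAND_BITS / LIMB_BITS)

# Full-width product: every limb of A times every limb of B.
full = limbs * limbs

# "Triangular" product: keep only limb pairs that reach the upper half
# of the result (anti-diagonal and above); the rest are dropped.
triangular = sum(1 for i in range(limbs) for j in range(limbs)
                 if i + j >= limbs - 1)

print(f"{limbs} limbs: {triangular} multipliers triangular, {full} full-width")
```

With these assumptions the sketch gives 7 limbs, 28 multipliers for the triangular case and 49 for the full-width case.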

The initial FPU would likely take around 1/3 of the total LUT budget of 
an XC7A100T, and it is unclear if such a thing would be possible within a 
50 MHz CPU core (might require dropping to 33 MHz or similar).

In my case, similar issues wrecked my ideas of doing a 96 bit truncated 
format, and even then 96 bit is still less than 128 bit. My current 
strategy is to instead allow for trap-based handling or hot-patching.

To simplify a hot-patching implementation, I am now considering having 
the compiler set aside roughly 4 instruction-words of "hot patch zone" 
for any instruction that is likely to be implemented via hot-patching.

These would be dumped out in blobs within 1MB of the target, or at the 
end of ".text", whichever comes first. Technically, 3 words would be the 
minimum, but 4 allows for a little more working flexibility.

May make sense to assume that the hot-patching is free to stomp X5, as 
this would make it possible to implement on RV64G. Though, would need 6 
to allow for AUIPC+LD+JALR; but still works if assuming AUIPC+JALR (+/- 
4GB).

This would give space for the handler to replace the offending 
instruction with a JAL, and then to branch off to whatever memory is 
being used for hot-patched instruction sequences.

Granted, this sort of thing only works well if one assumes compiler 
cooperation.

Current possibility is that the compiler could hint at these spaces by 
filling them with a special instruction, such as:
   JALR X0, 0(X0)  //branch to NULL
Where, if the loader or trap handler sees large blobs of such an 
instruction, it can assume that this area was set aside for use by the 
hot patching to reuse to encode long-distance branches.
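
A loader-side scan for such filler blobs could look like the following minimal Python sketch. The function name `find_patch_zones` and the 4-word threshold are assumptions for illustration, and the sketch assumes the region holds only 32-bit little-endian instruction words (no compressed instructions). 0x00000067 is the encoding of JALR X0, 0(X0), since every field other than the opcode is zero.

```python
import struct

JALR_X0_0_X0 = 0x00000067   # "JALR X0, 0(X0)": opcode 0x67, all other fields zero


def find_patch_zones(text, min_words=4, filler=JALR_X0_0_X0):
    """Return (byte_offset, word_count) for each run of at least min_words filler words."""
    count = len(text) // 4
    words = struct.unpack(f"<{count}I", text[: count * 4])
    zones, start = [], None
    # The trailing None sentinel closes a run that reaches the end of the buffer.
    for index, word in enumerate(words + (None,)):
        if word == filler:
            if start is None:
                start = index
        else:
            if start is not None and index - start >= min_words:
                zones.append((start * 4, index - start))
            start = None
    return zones


NOP = 0x00000013  # ADDI X0, X0, 0
sample = struct.pack("<7I", NOP, JALR_X0_0_X0, JALR_X0_0_X0,
                     JALR_X0_0_X0, JALR_X0_0_X0, NOP, JALR_X0_0_X0)
print(find_patch_zones(sample))
```

Short runs are ignored so that an isolated filler-like word is not mistaken for a reserved zone.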

Could probably also add this to XG1/XG2 if trying to do similar (like 
enabling the "FPUX" extension), may make sense to find some other filler 
instruction that makes sense for XG2 though (using RISC-V JALR 
instructions would be a little out of place in this case).

Granted, could also make sense to use a large blob of EBREAK or similar, 
which could have a similar effect (mostly depends on the probability 
that a program would have some other likely reason to have a big blob of 
EBREAK's, and EBREAK has a higher probability to be "actually useful" 
than a JALR-NULL).

Granted, one could argue that using pad-space defeats the merit of using 
trapping-instructions rather than runtime calls. But, alas...

Ironically, for my RV+SIMD stuff, I partly leaned into still using 
runtime calls for some operations rather than doing them inline, as 
doing them inline is more bulk with a comparatively weaker SIMD ISA (but, 
with some more fiddling, weak SIMD is still a big improvement over no-SIMD 
for things like GLQuake).

Well, and more fiddling to make RV FPU handling by BGBCC less crappy:
   More likely to use the correct registers, etc.

And, currently putting 128-bit SIMD in FPU register pairs, which is 
mostly less-bad than GPRs even in the absence of native SIMD ops, apart 
from the "epic crappiness" of trying to deal with shuffle operations (I 
did add "FPU PACK" style instructions as otherwise this part is "dog crap").

Or, in RV terms, one has, say:
   PACK   Rd, Rs1, Rs2  // { Rs1[31: 0], Rs2[31: 0] }
   PACKU  Rd, Rs1, Rs2  // { Rs1[63:32], Rs2[63:32] }
BitManip stopped there, my case would also have PACKBT/PACKTB, though in 
my ISA they were called MOVLD/MOVHD/MOVLHD/MOVHLD (BGBCC still mostly 
uses these names, but allows PACK/PACKU for ASM code). The RV P 
extension also defines all 4 cases, but only for GPRs.

For sake of sanity, my SIMD extension had also defined variants for 
FPRs, albeit still using the same mnemonics (the assembler figures out 
what to do based on registers here).

So, in this case, it is still more sensible to use internal runtime calls for 
operations like DotProduct and CrossProduct and similar (but these are likely 
to remain as inline operations for XG2/3).

Similar also applies to complex-number and quaternion operations, which 
will mostly remain as runtime calls.

As noted, no current plans to move beyond 64/128 bit SIMD.

Most likely option is that, rather than (hypothetically) define any sort 
of large-vector SIMD, may make more sense to fake large SIMD via the 
RV-V extension, and then probably use hot-patching to pretend that it 
exists (if needed, by faking RV-V on top of the narrower SIMD).

Like, by the time one wants crap like AVX or similar, then RV-V starts 
to seem more sane.

Big problem-case is when one wants something more like MMX or SSE-1, 
where RV-V seems like a pretty big ask to expect a hardware implementation.

But, a hot-patching implementation could potentially be fast enough to 
make RV-V "not totally worthless" (if faking 256 bit vectors or similar, 
it is then more likely to eat the relative overhead of the patch-calls). 
And, could then implement native RV-V for hardware that can justify the 
cost.

...

#114690 — Re: floating point history, word order and byte order

From	Robert Finch <robfi680@gmail.com>
Date	2026-01-07 21:16 -0500
Subject	Re: floating point history, word order and byte order
Message-ID	<10jn429$15rmi$1@dont-email.me>
In reply to	#114683

On 2026-01-07 3:23 p.m., BGB wrote:
> On 1/7/2026 7:16 AM, John Dallman wrote:
>> In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com
>> (Michael S) wrote:
>>
>>> I already asked you couple of years ago how fast do want binary128
>>> in order to consider it fast enough.
>>> IIRC, you either avoided the answer completely or gave totally
>>> unrealistic answer like "the same as binary64".
>>> May be, nobody bites because with non-answers or answers like that
>>> nobody thinks that you are serious?
>>
>> I don't know much about hardware design. What is realistic for hardware
>> binary128?
>>
> 
> Likely estimate for FPGA:
>    Around 28 DSP48's for a "triangular" multiplier;
>      Would need to add several clock cycles for the adder tree;
>      ...
>    FADD/FSUB unit, also around 12 cycles,
>      as most intermediate steps now take 2 clock cycles;
> 
> 
> Estimate:
> Probably around 5k LUTs for the FMUL, with a latency of around 10 or 12 
> clock cycles.
> Probably around 12k LUTs for FADD/FSUB unit;
> Will need a few more kLUT for the glue logic.
> 
> So, will put the cost at:
>    18-20 kLUT likely;
>    ~ 28 DSP48s;
>    Around 12 cycles of latency.
> 
> 
> What about an FMA based implementation:
>    Probably 49 DSP48's and around 24 cycles of latency.
>      Where, 49 is needed for full-width multiplier results.
>      Also add a big bump to the LUT cost vs separate units.
>    An FMA unit roughly has the latency cost of both the FADD and FMUL.
> But, some people really like the ability to quickly have single-rounded 
> results.
> 
> 

The 128-bit FMA I implemented, with an eight-cycle latency, uses 36 DSPs 
(Karatsuba multiplier). The latency is a bit less than double that of an 
FADD. One cycle can be trimmed because operand decoding can happen in 
parallel, and there is only a single normalization and rounding step, 
which also trims a couple of clocks off the doubled latency.

My FADD has a five-cycle latency. Latency is a bit of a designer’s 
choice and can be set up as desired for the clock frequency. I picked 
eight to try and match the FP clock to the CPU clock (slow CPU clock). 
Many more stages could be added to bump up the clock frequency.

The FMA consumes about 8600 LUTs and 2600 FFs. I decided to use FMAs 
(without FADD, FMUL) in my design even though the latency is a bit more 
as I think the total LUT cost is lower.

<snip>

#114784 — Re: floating point history, word order and byte order

From	kegs@provalid.com (Kent Dickey)
Date	2026-01-28 13:25 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<10ld2nd$hkt8$1@dont-email.me>
In reply to	#114663

In article <memo.20260107131628.5352Z@jgd.cix.co.uk>,
John Dallman <jgd@cix.co.uk> wrote:
>In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com
>(Michael S) wrote:
>
>> I already asked you couple of years ago how fast do want binary128 
>> in order to consider it fast enough.
>> IIRC, you either avoided the answer completely or gave totally
>> unrealistic answer like "the same as binary64".
>> May be, nobody bites because with non-answers or answers like that
>> nobody thinks that you are serious?
>
>I don't know much about hardware design. What is realistic for hardware
>binary128? 
>
>John 

I'm late to this, but I thought I'd point out that for very low hardware cost
(mostly control complexity), you can implement 128-bit FP FMUL at about 2x
the latency of 64-bit FMUL and 1/4th the throughput, with some control
complexity but no significant storage or data buses.  Sadly, 128-bit FADD
and FMA seem to require more resources over 64-bit FADD/FMA.

128-bit FP has a sign bit, a 15-bit exponent, and a 112-bit mantissa.
If you divide this into 64-bit pieces, then the low part has mantissa[63:0],
and the upper part has { sign, exp[14:0], mantissa[111:64] }.
64-bit FP is { sign, exp[10:0], mantissa[51:0] }.

Assume 64-bit FMUL takes 4 cycles, and is fully pipelined (can start a new
operation every clock, results come out 4 clocks later).  Assume you have
a 53*53 = 106 bit multiplier for 64-bit FMA.  We need to widen this to a
57*57 multiplier, and widen the adders/accumulators as well from ~107 bits
to ~115 bits.  This has a cost, but it's small.  We also need to provide the
input registers over 4 clocks, over the existing 64-bit wide buses, but this
is a control complexity.  We'll effectively do 4 partial multiplies using
the existing 64-bit paths, and then combine them into the 128-bit result,
which will be generated over 2 clocks.  So all in/out paths stay 64-bit.

Divide the 128-bit FP mantissa into low[56:0] which is mantissa[56:0] and
high[56:0] = 01,mantissa[111:57].  So send down the FMUL pipeline new register
inputs over 4 clocks.  We'll call the 128-bit FP register A, which consists
of 64-bit register A_low and 64-bit register A_high, and the other operand
is B.

Clock 0: A_low*B_low.  Result comes in clock 4
Clock 1: A_low*B_high. Result comes in clock 5
Clock 2: A_high*B_low. Result comes in clock 6
Clock 3: A_high*B_high.  Result comes in clock 7.

Now we just need to wait for those results to arrive, no more register
values are fed to FMUL each clock for the next steps.

Clock 4: Sum[56:0] = low*low shift right 57 (track sticky bit)
Clock 5: Sum[114:0] += low*high
Clock 6: Sum[58:0] += high*low, shift right 57 (track sticky bit)
Clock 7: Sum[114:0] += high*high.  Do rounding.  Return low [63:0] of result
Clock 8: Return high part of result (fixing up exponent) to return [127:64]

The Sum only works on ~115 bits, which is only a little more than the
~108 bits needed for 64-bit FMA.

When the unit receives A_low and B_low, it's getting 7 mantissa bits it won't
use (since it's just using [56:0] in the multiply).  It needs to save those
bits to be part of high in the future cycles, since when the register file
supplies the High parts in later cycles, it will be mantissa[111:64], and we
need mantissa[111:57].
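
The four-pass schedule above can be checked in software. The following is a minimal Python sketch under stated assumptions, not a model of any real hardware: the function names are made up for illustration, only normal numbers are handled (hidden bit taken as 1), and sign, exponent, rounding, and normalization are left out. It splits each operand into the two 57-bit multiplier inputs, including the 7 saved bits, and follows the clock 4 to clock 7 accumulation steps.

```python
HALF = 57                       # width of each multiplier input, as in the post
MASK_HALF = (1 << HALF) - 1


def split_significand(bits128):
    """Split a binary128 bit pattern into (low, high) 57-bit multiplier inputs."""
    reg_low = bits128 & ((1 << 64) - 1)         # mantissa[63:0]
    reg_high = bits128 >> 64                    # { sign, exp, mantissa[111:64] }
    low = reg_low & MASK_HALF                   # mantissa[56:0]
    saved7 = reg_low >> HALF                    # mantissa[63:57], saved for later
    frac_high = reg_high & ((1 << 48) - 1)      # mantissa[111:64]
    high = (1 << 55) | (frac_high << 7) | saved7    # { 01, mantissa[111:57] }
    return low, high


def multiply_in_four_passes(a_bits, b_bits):
    """Return (upper product bits, sticky) using four 57x57 partial products."""
    a_low, a_high = split_significand(a_bits)
    b_low, b_high = split_significand(b_bits)
    total = a_low * b_low                       # clock 4
    sticky = (total & MASK_HALF) != 0
    total >>= HALF
    total += a_low * b_high                     # clock 5
    total += a_high * b_low                     # clock 6
    sticky = sticky or (total & MASK_HALF) != 0
    total >>= HALF
    total += a_high * b_high                    # clock 7
    return total, sticky


def reference_product(a_bits, b_bits):
    """Full 113x113-bit significand product, computed directly for comparison."""
    frac_mask = (1 << 112) - 1
    sig_a = (1 << 112) | (a_bits & frac_mask)
    sig_b = (1 << 112) | (b_bits & frac_mask)
    return sig_a * sig_b


a = (0x3FFF << 112) | 0x123456789ABCDEF0123456789ABC
b = (0x4000 << 112) | 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFF
total, sticky = multiply_in_four_passes(a, b)
full = reference_product(a, b)
print(total == full >> (2 * HALF), sticky == ((full & ((1 << (2 * HALF)) - 1)) != 0))
```

The accumulated value equals the full product shifted right by 114 bits, and the sticky flag is the OR of the 114 bits shifted out, which is the information a rounding step needs.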

128-bit FADD does not break into parts so easily.  So adding 128-bit FP
really then becomes how much space you want to put into the 128-bit FADD.
I think it can be done just with extra storage, keeping the adder to
115 bits (for FMA), but it requires a late fixup since the magnitude of the
two items being added isn't known when the add must begin.

This strategy trades off performance for low resource usage, which I think
is an excellent tradeoff for 128-bit FP.  And it will have the same power
profile as 64-bit FP--running 128-bit FP flat out is about the same power as
running 64-bit FP flat out.

Kent

#114788 — Re: floating point history, word order and byte order

From	BGB <cr88192@gmail.com>
Date	2026-01-28 15:26 -0600
Subject	Re: floating point history, word order and byte order
Message-ID	<10ldv36$qpj3$1@dont-email.me>
In reply to	#114784

On 1/28/2026 7:25 AM, Kent Dickey wrote:
> In article <memo.20260107131628.5352Z@jgd.cix.co.uk>,
> John Dallman <jgd@cix.co.uk> wrote:
>> In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com
>> (Michael S) wrote:
>>
>>> I already asked you couple of years ago how fast do want binary128
>>> in order to consider it fast enough.
>>> IIRC, you either avoided the answer completely or gave totally
>>> unrealistic answer like "the same as binary64".
>>> May be, nobody bites because with non-answers or answers like that
>>> nobody thinks that you are serious?
>>
>> I don't know much about hardware design. What is realistic for hardware
>> binary128?
>>
>> John
> 
> I'm late to this, but I thought I'd point out for very low hardware cost
> (mostly control complexity), you can implement 128-bit FP FMUL at about 2x
> the latency of 64-bit FMUL and 1/4th the throughput, with some control
> complexity but no significant storage or data buses.  Sadly, 128-bit FADD
> and FMA seems to require more resources over 64-bit FADD/FMA.
> 
> 128-bit FP has a sign bit, a 15-bit exponent, and a 112-bit mantissa.
> If you divide this into 64-bit pieces, then the low part has mantissa[63:0],
> and the upper part has { sign, exp[14:0], mantissa[111:64] }.
> 64-bit FP is {sign, exp[10:0], mantissa[51:0] }
> 
> Assume 64-bit FMUL takes 4 cycles, and is fully pipelined (can start a new
> operation every clock, results come out 4 clocks later).  Assume you have
> a 53*53 = 106 bit multiplier for 64-bit FMA.  We need to widen this to a
> 57*57 multiplier, and widen the adders/accumulators as well from ~107 bits
> to ~115 bits.  This has a cost, but it's small.  We also need to provide the
> input registers over 4 clocks, over the existing 64-bit wide buses, but this
> is a control complexity.  We'll effectively do 4 partial multiplies using
> the existing 64-bit paths, and then combine them into the 128-bit result,
> which will be generated over 2 clocks.  So all in/out paths stay 64-bit.
> 
> Divide the 128-bit FP mantissa into low[56:0] which is mantissa[56:0] and
> high[56:0] = 01,mantissa[111:57].  So send down the FMUL pipeline new register
> inputs over 4 clocks.  We'll call the 128-bit FP register A, which consists
> of 64-bit register A_low and 64-bit register A_high, and the other operand
> is B.
> 
> Clock 0: A_low*B_low.  Result comes in clock 4
> Clock 1: A_low*B_high. Result comes in clock 5
> Clock 2: A_high*B_low. Result comes in clock 6
> Clock 3: A_high*B_high.  Result comes in clock 7.
> 
> Now we just need to wait for those results to arrive, no more register
> values are fed to FMUL each clock for the next steps.
> 
> Clock 4: Sum[56:0] = low*low shift right 57 (track sticky bit)
> Clock 5: Sum[114:0] += low*high
> Clock 6: Sum[58:0] += high*low, shift right 57 (track sticky bit)
> Clock 7: Sum[114:0] += high*high.  Do rounding.  Return low [63:0] of result
> Clock 8: Return high part of result (fixing up exponent) to return [127:64]
> 
> The Sum only works on ~115 bits, which is only a little more than the 64-bit
> FP 108 bits needed for 64-bit FMA.
> 
> When the unit receives A_low and B_low, it's getting 7 mantissa bits it won't
> use (since it's just using [56:0] in the multiply).  It needs to save those
> bits to be part of high in the future cycles, since when the register file
> supplies the High parts in later cycles, it will be mantissa[111:64], and we
> need mantissa[111:57].
> 
> 128-bit FADD does not break into parts so easily.  So adding 128-bit FP
> really then becomes how much space you want to put into the 128-bit FADD.
> I think it can be done just with extra storage, keeping the adder to
> 115 bits (for FMA), but it requires a late fixup since the magnitude of the
> two items being added isn't known when the add must begin.
> 
> This strategy trades off performance for low resource usage, which I think
> is an excellent tradeoff for 128-bit FP.  And it will have the same power
> profile as 64-bit FP--running 128-bit FP flat out is about the same power as
> running 64-bit FP flat out.
> 
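
A minimal sketch of the four-pass scheme quoted above, using Python integers in place of the hardware datapath. The function names are hypothetical, and exponent handling, rounding, and subnormals are left out; it only checks that accumulating the four 57-bit partial products in the quoted order (with two 57-bit right shifts and a sticky bit) gives the same top bits as one wide multiply.

```python
MASK57 = (1 << 57) - 1

def split_significand(frac112):
    """Split a binary128 fraction field into the low and high pieces.

    Assumes a normal number, so the hidden 1 is attached above bit 111.
    """
    sig = (1 << 112) | frac112      # 113-bit significand
    low = sig & MASK57              # mantissa[56:0]
    high = sig >> 57                # {1, mantissa[111:57]}
    return low, high

def mul_by_parts(frac_a, frac_b):
    """Return (top bits of the product, sticky) using four partial products."""
    a_low, a_high = split_significand(frac_a)
    b_low, b_high = split_significand(frac_b)

    total = a_low * b_low                   # pass 1: low*low
    sticky = (total & MASK57) != 0
    total >>= 57
    total += a_low * b_high                 # pass 2: low*high
    total += a_high * b_low                 # pass 3: high*low
    sticky = sticky or (total & MASK57) != 0
    total >>= 57
    total += a_high * b_high                # pass 4: high*high
    return total, sticky

if __name__ == "__main__":
    fa, fb = (1 << 112) - 1, 0x123456789ABCDEF0123456789ABC
    full = ((1 << 112) | fa) * ((1 << 112) | fb)
    print(mul_by_parts(fa, fb)[0] == full >> 114)
```

The sticky flag here just records whether any of the 114 discarded low bits were nonzero, which is what a later rounding step would need.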

Sort of reminds me of one case where I evaluated the possibility of a 
64-bit hardware multiplier which would internally decompose it into 
32x32->64 bit widening multiplies and add the parts back together.

Then noted the drawback that this wouldn't have been much faster than 
doing it in software (using the same general strategy). Eventually did 
end up adding a (significantly slower, but cheaper) shift-and-add 
hardware multiplier.

The shift-and-add multiplier can be made to do FMUL and FDIV, and in 
theory could be made to function as a more affordable FP128 FPU as well 
(probably also providing Int128 MUL/DIV in the process), but... Would 
likely be slower than doing it in software.

I had started moving away from using it for FDIV mostly because:
   Using it for this is not *that* much faster than trap and emulate;
     Trap is slower, but T&E is within an order of magnitude...
   Once implemented, hot-patching is likely to be faster.
     The downside being mostly that hot-patching is more complex.

Though, for BGBCC, am mostly using runtime calls for FP divide, as in 
this case it is the fastest option, albeit not the most space efficient 
(trap&emulate uses less space, hot patching would be intermediate here, 
but now the cost is paid by the OS).

Though, one other option (IIRC, used once in the past in one of my code 
generators) was to call generated functions which instead behave more 
like instructions: The function itself was wired for fixed input and 
output registers, and would then provide the code to move them into the 
ABI registers and then call the generic function. This sort of thing 
made more sense in a JIT compiler. But, this could consolidate some of 
the space cost of the runtime calls if they happen to use the same 
registers (if not, it is worse than just calling the function using the 
normal ABI).

...

But, progress is slowing down in this space.
My most recent activity was trying to sort out some of the inconsistent 
handling of the TST/NTST instructions in XG3; had ended up in a 
situation where behavior was inconsistent between the CPU core, 
emulator, compiler, and specs regarding which exact behavior was used. 
Partly due to an issue where the emulator had (unintentionally) ended 
up in a situation where which behavior it gives (for true/false 
status) would depend on which registers were used. Now in theory it 
should be sorted out (mostly affected XG3 when using Predication; mostly 
confusion resulting from a bug in the instruction decoding in my emulator).

Technically has led to inconsistent naming with my RV-JX specs, need to 
fully decide on the "polarity" of the TST/NTST mnemonics.

Can note:
   SH  : TST was (Rm&Rn)==0
   BJX2: Intent was:
     2R: TST is ((Rm&Rn)==0)
     3R: TST is ((Rm&Rn)!=0), NTST is ((Rm&Rn)==0)
       BGBCC currently following this pattern internally.
     Current implementation results in the mnemonics being backwards.
       Within the ISA spec, they effectively have the SH behaviors.
     Within the RV-JX spec, they have the intended behaviors.

So, need to decide whether it is better to go back to SH convention (so 
2R and 3R agree on what the mnemonic does), flipping them in the RV-JX 
spec for consistency, leaving BGBCC as the odd-man-out. Or, flip the 
mnemonics in the XG2 and XG3 specs to match actual behavior (well, more 
likely; would match original intention at least).

Well, and for consistency with BTST/BNTST, where BTST is "Branch if 
(Rs&Rt)!=0".

This sort of thing almost feels like an engineering fail though.

Though, mostly unrelated:
In my current 3D engine, I ended up increasing the world height from 128 
to 384 blocks in a way that mostly doesn't break anything, and doesn't 
have much additional memory cost. It uses vertically stacked regions, 
but only generates the upper or lower regions if the player enters them 
(and, if so, fills the sky region with air, or the underground region 
with stone).

In theory, could have gone higher, but ran into a limit mostly with the 
way I was representing world coords:
   (23: 0): X position, as 16.8 fixed point.
   (47:24): Y position, as 16.8 fixed point.
   (63:48): Z position (up/down), originally as 8.8 fixed point.

Increasing the world height was handled by changing the Z position into 
9.7, which mostly worked.
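
As a rough illustration of the layout above, here is a sketch that packs and unpacks such a 64-bit coordinate. The function names, the unsigned wrap-around behavior, and the rounding are assumptions for illustration, not taken from the engine.

```python
def pack_coord(x, y, z, z_frac_bits=7):
    """Pack world coords (in meters) into one 64-bit integer.

    Bits 23:0 hold X as 16.8, bits 47:24 hold Y as 16.8, and bits 63:48
    hold Z with z_frac_bits fraction bits (7 for 9.7, 8 for the older 8.8).
    Values are treated as unsigned and wrap on overflow (an assumption).
    """
    xi = int(round(x * 256)) & 0xFFFFFF
    yi = int(round(y * 256)) & 0xFFFFFF
    zi = int(round(z * (1 << z_frac_bits))) & 0xFFFF
    return xi | (yi << 24) | (zi << 48)

def unpack_coord(v, z_frac_bits=7):
    """Inverse of pack_coord; returns (x, y, z) as floats."""
    x = (v & 0xFFFFFF) / 256
    y = ((v >> 24) & 0xFFFFFF) / 256
    z = ((v >> 48) & 0xFFFF) / (1 << z_frac_bits)
    return x, y, z

if __name__ == "__main__":
    # Z = 300.5 fits in 9.7 but wraps in 8.8.
    print(unpack_coord(pack_coord(1.5, 2.25, 300.5)))
    print(unpack_coord(pack_coord(1.5, 2.25, 300.5, z_frac_bits=8), z_frac_bits=8))
```

Moving one bit from the fraction to the integer part doubles the Z range at the cost of halving the Z resolution, which is the tradeoff described here.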

Had noted that I can't really go to 10.6 though, as the loss of 
precision in this case is enough to effectively break the ray-casting.

Where, in this case, the 3D engine determines visible blocks by firing 
off rays from the POV of the player, and marking everything that is hit 
by a ray as visible. At 6 bits, the rays stop going up/down as 
effectively, resulting in missing blocks mostly along the ground and 
ceilings (every ground plane block past a certain radius effectively 
disappears, which is very obvious/ugly).

But, I cut the bit off of Z mostly as the player mostly slides along the 
ground plane, and a loss of Z positional accuracy was likely better here 
than a loss of X or Y positional accuracy.

I am half tempted to consider doing more like my old engine (BT2) and 
going to using a struct with 3x 20.12 coordinates or similar (or, maybe 
16.16, assuming I keep the same world size).

I am left to wonder if precision might be related to some of the other 
"general glitchiness" with the raycast visibility determination for more 
distant objects, or maybe even if it could make sense to abandon 
ray-casting on PC (vs going the Minecraft route of "just throw 
everything within a certain radius at the GPU and let it sort it out").

Though, can note that for similar draw-distances, BT3 currently uses 
around 1/6 as much RAM as Minecraft. Partly as it doesn't need to 
load chunks or generate meshes for parts of the world that are not 
currently visible.

But, does have the issue that as things come into view there may be 
momentary delays and "pop-in" before the visibility determination 
realizes they are visible, and parts of the terrain (usually at a 
distance) repeatedly flickering in and out of visibility.

Also, increasing ray density increases CPU cost without notably solving 
the issue, but this could be due in part to the precision issues.

There was already a trick where the exact origin from which raycasting 
takes place jumps around to better improve ray hits, but likely doesn't 
fully cover for the accuracy issues (and if 6 fractional bits is enough 
to break the ray-casting, means the situation probably isn't exactly 
great with 8 bits).

Where, as noted, in this case, the base unit is a meter, so:
   5 fraction bits: ~ 1.250" ULP (untested, probably worse)
   6 fraction bits: ~ 0.625" ULP (raycast breaks down)
   7 fraction bits: ~ 0.313" ULP (slight increase in raycast issues).
   8 fraction bits: ~ 0.156" ULP (works, OK ish)
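
The step sizes in the list can be recomputed with a few lines. This is only a sketch; the conversion constant and function name are mine, and the listed figures look like they round a meter to about 40 inches, so the computed values come out slightly smaller.

```python
INCHES_PER_METER = 39.3701

def ulp_inches(frac_bits):
    """Size of one fixed-point step, in inches, when the base unit is a meter."""
    return INCHES_PER_METER / (1 << frac_bits)

if __name__ == "__main__":
    for bits in (5, 6, 7, 8):
        print(f"{bits} fraction bits: ~{ulp_inches(bits):.3f} in per step")
```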

The world coords basically need to be able to give coordinates anywhere 
in the world, and due to world structure, in this case fixed-point is 
preferable to floating-point. Note that it is partly independent from 
the coordinate space used for local rendering (camera relative).

Well, there is also the wonk that somewhere along the development path, 
the BT3 engine ended up using a left-handed coordinate space. It mostly 
doesn't matter, except when writing scripts to build structures and 
place items, where the X axis is backwards. If one imagines a structure 
placed top-down with +Y as "up", then +X is left.

Could place a structure as +X in which case +Y is right, so slightly 
less wonky, but then X/Y are backwards. Almost tempting to consider 
adding an option to switch X/Y here, to at least allow scripts 
optionally to pretend it is in a right-hand space. Could switch the 
engine (effectively leading to wonk in the terrain system and/or 
mirroring the world), possibly a bigger hassle.

Though, I think part of it may have been that when I was originally 
writing it, it was for TKRA-GL, and early on it ended up with the OpenGL 
effectively flipped vertically (initial development using a +Y=down 
framebuffer, but GL still doing all math as-if it were a +Y=up 
framebuffer); and I later ended up needing to flip it vertically to work 
with more normal OpenGL behavior (but, in the process of this wonk, had 
ended up in a LHS coordinate space).

Well, say, where traditional hardware framebuffers had their origin in 
the top-left corner (so +Y=down), but Windows/BMP/etc, have (0,0) in the 
lower left, and typically for GL one assumes +Y is up. But, wonk results 
if one does it in a +Y=down context and later flips the rendering so 
that the image isn't upside down.

Note that the later move to TKGDI in TestKern (vs directly sending the 
framebuffer to the HW) also switched to origin in the lower-left (but, 
then one also ends up with inconsistency in image file formats as to the 
relative assumptions about raster order).

...

Other related mysteries are whether for storing something resembling 2D 
images, if the RLE compression could be improved significantly by moving 
from raster to Hilbert order, or if the added complexity of Hilbert 
order would cause it to make more sense to just go over to LZ77 or similar.
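
One way to poke at the Hilbert-vs-raster question is to count RLE runs under both orderings. The sketch below uses the usual iterative index-to-coordinate mapping for a Hilbert curve on a power-of-two square; the helper names are hypothetical, and it says nothing about which order wins on real images.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two and 0 <= d < n*n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def count_runs(values):
    """Number of RLE runs (maximal stretches of equal values)."""
    runs = 0
    prev = object()                      # sentinel that equals nothing
    for v in values:
        if v != prev:
            runs += 1
            prev = v
    return runs

def raster_order(img):
    return [p for row in img for p in row]

def hilbert_order(img):
    n = len(img)
    return [img[y][x] for x, y in (hilbert_d2xy(n, d) for d in range(n * n))]

if __name__ == "__main__":
    n = 16
    # Test image: a filled disc on a blank background.
    img = [[1 if (x - 8) ** 2 + (y - 8) ** 2 < 30 else 0 for x in range(n)]
           for y in range(n)]
    print("raster runs :", count_runs(raster_order(img)))
    print("hilbert runs:", count_runs(hilbert_order(img)))
```

Fewer runs only helps if the per-run encoding cost stays the same, so this is a first-order check at best.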

But, yeah, ...

...


#114789 — Re: floating point history, word order and byte order

From	MitchAlsup <user5857@newsgrouper.org.invalid>
Date	2026-01-28 23:03 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<1769641436-5857@newsgrouper.org>
In reply to	#114788

BGB <cr88192@gmail.com> posted:

> On 1/28/2026 7:25 AM, Kent Dickey wrote:
>
> 
> Sort of reminds me of one case where I evaluated the possibility of a 
> 64-bit hardware multiplier which would internally decompose it into 
> 32x32->64 bit widening multiplies and add the parts back together.
> 
> Then noted the drawback that this wouldn't have been much faster than 
> doing it in software (using the same general strategy). Eventually did 
> end up adding a (significantly slower, but cheaper) shift-and-add 
> hardware multiplier.

Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
FP32 was 4 cycles, FP64 was 7 cycles.

When you wanted 32×32->64 there was a 12-cycle instruction sequence
that would provide--and yes it required extracting 16-bit partials
multiplying 4 of them and adding them all up.
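
The arithmetic behind such a sequence can be sketched as follows. This is not the actual 88100 instruction sequence, just the same decomposition written with Python integers; the function name is made up.

```python
def mul32x32_via_16(a, b):
    """Unsigned 32x32->64 multiply built from four 16x16->32 partial products."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF

    ll = a_lo * b_lo            # weight 2^0
    lh = a_lo * b_hi            # weight 2^16
    hl = a_hi * b_lo            # weight 2^16
    hh = a_hi * b_hi            # weight 2^32

    return (ll + ((lh + hl) << 16) + (hh << 32)) & 0xFFFFFFFFFFFFFFFF

if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    print(hex(mul32x32_via_16(a, b)), mul32x32_via_16(a, b) == a * b)
```

On real hardware the two middle terms need carry handling when they are added into a 32-bit register pair, which is where much of the instruction count goes.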


#114790 — Re: floating point history, word order and byte order

From	BGB <cr88192@gmail.com>
Date	2026-01-28 17:43 -0600
Subject	Re: floating point history, word order and byte order
Message-ID	<10le73e$tb6q$1@dont-email.me>
In reply to	#114789

On 1/28/2026 5:03 PM, MitchAlsup wrote:
> 
> BGB <cr88192@gmail.com> posted:
> 
>> On 1/28/2026 7:25 AM, Kent Dickey wrote:
>>
>>
>> Sort of reminds me of one case where I evaluated the possibility of a
>> 64-bit hardware multiplier which would internally decompose it into
>> 32x32->64 bit widening multiplies and add the parts back together.
>>
>> Then noted the drawback that this wouldn't have been much faster than
>> doing it in software (using the same general strategy). Eventually did
>> end up adding a (significantly slower, but cheaper) shift-and-add
>> hardware multiplier.
> 
> Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
> FP32 was 4 cycles, FP64 was 7 cycles.
> 
> When you wanted 32×32->64 there was a 12-cycle instruction sequence
> that would provide--and yes it required extracting 16-bit partials
> multiplying 4 of them and adding them all up.
> 

Similar here:
   32*32=>64: 3-cycle, pipelined;
   Considered hard-wired logic mechanism:
     ~ 12 cycles;
   Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
   Shift-and-add: 68 cycles (same as DIV/REM).
     But, easier to justify the LUTs in the name of RV 'M' support.
     Still faster than trap and emulate.

Where 64-bit integer MUL and DIV are not quite rare enough for trap 
and emulate to be acceptable from a performance POV. The slow hardware 
integer divide did manage to outperform using a software 
shift-and-subtract loop though (so had that much going for it at least).
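
For reference, a software shift-and-subtract (restoring) divide looks roughly like the sketch below. It is unsigned only and the names are illustrative; it produces one quotient bit per loop iteration, which is the same one-bit-per-step shape as the hardware unit described above.

```python
def shift_subtract_divmod(n, d, bits=64):
    """Unsigned restoring division: returns (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    q = 0
    r = 0
    for i in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # shift in the next dividend bit
        q <<= 1
        if r >= d:                      # trial subtract succeeds
            r -= d
            q |= 1
    return q, r

if __name__ == "__main__":
    print(shift_subtract_divmod(1000003, 97) == divmod(1000003, 97))
```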

For Binary64, this unit is around 112 cycles for FDIV (due to quirks).

In the past, Hardware Newton-Raphson is an option, but is more 
complicated and expensive to make it work well.

The FMUL is a fair bit faster, and this means software Newton-Raphson is 
still the most attractive option from the performance POV.
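
A minimal sketch of software Newton-Raphson division, with Python doubles standing in for the FPU's multiply and add. The linear initial estimate and the iteration count are conventional choices of mine, not taken from any runtime mentioned here, and the result is not guaranteed to be correctly rounded.

```python
import math

def nr_divide(a, d, iterations=4):
    """Approximate a / d using only multiplies and adds in the main loop."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if d < 0 else 1.0
    m, e = math.frexp(abs(d))               # abs(d) = m * 2**e, m in [0.5, 1)
    x = 48.0 / 17.0 - (32.0 / 17.0) * m     # initial estimate of 1/m
    for _ in range(iterations):
        x = x * (2.0 - m * x)               # error roughly squares each step
    return sign * math.ldexp(a * x, -e)     # undo the exponent scaling

if __name__ == "__main__":
    print(nr_divide(1.0, 3.0), 1.0 / 3.0)
```

Each iteration costs two multiplies and a subtract, so the total time scales directly with FMUL latency.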

If done for Binary128, would be around 228 cycles for FMUL and FDIV, 
assuming the Shift-and-Add unit remains 1 bit per cycle.
There is concern that internal latency could require 0.5 bit/cycle, or, 
would-be 456 cycles.

If it were 456 cycles, may as well just use trap-and-emulate at that 
point...

In the latter case, just using the 32-bit widening integer multiplier to 
implement the Binary128 FMUL and using Newton-Raphson is likely to be 
faster.

Main merit of Binary128 though being that "long double" is so 
infrequently used that it almost doesn't matter if it is glacially slow 
(even more so with FDIV, which for many programs might not happen at all).

...


#114792 — Re: floating point history, word order and byte order

From	Robert Finch <robfi680@gmail.com>
Date	2026-01-29 03:47 -0500
Subject	Re: floating point history, word order and byte order
Message-ID	<10lf6re$15h6s$1@dont-email.me>
In reply to	#114790

On 2026-01-28 6:43 p.m., BGB wrote:
> On 1/28/2026 5:03 PM, MitchAlsup wrote:
>>
>> BGB <cr88192@gmail.com> posted:
>>
>>> On 1/28/2026 7:25 AM, Kent Dickey wrote:
>>>
>>>
>>> Sort of reminds me of one case where I evaluated the possibility of a
>>> 64-bit hardware multiplier which would internally decompose it into
>>> 32x32->64 bit widening multiplies and add the parts back together.
>>>
>>> Then noted the drawback that this wouldn't have been much faster than
>>> doing it in software (using the same general strategy). Eventually did
>>> end up adding a (significantly slower, but cheaper) shift-and-add
>>> hardware multiplier.
>>
>> Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
>> FP32 was 4 cycles, FP64 was 7 cycles.
>>
>> When you wanted 32×32->64 there was a 12-cycle instruction sequence
>> that would provide--and yes it required extracting 16-bit partials
>> multiplying 4 of them and adding them all up.
>>
> 
> Similar here:
>    32*32=>64: 3-cycle, pipelined;
>    Considered hard-wired logic mechanism:
>      ~ 12 cycles;
>    Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
>    Shift-and-add: 68 cycles (same as DIV/REM).
>      But, easier to justify the LUTs in the name of RV 'M' support.
>      Still faster than trap and emulate.
> 
> Where 64-bit integer MUL and DIV being not quite rare enough for trap 
> and emulate to be acceptable from a performance POV. The slow hardware 
> integer divide did manage to outperform using a software shift-and- 
> subtract loop though (so had that much going for it at least).
> 
> 
> For Binary64, this unit is around 112 cycles for FDIV (due to quirks).
> 
> In the past, Hardware Newton-Raphson is an option, but is more 
> complicated and expensive to make it work well.
> 
> The FMUL is a fair bit faster, and this means software Newton-Raphson is 
> still the most attractive option from the performance POV.
> 
> 
> 
> 
> If done for Binary128, would be around 228 cycles for FMUL and FDIV, 
> assuming the Shift-and-Add unit remains 1 bit per cycle.
> There is concern that internal latency could require 0.5 bit/cycle, or, 
> would-be 456 cycles.
> 
> If it were 456 cycles, may as well just use trap-and-emulate at that 
> point...
> 
> In the latter case, just using the 32-bit widening integer multiplier to 
> implement the Binary128 FMUL and using Newton-Raphson is likely to be 
> faster.
> 
> Main merit of Binary128 though being that "long double" is so 
> infrequently used that it almost doesn't matter if it is glacially slow 
> (even more so with FDIV, which for many programs might not happen at all).
> 
> ...
> 
> 

I seem to find that it is difficult to get better performance for FDIV 
than using a simple divider.

FMA has a latency of about 40 clocks at 300 MHz (or 20 CPU clocks). So 
performing three or four iterations of NR in software (60 to 80 clocks) 
is just about as time consuming as using a divider.

For FDIV (or FMUL) with a radix-2 divide it can probably operate at 
double the CPU clock frequency. For instance the FDIV in my float 
package runs at almost 300 MHz. But the CPU can only be clocked about 
100 MHz. So a double-frequency clock is used for FDIV. This cuts the 
relative latency in half. (60 CPU clocks).

I could maybe better balance the timing in the FMA to reduce the latency 
somewhat and still keep the same FMAX. The 64x64 multiply has by itself 
about 11 cycles of latency. Built up out of 16x16 multipliers.


#114798 — Re: floating point history, word order and byte order

From	MitchAlsup <user5857@newsgrouper.org.invalid>
Date	2026-01-29 21:30 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<1769722249-5857@newsgrouper.org>
In reply to	#114792

Robert Finch <robfi680@gmail.com> posted:

> On 2026-01-28 6:43 p.m., BGB wrote:
> > On 1/28/2026 5:03 PM, MitchAlsup wrote:
> >>
> >> BGB <cr88192@gmail.com> posted:
> >>
> >>> On 1/28/2026 7:25 AM, Kent Dickey wrote:
> >>>
> >>>
> >>> Sort of reminds me of one case where I evaluated the possibility of a
> >>> 64-bit hardware multiplier which would internally decompose it into
> >>> 32x32->64 bit widening multiplies and add the parts back together.
> >>>
> >>> Then noted the drawback that this wouldn't have been much faster than
> >>> doing it in software (using the same general strategy). Eventually did
> >>> end up adding a (significantly slower, but cheaper) shift-and-add
> >>> hardware multiplier.
> >>
> >> Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
> >> FP32 was 4 cycles, FP64 was 7 cycles.
> >>
> >> When you wanted 32×32->64 there was a 12-cycle instruction sequence
> >> that would provide--and yes it required extracting 16-bit partials
> >> multiplying 4 of them and adding them all up.
> >>
> > 
> > Similar here:
> >    32*32=>64: 3-cycle, pipelined;
> >    Considered hard-wired logic mechanism:
> >      ~ 12 cycles;
> >    Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
> >    Shift-and-add: 68 cycles (same as DIV/REM).
> >      But, easier to justify the LUTs in the name of RV 'M' support.
> >      Still faster than trap and emulate.
> > 
> > Where 64-bit integer MUL and DIV being not quite rare enough for trap 
> > and emulate to be acceptable from a performance POV. The slow hardware 
> > integer divide did manage to outperform using a software shift-and- 
> > subtract loop though (so had that much going for it at least).
> > 
> > 
> > For Binary64, this unit is around 112 cycles for FDIV (due to quirks).
> > 
> > In the past, Hardware Newton-Raphson is an option, but is more 
> > complicated and expensive to make it work well.
> > 
> > The FMUL is a fair bit faster, and this means software Newton-Raphson is 
> > still the most attractive option from the performance POV.
> > 
> > 
> > 
> > 
> > If done for Binary128, would be around 228 cycles for FMUL and FDIV, 
> > assuming the Shift-and-Add unit remains 1 bit per cycle.
> > There is concern that internal latency could require 0.5 bit/cycle, or, 
> > would-be 456 cycles.
> > 
> > If it were 456 cycles, may as well just use trap-and-emulate at that 
> > point...
> > 
> > In the latter case, just using the 32-bit widening integer multiplier to 
> > implement the Binary128 FMUL and using Newton-Raphson is likely to be 
> > faster.
> > 
> > Main merit of Binary128 though being that "long double" is so 
> > infrequently used that it almost doesn't matter if it is glacially slow 
> > (even more so with FDIV, which for many programs might not happen at all).
> > 
> > ...
> > 
> > 
> 
> I seem to find that it is difficult to get better performance for FDIV 
> than using a simple divider.
> 
> FMA has a latency of about 40 clocks at 300 MHz (or 20 CPU clocks). So 
> performing three or four iterations of NR in software (60 to 80 clocks) 
> is just about as time consuming as using a divider.
> 
> For FDIV (or FMUL) with a radix-2 divide it can probably operate at 
> double the CPU clock frequency. For instance the FDIV in my float 
> package runs at almost 300 MHz. But the CPU can only be clocked about 
> 100 MHz. So a double-frequency clock is used for FDIV. This cuts the 
> relative latency in half. (60 CPU clocks).

An SRT step (iteration) can be done several times per cycle,
3 steps per 16 gate cycle is not that hard.
4 steps per 16 gate cycle is on the edge of doable.

64-bit div is thus on the order of 23-cycles (64/3=21+2 pipeline)
whereas a Goldschmidt with NR correction is 17 cycles IEEE correct
where one knows they are within 1 ULP at cycle 12.
 
> I could maybe better balance the timing in the FMA to reduce the latency 
> somewhat and still keep the same FMAX. The 64x64 multiply has by itself 
> about 11 cycles of latency. Built up out of 16x16 multipliers.

I suspect they are making you eat the 32-bit adder from each 16×16
instead of doing every thing in carry-save format until the final add.

A 64×32 Booth recoded Dadda/Wallace tree is only 5-layers of 4-2 
compressors {or 10-gates of delay (after recoder fanout)} plus a 
128-bit adder (of your choice) gate delay (say 11-gates of delay);
for a total multiply time of 21 gates or 1.5 cycles.

Add the FP multiplexers, Booth recoding, find first for normalization, 
and you are sitting at 3.3 cycles PLUS wire delay.


#114801 — Re: floating point history, word order and byte order

From	BGB <cr88192@gmail.com>
Date	2026-01-29 17:44 -0600
Subject	Re: floating point history, word order and byte order
Message-ID	<10lgrgt$1ms7l$2@dont-email.me>
In reply to	#114792

On 1/29/2026 2:47 AM, Robert Finch wrote:
> On 2026-01-28 6:43 p.m., BGB wrote:
>> On 1/28/2026 5:03 PM, MitchAlsup wrote:
>>>
>>> BGB <cr88192@gmail.com> posted:
>>>
>>>> On 1/28/2026 7:25 AM, Kent Dickey wrote:
>>>>
>>>>
>>>> Sort of reminds me of one case where I evaluated the possibility of a
>>>> 64-bit hardware multiplier which would internally decompose it into
>>>> 32x32->64 bit widening multiplies and add the parts back together.
>>>>
>>>> Then noted the drawback that this wouldn't have been much faster than
>>>> doing it in software (using the same general strategy). Eventually did
>>>> end up adding a (significantly slower, but cheaper) shift-and-add
>>>> hardware multiplier.
>>>
>>> Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
>>> FP32 was 4 cycles, FP64 was 7 cycles.
>>>
>>> When you wanted 32×32->64 there was a 12-cycle instruction sequence
>>> that would provide--and yes it required extracting 16-bit partials
>>> multiplying 4 of them and adding them all up.
>>>
>>
>> Similar here:
>>    32*32=>64: 3-cycle, pipelined;
>>    Considered hard-wired logic mechanism:
>>      ~ 12 cycles;
>>    Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
>>    Shift-and-add: 68 cycles (same as DIV/REM).
>>      But, easier to justify the LUTs in the name of RV 'M' support.
>>      Still faster than trap and emulate.
>>
>> Where 64-bit integer MUL and DIV being not quite rare enough for trap 
>> and emulate to be acceptable from a performance POV. The slow hardware 
>> integer divide did manage to outperform using a software shift-and- 
>> subtract loop though (so had that much going for it at least).
>>
>>
>> For Binary64, this unit is around 112 cycles for FDIV (due to quirks).
>>
>> In the past, Hardware Newton-Raphson is an option, but is more 
>> complicated and expensive to make it work well.
>>
>> The FMUL is a fair bit faster, and this means software Newton-Raphson 
>> is still the most attractive option from the performance POV.
>>
>>
>>
>>
>> If done for Binary128, would be around 228 cycles for FMUL and FDIV, 
>> assuming the Shift-and-Add unit remains 1 bit per cycle.
>> There is concern that internal latency could require 0.5 bit/cycle, 
>> or, would-be 456 cycles.
>>
>> If it were 456 cycles, may as well just use trap-and-emulate at that 
>> point...
>>
>> In the latter case, just using the 32-bit widening integer multiplier 
>> to implement the Binary128 FMUL and using Newton-Raphson is likely to 
>> be faster.
>>
>> Main merit of Binary128 though being that "long double" is so 
>> infrequently used that it almost doesn't matter if it is glacially 
>> slow (even more so with FDIV, which for many programs might not happen 
>> at all).
>>
>> ...
>>
>>
> 
> I seem to find that it is difficult to get better performance for FDIV 
> than using a simple divider.
> 
> FMA has a latency of about 40 clocks at 300 MHz (or 20 CPU clocks). So 
> performing three or four iterations of NR in software (60 to 80 clocks) 
> is just about as time consuming as using a divider.
> 
> For FDIV (or FMUL) with a radix-2 divide it can probably operate at 
> double the CPU clock frequency. For instance the FDIV in my float 
> package runs at almost 300 MHz. But the CPU can only be clocked about 
> 100 MHz. So a double-frequency clock is used for FDIV. This cuts the 
> relative latency in half. (60 CPU clocks).
> 
> I could maybe better balance the timing in the FMA to reduce the latency 
> somewhat and still keep the same FMAX. The 64x64 multiply has by itself 
> about 11 cycles of latency. Built up out of 16x16 multipliers.
> 

OK, I have:
   Binary64 FMUL: 6 cycles
   Binary64 FADD: 6 cycles (incl FSUB, Int<->FP)
Via SIMD Unit:
   Binary32 FMUL: 3 cycles (incl SIMD)
   Binary32 FADD: 3 cycles (incl SIMD)
   FMULA/FADDA: Also 3 cycles (Binary64 format at Binary32 precision).

This mostly leaves N-R as the fastest strategy in this case.

No FMA as there isn't really a good way to get the latency low enough 
except in a very niche case of FP8*FP8+FP16, but this would likely only 
really be useful for NN's or similar (not as useful as a general purpose 
SIMD instruction).

Granted, FP8 for inputs/weights and FP16 accumulators does seem to be a 
fairly effective approach for NN's.

...



#114673 — Re: floating point history, word order and byte order

From	Terje Mathisen <terje.mathisen@tmsw.no>
Date	2026-01-07 18:56 +0100
Subject	Re: floating point history, word order and byte order
Message-ID	<10jm6nt$rrh6$1@dont-email.me>
In reply to	#114661

Michael S wrote:
> On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
> jgd@cix.co.uk (John Dallman) wrote:
> 
>> In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
>> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>
>>> Possibly.  But the lack of takeup of the Intel library and of the
>>> gcc support shows that "build it and they will come" does not work
>>> out for DFP.
>>
>> The world has got very used to IEEE BFP, and has solutions that work
>> acceptably with it. Lots of organisations don't see anything obvious
>> for them in DFP.
>>
>> The thing I'd like to try out is fast quad-precision BFP. For the
>> field I work in, that would make some things much simpler. I did try
>> to interest AMD in the idea in the early days of x86-64, but they
>> didn't bite.
>>
>> John
> 
> I already asked you couple of years ago how fast do want binary128 in
> order to consider it fast enough.
> IIRC, you either avoided the answer completely or gave totally
> unrealistic answer like "the same as binary64".
> May be, nobody bites because with non-answers or answers like that
> nobody thinks that you are serious?
> 
> Anecdote.
> Few months ago I tried to design very long decimation filters with stop
> band attenuation of ~160 dB.
> Matlab's implementation of Parks–McClellan algorithm (a customized
> variation of Remez Exchange spiced with a small portion of black magic)
> was not up to the task. Gnu Octave implementation was somewhat worse
> yet.
> When I started to investigate the reasons I found out that there were
> actually two of them, both related to insufficient precision of the
> series of DP FP calculations.
> The first was Discrete Cosine Transform and underlying FFT engine for N
> around 32K.
> The second was solving system of linear equations for N around 1000 or
> a little more.
> In both cases precision of DP FP was perfectly sufficient both for
> inputs and for outputs. But errors accumulated in intermediate caused
> troubles.
> 
> In both cases quad-precision FP was key to solution.
> 
> For DCT (FFT) I went for full re-implementation at higher precision.
> 
> For Linear Solver, I left LU decomposition, which happens to be the
> heavy O(N**3) part in DP. Quad-precision was applied only during final
> solver stages - forward propagation, back propagation, calculation of
> residual error vector and repetition of forward and back propagation.
> All those parts are O(N**2). That modification was sufficient to
> improve precision of result almost to the best possible in DP FP format.
> And sufficient for good convergence of Parks–McClellan algorithm.
> 
> 
> So, what is a point of my anecdote?
> The speed of quad-precision FP was never an obstacle, even when running
> on rather old hardware. And it's not like calculations here were not
> heavy. They were heavy all right, thank you very much.
> Using quad-precision only when necessary helped.
> But what helped more is not being hesitant. Doing things instead of
> worrying that they would be too slow.
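
The linear-solver part of the quoted anecdote resembles classic iterative refinement, which can be sketched as below. This is an illustration under assumptions: the LU factorization and triangular solves are in double precision, and only the residual is computed at higher precision, with exact rationals standing in for quad precision. The quoted description also ran the forward and back substitution in quad precision, which this sketch does not do. All function names are hypothetical.

```python
from fractions import Fraction

def lu_decompose(a):
    """Double-precision LU with partial pivoting. Returns (lu, perm)."""
    n = len(a)
    lu = [list(map(float, row)) for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):           # the O(N**3) part
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm

def lu_solve(lu, perm, b):
    """Forward then back substitution, O(N**2)."""
    n = len(lu)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x

def residual_high_precision(a, x, b):
    """b - A*x computed exactly, then rounded once to double."""
    return [float(Fraction(bi) - sum(Fraction(aij) * Fraction(xj)
                                     for aij, xj in zip(row, x)))
            for row, bi in zip(a, b)]

def solve_refined(a, b, steps=2):
    """Solve A x = b, reusing one LU factorization for each refinement step."""
    lu, perm = lu_decompose(a)
    x = lu_solve(lu, perm, b)
    for _ in range(steps):
        r = residual_high_precision(a, x, b)
        dx = lu_solve(lu, perm, r)
        x = [xi + di for xi, di in zip(x, dx)]
    return x

if __name__ == "__main__":
    a = [[4, 3, 0], [3, 4, -1], [0, -1, 4]]
    b = [24, 30, -24]
    print(solve_refined(a, b))
```

The point of the structure is that the expensive factorization is done once in the cheap format, and the extra precision is spent only on O(N**2) work.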

I think the main issue is similar to what we had before 754, i.e every 
fp programmer needed to also be a fp analyst, capable of carrying out 
error budget calculation across their algorithms.

You can obviously do that, and so can a number of regulars here, but in 
the real world we are in a _very_ small minority.

For the rest, just having fp128 fast enough that it could be 
applied naively would solve a number of problems.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


#114684 — Re: floating point history, word order and byte order

From	BGB <cr88192@gmail.com>
Date	2026-01-07 14:38 -0600
Subject	Re: floating point history, word order and byte order
Message-ID	<10jmg8a$vf5g$2@dont-email.me>
In reply to	#114673

On 1/7/2026 11:56 AM, Terje Mathisen wrote:
> Michael S wrote:
>> On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
>> jgd@cix.co.uk (John Dallman) wrote:
>>
>>> In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
>>> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>>
>>>> Possibly.  But the lack of takeup of the Intel library and of the
>>>> gcc support shows that "build it and they will come" does not work
>>>> out for DFP.
>>>
>>> The world has got very used to IEEE BFP, and has solutions that work
>>> acceptably with it. Lots of organisations don't see anything obvious
>>> for them in DFP.
>>>
>>> The thing I'd like to try out is fast quad-precision BFP. For the
>>> field I work in, that would make some things much simpler. I did try
>>> to interest AMD in the idea in the early days of x86-64, but they
>>> didn't bite.
>>>
>>> John
>>
>> I already asked you couple of years ago how fast do want binary128 in
>> order to consider it fast enough.
>> IIRC, you either avoided the answer completely or gave totally
>> unrealistic answer like "the same as binary64".
>> May be, nobody bites because with non-answers or answers like that
>> nobody thinks that you are serious?
>>
>> Anecdote.
>> Few months ago I tried to design very long decimation filters with stop
>> band attenuation of ~160 dB.
>> Matlab's implementation of Parks–McClellan algorithm (a customized
>> variation of Remez Exchange spiced with a small portion of black magic)
>> was not up to the task. Gnu Octave implementation was somewhat worse
>> yet.
>> When I started to investigate the reasons I found out that there were
>> actually two of them, both related to insufficient precision of the
>> series of DP FP calculations.
>> The first was Discrete Cosine Transform and underlying FFT engine for N
>> around 32K.
>> The second was solving system of linear equations for N around 1000 or
>> a little more.
>> In both cases precision of DP FP was perfectly sufficient both for
>> inputs and for outputs. But errors accumulated in intermediate caused
>> troubles.
>>
>> In both cases quad-precision FP was key to solution.
>>
>> For DCT (FFT) I went for full re-implementation at higher precision.
>>
>> For Linear Solver, I left LU decomposition, which happens to be the
>> heavy O(N**3) part in DP. Quad-precision was applied only during final
>> solver stages - forward propagation, back propagation, calculation of
>> residual error vector and repetition of forward and back propagation.
>> All those parts are O(N**2). That modification was sufficient to
>> improve precision of result almost to the best possible in DP FP format.
>> And sufficient for good convergence of Parks–McClellan algorithm.
>>
>>
>> So, what is a point of my anecdote?
>> The speed of quad-precision FP was never an obstacle, even when running
>> on rather old hardware. And it's not like calculations here were not
>> heavy. They were heavy all right, thank you very much.
>> Using quad-precision only when necessary helped.
>> But what helped more is not being hesitant. Doing things instead of
>> worrying that they would be too slow.
> 
> I think the main issue is similar to what we had before 754, i.e. every 
> fp programmer needed to also be a fp analyst, capable of carrying out 
> error budget calculation across their algorithms.
> 
> You can obviously do that, and so can a number of regulars here, but in 
> the real world we are in a _very_ small minority.
> 
> For the rest, just having fp128 fast enough that it could be 
> applied naively would solve a number of problems.
> 

As I see it, FP128 is fast enough for practical use even with a 
software-only implementation (though, in part due to its relatively low 
usage frequency; if it is used, it is mostly for cases that actually 
need precision, rather than high throughput, with high-throughput cases 
likely to remain dominated by smaller types, like Binary32 and Binary16; 
with Binary64 more remaining as the "de-facto default" precision for 
floating-point).

As can be noted, in my case, it was a partial motivation for supporting 
things like 128-bit integer instructions (in my C compiler, and 
optionally in the underlying ISA), as supporting Int128 ops is a step 
towards making doing Binary128 in software more practical (without the 
steep cost of a 128-bit FPU).

...


> Terje
>


#114685 — Re: floating point history, word order and byte order

From	MitchAlsup <user5857@newsgrouper.org.invalid>
Date	2026-01-07 21:18 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<1767820734-5857@newsgrouper.org>
In reply to	#114684

BGB <cr88192@gmail.com> posted:

> On 1/7/2026 11:56 AM, Terje Mathisen wrote:
> > Michael S wrote:
> >> On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
> >> jgd@cix.co.uk (John Dallman) wrote:
> >>
> >>> In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
> >>> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >>>
> >>>> Possibly.  But the lack of takeup of the Intel library and of the
> >>>> gcc support shows that "build it and they will come" does not work
> >>>> out for DFP.
> >>>
> >>> The world has got very used to IEEE BFP, and has solutions that work
> >>> acceptably with it. Lots of organisations don't see anything obvious
> >>> for them in DFP.
> >>>
> >>> The thing I'd like to try out is fast quad-precision BFP. For the
> >>> field I work in, that would make some things much simpler. I did try
> >>> to interest AMD in the idea in the early days of x86-64, but they
> >>> didn't bite.
> >>>
> >>> John
> >>
> >> I already asked you couple of years ago how fast do want binary128 in
> >> order to consider it fast enough.
> >> IIRC, you either avoided the answer completely or gave totally
> >> unrealistic answer like "the same as binary64".
> >> May be, nobody bites because with non-answers or answers like that
> >> nobody thinks that you are serious?
> >>
> >> Anecdote.
> >> Few months ago I tried to design very long decimation filters with stop
> >> band attenuation of ~160 dB.
> >> Matlab's implementation of Parks–McClellan algorithm (a customized
> >> variation of Remez Exchange spiced with a small portion of black magic)
> >> was not up to the task. Gnu Octave implementation was somewhat worse
> >> yet.
> >> When I started to investigate the reasons I found out that there were
> >> actually two of them, both related to insufficient precision of the
> >> series of DP FP calculations.
> >> The first was Discrete Cosine Transform and underlying FFT engine for N
> >> around 32K.
> >> The second was solving system of linear equations for N around 1000 or
> >> a little more.
> >> In both cases precision of DP FP was perfectly sufficient both for
> >> inputs and for outputs. But errors accumulated in intermediate caused
> >> troubles.
> >>
> >> In both cases quad-precision FP was key to solution.
> >>
> >> For DCT (FFT) I went for full re-implementation at higher precision.
> >>
> >> For Linear Solver, I left LU decomposition, which happens to be the
> >> heavy O(N**3) part in DP. Quad-precision was applied only during final
> >> solver stages - forward propagation, back propagation, calculation of
> >> residual error vector and repetition of forward and back propagation.
> >> All those parts are O(N**2). That modification was sufficient to
> >> improve precision of result almost to the best possible in DP FP format.
> >> And sufficient for good convergence of Parks–McClellan algorithm.
> >>
> >>
> >> So, what is a point of my anecdote?
> >> The speed of quad-precision FP was never an obstacle, even when running
> >> on rather old hardware. And it's not like calculations here were not
> >> heavy. They were heavy all right, thank you very much.
> >> Using quad-precision only when necessary helped.
> >> But what helped more is not being hesitant. Doing things instead of
> >> worrying that they would be too slow.
> > 
> > I think the main issue is similar to what we had before 754, i.e. every 
> > fp programmer needed to also be a fp analyst, capable of carrying out 
> > error budget calculation across their algorithms.
> > 
> > You can obviously do that, and so can a number of regulars here, but in 
> > the real world we are in a _very_ small minority.
> > 
> > For the rest, just having fp128 fast enough that it could be 
> > applied naively would solve a number of problems.
> > 
> 
> As I see it, FP128 is fast enough for practical use even with a 
> software-only implementation (though, in part due to its relatively low 
> usage frequency; if it is used, it is mostly for cases that actually 
> need precision, rather than high throughput, with high-throughput cases 
> likely to remain dominated by smaller types, like Binary32 and Binary16; 
> with Binary64 more remaining as the "de-facto default" precision for 
> floating-point).

{To date::}
My only use for 128-bit FP was to compute Chebyshev Coefficients for
my high speed DP Transcendentals. I only needed 64-bits of fractions
but, in practice, 80-bit FP was only giving me 63-bits of precision.
Since these are a) compute once b) use infinitely many times; the
speed of 128-bit FP is completely irrelevant.

> As can be noted, in my case, it was a partial motivation for supporting 
> things like 128-bit integer instructions (in my C compiler, and 
> optionally in the underlying ISA), as supporting Int128 ops is a step 
> towards making doing Binary128 in software more practical (without the 
> steep cost of a 128-bit FPU).

It seems to me that if one has "reasonable" ISA support for tearing a
128-bit FP into {sign, exponent, fraction} and reasonably fast 128-bit
integer support, then emulating 128-bit FP in SW is "not that bad"--
especially if one can do 128×128 -> 256 in 4-8 cycles.
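
To make the "tearing" step concrete, here is a small C sketch of what software has to do without ISA help. The layout is the IEEE 754 binary128 one (1 sign bit, 15 exponent bits with bias 16383, 112 fraction bits); the types b128_bits and b128_fields and the function names are hypothetical, chosen only for this example, and the value is assumed to be held as two 64-bit halves.

```c
#include <assert.h>
#include <stdint.h>

/* Raw binary128 image as two 64-bit halves (hypothetical container). */
typedef struct { uint64_t hi, lo; } b128_bits;

/* Unpacked fields (hypothetical layout for this sketch). */
typedef struct {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 15-bit biased exponent, bias 16383 */
    uint64_t frac_hi;    /* upper 48 bits of the 112-bit fraction */
    uint64_t frac_lo;    /* lower 64 bits of the fraction */
} b128_fields;

/* Split a binary128 image into sign, exponent and fraction. */
static b128_fields b128_unpack(b128_bits v)
{
    b128_fields f;
    f.sign     = (unsigned)(v.hi >> 63);
    f.exponent = (unsigned)((v.hi >> 48) & 0x7FFFu);
    f.frac_hi  = v.hi & 0x0000FFFFFFFFFFFFULL;
    f.frac_lo  = v.lo;
    return f;
}

/* Reassemble the image; the fields must already be in range. */
static b128_bits b128_pack(b128_fields f)
{
    b128_bits v;
    assert(f.sign <= 1 && f.exponent <= 0x7FFFu && (f.frac_hi >> 48) == 0);
    v.hi = ((uint64_t)f.sign << 63) | ((uint64_t)f.exponent << 48) | f.frac_hi;
    v.lo = f.frac_lo;
    return v;
}
```

Each field costs a shift and a mask here, which is the work a bitfield extract/insert instruction would fold into one operation.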
 
> ...
> 
> 
> > Terje
> > 
>


#114687 — Re: floating point history, word order and byte order

From	BGB <cr88192@gmail.com>
Date	2026-01-07 16:10 -0600
Subject	Re: floating point history, word order and byte order
Message-ID	<10jmlk0$11rue$1@dont-email.me>
In reply to	#114685

On 1/7/2026 3:18 PM, MitchAlsup wrote:
> 
> BGB <cr88192@gmail.com> posted:
> 
>> On 1/7/2026 11:56 AM, Terje Mathisen wrote:
>>> Michael S wrote:
>>>> On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
>>>> jgd@cix.co.uk (John Dallman) wrote:
>>>>
>>>>> In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
>>>>> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>>>>
>>>>>> Possibly.  But the lack of takeup of the Intel library and of the
>>>>>> gcc support shows that "build it and they will come" does not work
>>>>>> out for DFP.
>>>>>
>>>>> The world has got very used to IEEE BFP, and has solutions that work
>>>>> acceptably with it. Lots of organisations don't see anything obvious
>>>>> for them in DFP.
>>>>>
>>>>> The thing I'd like to try out is fast quad-precision BFP. For the
>>>>> field I work in, that would make some things much simpler. I did try
>>>>> to interest AMD in the idea in the early days of x86-64, but they
>>>>> didn't bite.
>>>>>
>>>>> John
>>>>
>>>> I already asked you couple of years ago how fast do want binary128 in
>>>> order to consider it fast enough.
>>>> IIRC, you either avoided the answer completely or gave totally
>>>> unrealistic answer like "the same as binary64".
>>>> May be, nobody bites because with non-answers or answers like that
>>>> nobody thinks that you are serious?
>>>>
>>>> Anecdote.
>>>> Few months ago I tried to design very long decimation filters with stop
>>>> band attenuation of ~160 dB.
>>>> Matlab's implementation of Parks–McClellan algorithm (a customized
>>>> variation of Remez Exchange spiced with a small portion of black magic)
>>>> was not up to the task. Gnu Octave implementation was somewhat worse
>>>> yet.
>>>> When I started to investigate the reasons I found out that there were
>>>> actually two of them, both related to insufficient precision of the
>>>> series of DP FP calculations.
>>>> The first was Discrete Cosine Transform and underlying FFT engine for N
>>>> around 32K.
>>>> The second was solving system of linear equations for N around 1000 or
>>>> a little more.
>>>> In both cases precision of DP FP was perfectly sufficient both for
>>>> inputs and for outputs. But errors accumulated in intermediate caused
>>>> troubles.
>>>>
>>>> In both cases quad-precision FP was key to solution.
>>>>
>>>> For DCT (FFT) I went for full re-implementation at higher precision.
>>>>
>>>> For Linear Solver, I left LU decomposition, which happens to be the
>>>> heavy O(N**3) part in DP. Quad-precision was applied only during final
>>>> solver stages - forward propagation, back propagation, calculation of
>>>> residual error vector and repetition of forward and back propagation.
>>>> All those parts are O(N**2). That modification was sufficient to
>>>> improve precision of result almost to the best possible in DP FP format.
>>>> And sufficient for good convergence of Parks–McClellan algorithm.
>>>>

FWIW: For most cases where I had used DCT or FFT, it has almost always 
been with fixed-point integer math...

>>>>
>>>> So, what is a point of my anecdote?
>>>> The speed of quad-precision FP was never an obstacle, even when running
>>>> on rather old hardware. And it's not like calculations here were not
>>>> heavy. They were heavy all right, thank you very much.
>>>> Using quad-precision only when necessary helped.
>>>> But what helped more is not being hesitant. Doing things instead of
>>>> worrying that they would be too slow.
>>>
>>> I think the main issue is similar to what we had before 754, i.e. every
>>> fp programmer needed to also be a fp analyst, capable of carrying out
>>> error budget calculation across their algorithms.
>>>
>>> You can obviously do that, and so can a number of regulars here, but in
>>> the real world we are in a _very_ small minority.
>>>
>>> For the rest, just having fp128 fast enough that it could be
>>> applied naively would solve a number of problems.
>>>
>>
>> As I see it, FP128 is fast enough for practical use even with a
>> software-only implementation (though, in part due to its relatively low
>> usage frequency; if it is used, it is mostly for cases that actually
>> need precision, rather than high throughput, with high-throughput cases
>> likely to remain dominated by smaller types, like Binary32 and Binary16;
>> with Binary64 more remaining as the "de-facto default" precision for
>> floating-point).
> 
> {To date::}
> My only use for 128-bit FP was to compute Chebyshev Coefficients for
> my high speed DP Transcendentals. I only needed 64-bits of fractions
> but, in practice, 80-bit FP was only giving me 63-bits of precision.
> Since these are a) compute once b) use infinitely many times; the
> speed of 128-bit FP is completely irrelevant.
> 

As noted, low usage frequency.

If it is something that mostly applies to initial program startup or 
occasionally in the slow path, that it is "kinda slow" doesn't matter 
too much.

Though, it is starting to seem that "trap and emulate" might still be a 
little too slow, leading to my recent efforts in the direction of 
efficient hot-patching.

Granted, this is more a case of "just sort of pushing the cost somewhere 
else" and in theory, if the compiler knows that the instruction will 
just be patched anyways, it could in principle generate intermediate calls 
for cheaper.

But, for Binary128 there is another factor:
RV64G/RV64GC lacks access to 128-bit integer instructions;
So, it makes sense to instead run this logic in XG3;
But, compiler can't just use XG3, as if it uses any XG3 ops, may as well 
just compile the whole binary as XG3;
So, it makes sense to use XG3 as a "make RV64 less poor" feature, but 
then the compiler can't be allowed to depend on it directly, and at 
least needs to pretend it is living in RV64 land.

But, then, this leads to hot-patch wonk.

>> As can be noted, in my case, it was a partial motivation for supporting
>> things like 128-bit integer instructions (in my C compiler, and
>> optionally in the underlying ISA), as supporting Int128 ops is a step
>> towards making doing Binary128 in software more practical (without the
>> steep cost of a 128-bit FPU).
> 
> It seems to me that if one has "reasonable" ISA support for tearing a
> 128-bit FP into {sign, exponent, fraction} and reasonably fast 128-bit
> integer support, then emulating 128-bit FP in SW is "not that bad"--
> especially if one can do 128×128 -> 256 in 4-8 cycles.
>   

Yeah, this is basically the idea.

Int128 ops, and my BITMOV instructions (which can extract/insert/move 
bitfields within 64 and 128 bit containers; as a combined "Shift and 
masked MUX"), can provide a nice boost here.

Sadly, there is still not really a great way to do a 128x128 => 256 
multiply though. Current fastest option is still to decompose it into a 
crapload of 32x32=>64 bit widening multiply ops (which, ironically, is 
another thing that RV is lacking in; need to use a full 64-bit multiply, 
but there are downsides, more-so when the base ISA is also lacking 
PACK/PACKU).
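
For reference, the decomposition into 32x32=>64 widening multiplies can be sketched in portable C as plain schoolbook multiplication over 32-bit limbs. This is an illustration only, not BGBCC's runtime code: the function name mul_128x128 and the little-endian limb arrays are assumptions made for the example. It uses 16 widening multiplies plus carry propagation.

```c
#include <assert.h>
#include <stdint.h>

/* 128x128 => 256 bit unsigned multiply.
   a, b: four 32-bit limbs each, least significant limb first.
   r:    eight 32-bit limbs, least significant limb first.
   r must not alias a or b. */
static void mul_128x128(const uint32_t a[4], const uint32_t b[4], uint32_t r[8])
{
    assert(r != a && r != b);

    for (int i = 0; i < 8; i++)
        r[i] = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 4; j++) {
            /* 32x32=>64 widening multiply plus two 32-bit addends;
               the sum is at most 2^64 - 1, so it cannot overflow. */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + 4] = (uint32_t)carry;   /* this limb is still zero at this point */
    }
}
```

With a native 64x64=>128 multiply the same loop structure shrinks to 4 multiplies over 64-bit limbs, which is where most of the cost difference comes from.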

Still kinda funny that RV land, with all of its wide industrial support, 
lots of people doing lots of extensions, advanced features, etc. 
Seemingly still fails at making an ISA where "basic things" fit together 
well.

And, then a lot of features going off in rabbit holes like "why would 
you want this?", and then it turns out it is to micro-optimize some 
specific test case within SPECint or something (often, rather than 
finding a more general solution that would address multiple related issues).

More so when the "micro-optimize the benchmark" features were more often 
chosen over the more general purpose "actually address the underlying 
issue" features.

Granted, then someone is almost invariably going to be like "all the 
parts of RV do fit together well, but you are using it wrong...".

But, in this case, would expect GCC to generate smaller binaries than 
BGBCC; leaving me to think it is more a case of "these parts don't fit 
together all that well".

>> ...
>>
>>
>>> Terje
>>>
>>


#114689 — Re: floating point history, word order and byte order

From	MitchAlsup <user5857@newsgrouper.org.invalid>
Date	2026-01-08 00:05 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<1767830733-5857@newsgrouper.org>
In reply to	#114687

BGB <cr88192@gmail.com> posted:

> On 1/7/2026 3:18 PM, MitchAlsup wrote:
> > 
> > BGB <cr88192@gmail.com> posted:
> > 
> >> On 1/7/2026 11:56 AM, Terje Mathisen wrote:
> >>> Michael S wrote:
> >>>> On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
> >>>> jgd@cix.co.uk (John Dallman) wrote:
> >>>>
> >>>>> In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
> >>>>> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >>>>>
> >>>>>> Possibly.  But the lack of takeup of the Intel library and of the
> >>>>>> gcc support shows that "build it and they will come" does not work
> >>>>>> out for DFP.
> >>>>>
> >>>>> The world has got very used to IEEE BFP, and has solutions that work
> >>>>> acceptably with it. Lots of organisations don't see anything obvious
> >>>>> for them in DFP.
> >>>>>
> >>>>> The thing I'd like to try out is fast quad-precision BFP. For the
> >>>>> field I work in, that would make some things much simpler. I did try
> >>>>> to interest AMD in the idea in the early days of x86-64, but they
> >>>>> didn't bite.
> >>>>>
> >>>>> John
> >>>>
> >>>> I already asked you couple of years ago how fast do want binary128 in
> >>>> order to consider it fast enough.
> >>>> IIRC, you either avoided the answer completely or gave totally
> >>>> unrealistic answer like "the same as binary64".
> >>>> May be, nobody bites because with non-answers or answers like that
> >>>> nobody thinks that you are serious?
> >>>>
> >>>> Anecdote.
> >>>> Few months ago I tried to design very long decimation filters with stop
> >>>> band attenuation of ~160 dB.
> >>>> Matlab's implementation of Parks–McClellan algorithm (a customized
> >>>> variation of Remez Exchange spiced with a small portion of black magic)
> >>>> was not up to the task. Gnu Octave implementation was somewhat worse
> >>>> yet.
> >>>> When I started to investigate the reasons I found out that there were
> >>>> actually two of them, both related to insufficient precision of the
> >>>> series of DP FP calculations.
> >>>> The first was Discrete Cosine Transform and underlying FFT engine for N
> >>>> around 32K.
> >>>> The second was solving system of linear equations for N around 1000 or
> >>>> a little more.
> >>>> In both cases precision of DP FP was perfectly sufficient both for
> >>>> inputs and for outputs. But errors accumulated in intermediate caused
> >>>> troubles.
> >>>>
> >>>> In both cases quad-precision FP was key to solution.
> >>>>
> >>>> For DCT (FFT) I went for full re-implementation at higher precision.
> >>>>
> >>>> For Linear Solver, I left LU decomposition, which happens to be the
> >>>> heavy O(N**3) part in DP. Quad-precision was applied only during final
> >>>> solver stages - forward propagation, back propagation, calculation of
> >>>> residual error vector and repetition of forward and back propagation.
> >>>> All those parts are O(N**2). That modification was sufficient to
> >>>> improve precision of result almost to the best possible in DP FP format.
> >>>> And sufficient for good convergence of Parks–McClellan algorithm.
> >>>>
> 
> FWIW: For most cases where I had used DCT or FFT, it has almost always 
> been with fixed-point integer math...
> 
> 
> >>>>
> >>>> So, what is a point of my anecdote?
> >>>> The speed of quad-precision FP was never an obstacle, even when running
> >>>> on rather old hardware. And it's not like calculations here were not
> >>>> heavy. They were heavy all right, thank you very much.
> >>>> Using quad-precision only when necessary helped.
> >>>> But what helped more is not being hesitant. Doing things instead of
> >>>> worrying that they would be too slow.
> >>>
> >>> I think the main issue is similar to what we had before 754, i.e. every
> >>> fp programmer needed to also be a fp analyst, capable of carrying out
> >>> error budget calculation across their algorithms.
> >>>
> >>> You can obviously do that, and so can a number of regulars here, but in
> >>> the real world we are in a _very_ small minority.
> >>>
> >>> For the rest, just having fp128 fast enough that it could be
> >>> applied naively would solve a number of problems.
> >>>
> >>
> >> As I see it, FP128 is fast enough for practical use even with a
> >> software-only implementation (though, in part due to its relatively low
> >> usage frequency; if it is used, it is mostly for cases that actually
> >> need precision, rather than high throughput, with high-throughput cases
> >> likely to remain dominated by smaller types, like Binary32 and Binary16;
> >> with Binary64 more remaining as the "de-facto default" precision for
> >> floating-point).
> > 
> > {To date::}
> > My only use for 128-bit FP was to compute Chebyshev Coefficients for
> > my high speed DP Transcendentals. I only needed 64-bits of fractions
> > but, in practice, 80-bit FP was only giving me 63-bits of precision.
> > Since these are a) compute once b) use infinitely many times; the
> > speed of 128-bit FP is completely irrelevant.
> > 
> 
> As noted, low usage frequency.
> 
> If it is something that mostly applies to initial program startup or 
> occasionally in the slow path, that it is "kinda slow" doesn't matter 
> too much.
> 
> Though, it is starting to seem that "trap and emulate" might still be a 
> little too slow, leading to my recent efforts in the direction of 
> efficient hot-patching.

Depends on the speed of T&E. If privilege control transfer is 10-cycles then
it's probably OK, if 100+ it is getting on the annoying side of things.
 
> Granted, this is more a case of "just sort of pushing the cost somewhere 
> else" and in theory, if the compiler knows that the instruction will 
> just be patched anyways, it could in premise generate intermediate calls 
> for cheaper.
------------------- 
> >> As can be noted, in my case, it was a partial motivation for supporting
> >> things like 128-bit integer instructions (in my C compiler, and
> >> optionally in the underlying ISA), as supporting Int128 ops is a step
> >> towards making doing Binary128 in software more practical (without the
> >> steep cost of a 128-bit FPU).
> > 
> > It seems to me that if one has "reasonable" ISA support for tearing a
> > 128-bit FP into {sign, exponent, fraction} and reasonably fast 128-bit
> > integer support, then emulating 128-bit FP in SW is "not that bad"--
> > especially if one can do 128×128 -> 256 in 4-8 cycles.
> >   
> 
> Yeah, this is basically the idea.
> 
> Int128 ops, and my BITMOV instructions (which can extract/insert/move 
> bitfields within 64 and 128 bit containers; as a combined "Shift and 
> masked MUX"), can provide a nice boost here.
> 
> Sadly, there is still not really a great way to do a 128x128 => 256 
> multiply though.

My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
{a bit more than ½ of the get 1ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

To get 64×64->128 I need my <single> prefix instruction CARRY. But
this also gives me (64×64->128)/64 ->{64,64} {quotient, remainder}.

>                  Current fastest option is still to decompose it into a 
> crapload of 32x32=>64 bit widening multiply ops (which, ironically, is 
> another thing that RV is lacking in; need to use a full 64-bit multiply, 
> but there are downsides, more-so when the base ISA is also lacking 
> PACK/PACKU).

"Not my fault".
 
> Still kinda funny that RV land, with all of its wide industrial support, 
> lots of people doing lots of extensions, advanced features, etc. 
> Seemingly still fails at making an ISA where "basic things" fit together 
> well.
> 
> And, then a lot of features going off in rabbit holes like "why would 
> you want this?", and then it turns out it is to micro-optimize some 
> specific test case within SPECint or something (often, rather than 
> finding a more general solution that would address multiple related issues).

Reasonable support for 64×64->128 is what makes emulation "affordable".

Side note: Back in 1987, MIPS had a 13-cycle multiply using their non-
pipelined FU and special registers--while the Mc 88100 had a 32×32
multiply in 3 cycles. Well, it ends up one could program this multiplier
to do 32×32->64 in 13 cycles; TOO !!
 
> More so when the "micro-optimize the benchmark" features were more often 
> chosen over the more general purpose "actually address the underlying 
> issue" features.

Been there done that......
> 
> 
> Granted, then someone is almost invariably going to be like "all the 
> parts of RV do fit together well, but you are using it wrong...".
> 
> But, in this case, would expect GCC to generate smaller binaries than 
> BGBCC; leaving me to think it is more a case of "these parts don't fit 
> together all that well".
> 
> 
> 
> >> ...
> >>
> >>
> >>> Terje
> >>>
> >>
>


#114691 — Re: floating point history, word order and byte order

From	MitchAlsup <user5857@newsgrouper.org.invalid>
Date	2026-01-08 02:38 +0000
Subject	Re: floating point history, word order and byte order
Message-ID	<1767839937-5857@newsgrouper.org>
In reply to	#114689

MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
> 
> My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
> {a bit more than ½ of the get 1ULP at 58×58}. I gave a lot of thought
> to this {~1 year} before deciding that a "Do everything else" function
> unit was "overall" better than a couple of "near miss" FUs. So, IMUL
> comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
> IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.


#114693 — Re: floating point history, word order and byte order

From	Michael S <already5chosen@yahoo.com>
Date	2026-01-08 10:52 +0200
Subject	Re: floating point history, word order and byte order
Message-ID	<20260108105221.0000099e@yahoo.com>
In reply to	#114691

On Thu, 08 Jan 2026 02:38:57 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

> MitchAlsup <user5857@newsgrouper.org.invalid> posted:
> -------------------
> > 
> > My Transcendentals get to 1ULP when the multiplier tree is
> > 59×59-bits {a bit more than ½ of the get 1ULP at 58×58}. I gave a
> > lot of thought to this {~1 year} before deciding that a "Do
> > everything else" function unit was "overall" better than a couple
> > of "near miss" FUs. So, IMUL comes in at 4 cycles, FMUL at 4
> > cycles, FDIV at 17, SQRT at 22, and IDIV at 25. The fast IMUL makes
> > up a lot for the "size" of the FU.  
> 
> I forgot to add that my transcendentals went from <just barely>
> faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
> expanded.
> 

Don't you mean '0.5002 ULP' ?

