Floating point vs fixed arithmetics (signed 64-bit)

Started by: kishor <kiishor@gmail.com>
First post: 2012-03-26 02:22 -0700
Last post: 2012-03-28 22:59 +0300
Articles: 16 on this page of 56 (20 participants)

Contents

  Floating point vs fixed arithmetics (signed 64-bit) kishor <kiishor@gmail.com> - 2012-03-26 02:22 -0700
    Re: Floating point vs fixed arithmetics (signed 64-bit) "Boudewijn Dijkstra" <sp4mtr4p.boudewijn@indes.com> - 2012-03-26 12:08 +0200
    Re: Floating point vs fixed arithmetics (signed 64-bit) Arlet Ottens <usenet+5@c-scape.nl> - 2012-03-26 13:14 +0200
      Re: Floating point vs fixed arithmetics (signed 64-bit) David Brown <david@westcontrol.removethisbit.com> - 2012-03-26 13:24 +0200
        Re: Floating point vs fixed arithmetics (signed 64-bit) kishor <kiishor@gmail.com> - 2012-03-26 05:24 -0700
          Re: Floating point vs fixed arithmetics (signed 64-bit) Fredrik Östman <Fredrik_Oestman@work.invalid> - 2012-03-26 12:38 +0000
            Re: Floating point vs fixed arithmetics (signed 64-bit) kishor <kiishor@gmail.com> - 2012-03-26 06:33 -0700
              Re: Floating point vs fixed arithmetics (signed 64-bit) Arlet Ottens <usenet+5@c-scape.nl> - 2012-03-26 15:49 +0200
              Re: Floating point vs fixed arithmetics (signed 64-bit) David Brown <david@westcontrol.removethisbit.com> - 2012-03-26 15:45 +0200
              Re: Floating point vs fixed arithmetics (signed 64-bit) Fredrik Östman <Fredrik_Oestman@work.invalid> - 2012-03-26 14:34 +0000
          Re: Floating point vs fixed arithmetics (signed 64-bit) Arlet Ottens <usenet+5@c-scape.nl> - 2012-03-26 15:34 +0200
            Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.com> - 2012-03-26 12:25 -0500
              Re: Floating point vs fixed arithmetics (signed 64-bit) Arlet Ottens <usenet+5@c-scape.nl> - 2012-03-26 20:19 +0200
                Re: Floating point vs fixed arithmetics (signed 64-bit) Rich Webb <bbew.ar@mapson.nozirev.ten> - 2012-03-26 16:45 -0400
                  Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.com> - 2012-03-26 17:15 -0500
                    Re: Floating point vs fixed arithmetics (signed 64-bit) Rich Webb <bbew.ar@mapson.nozirev.ten> - 2012-03-26 19:09 -0400
                      Re: Floating point vs fixed arithmetics (signed 64-bit) kishor <kiishor@gmail.com> - 2012-03-27 04:59 -0700
                        Re: Floating point vs fixed arithmetics (signed 64-bit) David Brown <david@westcontrol.removethisbit.com> - 2012-03-27 15:25 +0200
                          Re: Floating point vs fixed arithmetics (signed 64-bit) David T. Ashley <dashley@gmail.com> - 2012-03-29 13:17 -0400
    Re: Floating point vs fixed arithmetics (signed 64-bit) "Paul E. Bennett" <Paul_E.Bennett@topmail.co.uk> - 2012-03-27 11:28 +0100
    Re: Floating point vs fixed arithmetics (signed 64-bit) David T. Ashley <dashley@gmail.com> - 2012-03-27 11:28 -0400
      Re: Floating point vs fixed arithmetics (signed 64-bit) upsidedown@downunder.com - 2012-03-27 18:52 +0300
        Re: Floating point vs fixed arithmetics (signed 64-bit) David T. Ashley <dashley@gmail.com> - 2012-03-27 13:02 -0400
          Re: Floating point vs fixed arithmetics (signed 64-bit) Walter Banks <walter@bytecraft.com> - 2012-03-27 13:56 -0500
            Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.com> - 2012-03-27 14:17 -0500
              Re: Floating point vs fixed arithmetics (signed 64-bit) Walter Banks <walter@bytecraft.com> - 2012-03-27 15:35 -0500
                Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.please> - 2012-03-27 22:36 -0500
            Re: Floating point vs fixed arithmetics (signed 64-bit) David Brown <david@westcontrol.removethisbit.com> - 2012-03-28 09:00 +0200
            Re: Floating point vs fixed arithmetics (signed 64-bit) j.m.granville@gmail.com - 2012-03-30 04:08 -0700
              Re: Floating point vs fixed arithmetics (signed 64-bit) Mark Borgerson <mborgerson@comcast.net> - 2012-04-02 22:52 -0700
                Re: Floating point vs fixed arithmetics (signed 64-bit) John Devereux <john@devereux.me.uk> - 2012-04-03 11:33 +0100
                  Re: Floating point vs fixed arithmetics (signed 64-bit) Anders.Montonen@kapsi.spam.stop.fi.invalid - 2012-04-03 12:05 +0000
                    Re: Floating point vs fixed arithmetics (signed 64-bit) John Devereux <john@devereux.me.uk> - 2012-04-03 16:34 +0100
                      Re: Floating point vs fixed arithmetics (signed 64-bit) Paul <paul@pcserviceselectronics.co.uk> - 2012-04-04 09:35 +0100
              Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.com> - 2012-04-03 13:52 -0500
                Re: Floating point vs fixed arithmetics (signed 64-bit) Mark Borgerson <mborgerson@comcast.net> - 2012-04-04 16:50 -0700
                  Re: Floating point vs fixed arithmetics (signed 64-bit) John Devereux <john@devereux.me.uk> - 2012-04-05 11:48 +0100
          Re: Floating point vs fixed arithmetics (signed 64-bit) David Brown <david@westcontrol.removethisbit.com> - 2012-03-28 09:17 +0200
            Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.com> - 2012-03-28 12:20 -0500
              Re: Floating point vs fixed arithmetics (signed 64-bit) Andrew Reilly <areilly---@bigpond.net.au> - 2012-03-28 22:44 +0000
                Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.com> - 2012-03-28 18:35 -0500
                  Re: Floating point vs fixed arithmetics (signed 64-bit) David Brown <david@westcontrol.removethisbit.com> - 2012-03-29 10:58 +0200
                  Re: Floating point vs fixed arithmetics (signed 64-bit) Mark Borgerson <mborgerson@comcast.net> - 2012-03-29 07:56 -0700
                    Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.com> - 2012-03-29 16:52 -0500
                      Re: Floating point vs fixed arithmetics (signed 64-bit) Mark Borgerson <mborgerson@comcast.net> - 2012-03-29 21:19 -0700
                        Re: Floating point vs fixed arithmetics (signed 64-bit) Tim Wescott <tim@seemywebsite.please> - 2012-03-30 00:42 -0500
                Re: Floating point vs fixed arithmetics (signed 64-bit) upsidedown@downunder.com - 2012-03-29 07:19 +0300
                  Re: Floating point vs fixed arithmetics (signed 64-bit) Andrew Reilly <areilly---@bigpond.net.au> - 2012-03-29 11:53 +0000
                    Re: Floating point vs fixed arithmetics (signed 64-bit) Walter Banks <walter@bytecraft.com> - 2012-03-29 09:40 -0500
                    Re: Floating point vs fixed arithmetics (signed 64-bit) upsidedown@downunder.com - 2012-03-29 23:46 +0300
                  Re: Floating point vs fixed arithmetics (signed 64-bit) Walter Banks <walter@bytecraft.com> - 2012-03-29 09:28 -0500
                    Re: Floating point vs fixed arithmetics (signed 64-bit) David Brown <david@westcontrol.removethisbit.com> - 2012-03-29 16:58 +0200
              Re: Floating point vs fixed arithmetics (signed 64-bit) David Brown <david@westcontrol.removethisbit.com> - 2012-03-29 10:09 +0200
              Re: Floating point vs fixed arithmetics (signed 64-bit) Clifford Heath <cjh@no.spam.please.net> - 2012-04-01 18:08 +1000
            Re: Floating point vs fixed arithmetics (signed 64-bit) dp <dp@tgi-sci.com> - 2012-03-28 02:38 -0700
        Re: Floating point vs fixed arithmetics (signed 64-bit) upsidedown@downunder.com - 2012-03-28 22:59 +0300

#7821

From: Tim Wescott <tim@seemywebsite.com>
Date: 2012-03-28 18:35 -0500
Message-ID: <oKCdnSXl7OPQPe7SnZ2dnUVZ_gednZ2d@web-ster.com>
In reply to: #7820

On Wed, 28 Mar 2012 22:44:32 +0000, Andrew Reilly wrote:

> On Wed, 28 Mar 2012 12:20:51 -0500, Tim Wescott wrote:
> 
>> On Wed, 28 Mar 2012 09:17:14 +0200, David Brown wrote:
>> 
>>> On 27/03/2012 19:02, David T. Ashley wrote:
>>>> On Tue, 27 Mar 2012 18:52:09 +0300, upsidedown@downunder.com wrote:
>>>>
>>>>> On Tue, 27 Mar 2012 11:28:18 -0400, David T. Ashley
>>>>> <dashley@gmail.com>  wrote:
>>>>>
>>>>>
>>>>>> Without FPU support, assuming that the processor has basic integer
>>>>>> multiplication instructions, integer operations are ALWAYS faster
>>>>>> than floating-point operations.  Usually _far_ faster.  And always
>>>>>> more precise.
>>>>>
>>>>> Floating point instructions MUL/DIV are trivial, just
>>>>> multiply/divide the mantissa and add/sub the exponent.
>>>>>
>>>>> With FP add/sub you have to denormalize one operand and then
>>>>> normalize the result, which can be quite time consuming, without
>>>>> sufficient HW support.
>>>>>
>>>>> This can be really time consuming, if the HW is designed by an
>>>>> idiot.
>>>>
>>>> Your observations are valid.  But I have yet to see a practical
>>>> example of something that can be done faster and with equal accuracy
>>>> in floating point vs. using integer operations.
>>>>
>>>>
>>> It depends on the chip, the type of floating point hardware it has,
>>> the operations you need, the compiler, and the code quality.  For a
>>> lot of heavy calculations done with integer arithmetic, you need a
>>> number of "extra" instructions as well as the basic add, subtract,
>>> multiply and divides.  You might need shifts for scaling, mask
>>> operations, extra code to get the signs right, etc.  And the paths for
>>> these are likely to be highly serialised, with each depending directly
>>> on the results of the previous operation, which slows down pipelining.
>>>  With hardware floating point, you have a much simpler instruction
>>> stream, which can result in faster throughput even if the actual
>>> latency for the calculations is the same.
>>> 
>>> This effect increases with the size and complexity of the processor.
>>> Obviously it is dependent on the processor having floating point
>>> hardware for the precision needed (single or double), but once you
>>> have any sort of hardware floating point you should re-check all your
>>> assumptions about speed differences.  You could be wrong in either
>>> direction.
>> 
>> The key point is "it is dependent on the processor having floating
>> point hardware for the precision needed".  And, I might add, on other
>> things --
>> see Walter Banks's comments in another sub-thread about 32-bit floating
>> point vs. 32-bit integer math.
>> 
>> In my experience with signal processing and control loops, having a
>> library that implements fixed-point, fractional arithmetic with
>> saturation on addition and shift-up is often faster than floating point
>> _or_ "pure" integer math, and sidesteps a number of problems with both.
>> It's at the cost of a learning curve with anyone using the package, but
>> it works well.
>> 
>> On all the processors I've tried it except for x86 processors, there's
>> been a 3-20x speedup once I've hand-written the assembly code to do the
>> computation (and that's without understanding or trying to accommodate
>> any pipelines that may exist).
> 
> Weren't you the one that said that your (tuned) ARM C code was generally
> only a factor of 1.2 worse than the best hand-tweaked assembly code?
> Maybe not, but I've seen it said in these parts.  Certainly, my
> experience is that that is quite a good rule of thumb, and it is very
> difficult to get more than a factor of two between assembler and C
> unless the platform in question has a very poor C compiler or the
> assembly code is actually implementing a different algorithm (which is
> sometimes possible, but much rarer in these days of well-supplied
> intrinsic function libraries.)

When the compiler can figure out what I mean, yes, it is usually at least 
almost as good as I can do, and sometimes better (I don't carry around 
all the instruction reordering rules in my head: the compiler does).

With fixed-point arithmetic stuff, though, the compiler never seems to 
"get it".
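
For readers who have not met this style of arithmetic: the fixed-point, fractional operations with saturation mentioned above can be sketched in portable C roughly as below. This is only an illustration, not the library discussed in this thread. The type name q31_t and the function names are made up for the example, and it assumes that right-shifting a negative int64_t is an arithmetic shift (true of the usual embedded compilers, but implementation-defined in C).

```c
#include <stdint.h>

/* Hypothetical Q1.31 type: the int32_t range maps onto [-1.0, 1.0). */
typedef int32_t q31_t;

/* Saturating add: clamp to the representable range instead of wrapping. */
q31_t q31_add_sat(q31_t a, q31_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* widen so the sum cannot overflow */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (q31_t)sum;
}

/* Fractional multiply: 32x32 -> 64-bit product, then drop 31 fraction bits. */
q31_t q31_mul(q31_t a, q31_t b)
{
    int64_t prod = (int64_t)a * (int64_t)b;  /* Q2.62 intermediate */
    prod >>= 31;                             /* back to Q1.31 (arithmetic shift assumed) */
    if (prod > INT32_MAX) return INT32_MAX;  /* only reachable for (-1.0) * (-1.0) */
    return (q31_t)prod;
}

/* Saturating shift-up: multiply by 2^n (n assumed to be 0..31), clamping on overflow. */
q31_t q31_shl_sat(q31_t a, unsigned n)
{
    int64_t v = (int64_t)a * ((int64_t)1 << n);
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (q31_t)v;
}
```

A compiler will usually turn this into a run of generic 64-bit operations, whereas many cores can do the same job in a handful of instructions. That gap is one plausible source of the difference between C and hand-written assembly for this kind of code.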

>> But on the x86 -- which is the _only_ processor that I've tried it that
>> had floating point -- 32-bit fractional arithmetic is slower than
>> 64-bit floating point.
> 
> One thing that gives float a particular edge on the x86(32) (but which
> can also apply to other processors) is that using floating point means
> that you don't have to use the precious integer register set for data:
> it can be used for pointers, counters and other control periphera,
> leaving the working "data state" in the FPU registers.  Modern SIMD
> units can do integer operations as well as floating point, so the "extra
> state" argument might seem weaker, but I've never seen a compiler use
> SIMD registers for integer calculations (unless forced to with intrinsic
> functions).

So, the next time I try this on x86 I should use the SIMD registers.

Actually, if you know you're going to be doing things like vector dot 
products, then you could probably get some significant speed-up by doing 
a spot of assembly here and there.  I haven't had occasion to try this on 
an x86, though.

>> So, yes -- whether integer (or fixed point) arithmetic is going to be
>> faster than floating point depends _a lot_ on the processor.  So
>> instead of automatically deciding to do everything "the hard way" and
>> feeling clever and virtuous thereby, you should _benchmark_ the
>> performance of a code sample with floating point vs. whatever
>> fixed-point poison you choose.
> 
> Fast isn't always the only consideration, though.  Floating point is
> *always* going to be more power-hungry than fixed point, simply because
> it is doing a bunch of extra work at run-time that fixed-point forces
> you to hoist to compile-time.

It'll be power hungry twice if you select a chip that has floating point 
hardware.  I never seem to have the budget -- either dollars or watts -- 
to use such processors.

> The advice to benchmark is excellent, of course.  Particularly because
> the results won't necessarily be what you expect.

Yes.  Even when I expect anti-intuitive results, I can still be 
astonished by benchmarks.

-- 
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com



#7824

From: David Brown <david@westcontrol.removethisbit.com>
Date: 2012-03-29 10:58 +0200
Message-ID: <Yd6dnZuuFvyHu-nSnZ2dnUVZ8r2dnZ2d@lyse.net>
In reply to: #7821

On 29/03/2012 01:35, Tim Wescott wrote:
> On Wed, 28 Mar 2012 22:44:32 +0000, Andrew Reilly wrote:
>
>> On Wed, 28 Mar 2012 12:20:51 -0500, Tim Wescott wrote:
>>
>>> On Wed, 28 Mar 2012 09:17:14 +0200, David Brown wrote:
>>>
>>>> On 27/03/2012 19:02, David T. Ashley wrote:
>>>>> On Tue, 27 Mar 2012 18:52:09 +0300, upsidedown@downunder.com wrote:
>>>>>
>>>>>> On Tue, 27 Mar 2012 11:28:18 -0400, David T. Ashley
>>>>>> <dashley@gmail.com>   wrote:
>>>>>>
>>>>>>
>>>>>>> Without FPU support, assuming that the processor has basic integer
>>>>>>> multiplication instructions, integer operations are ALWAYS faster
>>>>>>> than floating-point operations.  Usually _far_ faster.  And always
>>>>>>> more precise.
>>>>>>
>>>>>> Floating point instructions MUL/DIV are trivial, just
>>>>>> multiply/divide the mantissa and add/sub the exponent.
>>>>>>
>>>>>> With FP add/sub you have to denormalize one operand and then
>>>>>> normalize the result, which can be quite time consuming, without
>>>>>> sufficient HW support.
>>>>>>
>>>>>> This can be really time consuming, if the HW is designed by an
>>>>>> idiot.
>>>>>
>>>>> Your observations are valid.  But I have yet to see a practical
>>>>> example of something that can be done faster and with equal accuracy
>>>>> in floating point vs. using integer operations.
>>>>>
>>>>>
>>>> It depends on the chip, the type of floating point hardware it has,
>>>> the operations you need, the compiler, and the code quality.  For a
>>>> lot of heavy calculations done with integer arithmetic, you need a
>>>> number of "extra" instructions as well as the basic add, subtract,
>>>> multiply and divides.  You might need shifts for scaling, mask
>>>> operations, extra code to get the signs right, etc.  And the paths for
>>>> these are likely to be highly serialised, with each depending directly
>>>> on the results of the previous operation, which slows down pipelining.
>>>>   With hardware floating point, you have a much simpler instruction
>>>> stream, which can result in faster throughput even if the actual
>>>> latency for the calculations is the same.
>>>>
>>>> This effect increases with the size and complexity of the processor.
>>>> Obviously it is dependent on the processor having floating point
>>>> hardware for the precision needed (single or double), but once you
>>>> have any sort of hardware floating point you should re-check all your
>>>> assumptions about speed differences.  You could be wrong in either
>>>> direction.
>>>
>>> The key point is "it is dependent on the processor having floating
>>> point hardware for the precision needed".  And, I might add, on other
>>> things --
>>> see Walter Banks's comments in another sub-thread about 32-bit floating
>>> point vs. 32-bit integer math.
>>>
>>> In my experience with signal processing and control loops, having a
>>> library that implements fixed-point, fractional arithmetic with
>>> saturation on addition and shift-up is often faster than floating point
>>> _or_ "pure" integer math, and sidesteps a number of problems with both.
>>> It's at the cost of a learning curve with anyone using the package, but
>>> it works well.
>>>
>>> On all the processors I've tried it except for x86 processors, there's
>>> been a 3-20x speedup once I've hand-written the assembly code to do the
>>> computation (and that's without understanding or trying to accommodate
>>> any pipelines that may exist).
>>
>> Weren't you the one that said that your (tuned) ARM C code was generally
>> only a factor of 1.2 worse than the best hand-tweaked assembly code?
>> Maybe not, but I've seen it said in these parts.  Certainly, my
>> experience is that that is quite a good rule of thumb, and it is very
>> difficult to get more than a factor of two between assembler and C
>> unless the platform in question has a very poor C compiler or the
>> assembly code is actually implementing a different algorithm (which is
>> sometimes possible, but much rarer in these days of well-supplied
>> intrinsic function libraries.)
>
> When the compiler can figure out what I mean, yes, it is usually at least
> almost as good as I can do, and sometimes better (I don't carry around
> all the instruction reordering rules in my head: the compiler does).
>
> With fixed-point arithmetic stuff, though, the compiler never seems to
> "get it".
>
>>> But on the x86 -- which is the _only_ processor that I've tried it that
>>> had floating point -- 32-bit fractional arithmetic is slower than
>>> 64-bit floating point.
>>
>> One thing that gives float a particular edge on the x86(32) (but which
>> can also apply to other processors) is that using floating point means
>> that you don't have to use the precious integer register set for data:
>> it can be used for pointers, counters and other control periphera,
>> leaving the working "data state" in the FPU registers.  Modern SIMD
>> units can do integer operations as well as floating point, so the "extra
>> state" argument might seem weaker, but I've never seen a compiler use
>> SIMD registers for integer calculations (unless forced to with intrinsic
>> functions).
>
> So, the next time I try this on x86 I should use the SIMD registers.
>

This is one of the reasons why it is best to use a modern compiler for 
big processors - it is hard to keep up with them when working by hand. 
On small devices, you can learn all you need to know about the cpu - but 
for modern x86 devices, it is just too much effort.  And if you are 
trying to generate the fastest possible code, it varies significantly 
between different x86 models - your fine-tuned hand-coded assembly may 
run optimally on the cpu you have on your machine today, but poorly on 
another machine.

> Actually, if you know you're going to be doing things like vector dot
> products, then you could probably get some significant speed-up by doing
> a spot of assembly here and there.  I haven't had occasion to try this on
> an x86, though.

For particularly complex vector work, hand-coding the SIMD instructions 
is essential for optimal speed.  But compilers are getting surprisingly 
good at generating some of this stuff semi-automatically - it is worth 
trying the compiler's SIMD support before doing it by hand.  The other 
option is libraries - Intel in particular provides optimised libraries 
for this sort of stuff.
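
As a rough illustration of "trying the compiler first", a dot product written as a plain loop like the one below is the sort of code a vectorising compiler has a chance of mapping onto SIMD instructions (for example GCC at -O3, or -O2 with -ftree-vectorize). The function name dot_i16 is made up for this sketch. Whether it actually vectorises depends on the compiler version and target, so check the generated assembly rather than assuming. The caller is assumed to keep n small enough that the 32-bit accumulator cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit integer vectors with a 32-bit accumulator.
 * `restrict` promises the compiler that the arrays do not alias, which
 * removes one common obstacle to auto-vectorisation. */
int32_t dot_i16(const int16_t *restrict a, const int16_t *restrict b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];   /* 16x16 -> 32-bit multiply-accumulate */
    }
    return acc;
}
```

An integer accumulator is used here on purpose: integer addition can be reordered freely, while a floating-point sum generally needs relaxed-math options before the compiler is allowed to reorder it.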

>
>>> So, yes -- whether integer (or fixed point) arithmetic is going to be
>>> faster than floating point depends _a lot_ on the processor.  So
>>> instead of automatically deciding to do everything "the hard way" and
>>> feeling clever and virtuous thereby, you should _benchmark_ the
>>> performance of a code sample with floating point vs. whatever
>>> fixed-point poison you choose.
>>
>> Fast isn't always the only consideration, though.  Floating point is
>> *always* going to be more power-hungry than fixed point, simply because
>> it is doing a bunch of extra work at run-time that fixed-point forces
>> you to hoist to compile-time.
>

That is a wildly inaccurate generalisation.  For small processors, the 
power consumption is going to depend on the speed of the calculations - 
these cores are all-or-nothing in their power usage, so doing the work 
faster means you can go to sleep sooner.  So faster is lower power.  For 
larger processors, there may be dynamic clock enabling of different 
parts - if the hardware floating point unit is not used, it can be 
powered-down.  Then there is a trade-off - do you spend extra time in 
the integer units, or do you do the job faster with the power-hungry 
floating point unit?  The answer will vary there too, but typically 
faster means less energy overall.
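
The "finish sooner, sleep sooner" argument is just energy = power x time summed over the awake and asleep parts of each period. A small sketch of that arithmetic follows. The function name and every number in the usage note are hypothetical, chosen only to show the shape of the trade-off, not measured on any real part.

```c
/* Energy used in one processing period, in microjoules.
 * Power is in milliwatts and time in milliseconds, so mW * ms = uJ.
 * The core is assumed to draw active_mw while computing and sleep_mw
 * for the remainder of the period. */
double period_energy_uj(double active_mw, double sleep_mw,
                        double active_ms, double period_ms)
{
    return active_mw * active_ms + sleep_mw * (period_ms - active_ms);
}
```

With made-up figures: a core drawing 20 mW that finishes in 2 ms of a 100 ms period and sleeps at 0.01 mW uses about 41 uJ, while one drawing only 15 mW but needing 6 ms uses about 91 uJ. The lower-power option loses because it stays awake three times as long.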

It is obviously correct that the more work that is done at compile time 
the better - it is only run-time that takes power (on the target).  But 
I can think of no justification for claiming that fixed-point algorithms 
will do more at compile-time than floating-point algorithms - I would 
expect the floating-point code to do far more compile-time optimisation 
and pre-calculation (since the compiler has a better understanding of 
the code in question).

> It'll be power hungry twice if you select a chip that has floating point
> hardware.  I never seem to have the budget -- either dollars or watts --
> to use such processors.
>
>> The advice to benchmark is excellent, of course.  Particularly because
>> the results won't necessarily be what you expect.
>
> Yes.  Even when I expect anti-intuitive results, I can still be
> astonished by benchmarks.
>



#7832

From: Mark Borgerson <mborgerson@comcast.net>
Date: 2012-03-29 07:56 -0700
Message-ID: <MPG.29dd600a23f04ec9898c9@news.eternal-september.org>
In reply to: #7821

In article <oKCdnSXl7OPQPe7SnZ2dnUVZ_gednZ2d@web-ster.com>, 
tim@seemywebsite.com says...
> 
> On Wed, 28 Mar 2012 22:44:32 +0000, Andrew Reilly wrote:
> 
<<SNIP>>
> > Fast isn't always the only consideration, though.  Floating point is
> > *always* going to be more power-hungry than fixed point, simply because
> > it is doing a bunch of extra work at run-time that fixed-point forces
> > you to hoist to compile-time.
> 
> It'll be power hungry twice if you select a chip that has floating point 
> hardware.  I never seem to have the budget -- either dollars or watts -- 
> to use such processors.

Cortex M4 chips, like the STM32F405, have lowered the bar quite a bit for 
FPU availability.  The STM32F405 is about $11.50 qty 1 at DigiKey.  The 
STM32F205 Cortex M3 is about the same price.

I've got one of the chips, and it's compatible with the F205 board I 
designed, so I'll be trying it out soon.  More RAM, more Flash, faster
clock----everything we look forward to in a new generation of chips.
(since I'm not using an OS or big USB or ethernet stacks, I'll have LOTS
of flash left over for things like lookup tables, etc.)

Right now, I'm just happy to read an SD card and send bit-banged data to 
an FT232H at about  6MB/second.   I can even use the same drivers and 
host I use with the FT245 chips which do the same thing at about 
200KB/s.  The 4-bit SD interface on the STM chips can do multi-block
reads at upwards of 10MB/s.   Hard to match that with SPI mode!

> 
> > The advice to benchmark is excellent, of course.  Particularly because
> > the results won't necessarily be what you expect.
> 
> Yes.  Even when I expect anti-intuitive results, I can still be 
> astonished by benchmarks.

I think the FPU availability will greatly simplify coding of things like
Extended Kalman Filters and digital signal processing apps.  You can 
write and test code on a PC while specifying 32-bit floats and port 
pretty easily to the MPU system.


Mark Borgerson





#7840

From: Tim Wescott <tim@seemywebsite.com>
Date: 2012-03-29 16:52 -0500
Message-ID: <rZydncI478wfROnSnZ2dnUVZ_ridnZ2d@web-ster.com>
In reply to: #7832

On Thu, 29 Mar 2012 07:56:50 -0700, Mark Borgerson wrote:

> In article <oKCdnSXl7OPQPe7SnZ2dnUVZ_gednZ2d@web-ster.com>,
> tim@seemywebsite.com says...
>> 
>> On Wed, 28 Mar 2012 22:44:32 +0000, Andrew Reilly wrote:
>> 
> <<SNIP>>
>> > Fast isn't always the only consideration, though.  Floating point is
>> > *always* going to be more power-hungry than fixed point, simply
>> > because it is doing a bunch of extra work at run-time that
>> > fixed-point forces you to hoist to compile-time.
>> 
>> It'll be power hungry twice if you select a chip that has floating
>> point hardware.  I never seem to have the budget -- either dollars or
>> watts -- to use such processors.
> 
> Cortex M4 chips, like the STM32F405, have lowered the bar quite a bit for
> FPU availability.  The STM32F405 is about $11.50 qty 1 at DigiKey.  The
> STM32F205 Cortex M3 is about the same price.
> 
> I've got one of the chips, and it's compatible with the F205 board I
> designed, so I'll be trying it out soon.  More RAM, more Flash, faster
> clock----everything we look forward to in a new generation of chips.
> (since I'm not using an OS or big USB or ethernet stacks, I'll have LOTS
> of flash left over for things like lookup tables, etc.)
> 
> Right now, I'm just happy to read an SD card and send bit-banged data to
> an FT232H at about  6MB/second.   I can even use the same drivers and
> host I use with the FT245 chips which do the same thing at about
> 200KB/s.  The 4-bit SD interface on the STM chips can do multi-block
> reads at upwards of 10MB/s.   Hard to match that with SPI mode!
> 
> 
>> > The advice to benchmark is excellent, of course.  Particularly
>> > because the results won't necessarily be what you expect.
>> 
>> Yes.  Even when I expect anti-intuitive results, I can still be
>> astonished by benchmarks.
> 
> I think the FPU availability will greatly simplify coding of things like
> Extended Kalman Filters and digital signal processing apps.  You can
> write and test code on a PC while specifying 32-bit floats and port
> pretty easily to the MPU system.

Be careful of 32-bit floating point.  It is insufficient for a number of 
real-world tasks for which 32-bit fixed point is well suited.  IEEE 
single-precision floating point gives you (effectively) a 25- or 26-bit 
mantissa (I can't remember how many bits it is, plus sign, plus implied 
1).  When integrator gains get low, that's not enough, where the extra 
factor of 128 or 64 available from well-scaled fixed point will save the 
day.
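
A small experiment shows the effect. This is a sketch under stated assumptions: float is IEEE single precision, the fixed-point state uses a Q2.30 format (1.0 stored as 1 << 30), and the function names are invented for the example. With a state near 1.0, an increment of 1e-8 is below half of float's spacing there (about 1.2e-7), so each addition rounds straight back to the old value. In Q2.30 the same increment is about 11 counts and accumulates normally.

```c
#include <stdint.h>

/* Integrate `steps` increments of `inc` onto a float state starting at 1.0. */
float integrate_float(float inc, int steps)
{
    volatile float state = 1.0f;   /* volatile: force rounding to float on every step */
    for (int i = 0; i < steps; i++) {
        state = state + inc;
    }
    return state;
}

/* The same integrator in Q2.30 fixed point: one LSB is 2^-30. */
int32_t integrate_q30(int32_t inc_q30, int steps)
{
    int32_t state = INT32_C(1) << 30;   /* 1.0 in Q2.30 */
    for (int i = 0; i < steps; i++) {
        state += inc_q30;               /* caller keeps the total within int32_t range */
    }
    return state;
}
```

Calling integrate_float(1e-8f, 1000) returns exactly 1.0f, so the integrator has stalled, while integrate_q30(11, 1000) has moved by 11000 counts, roughly 1.02e-5.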

Be _very_ careful of 32-bit floating point in an Extended Kalman filter.  
Particularly if you're not using a square-root algorithm for the 
evolution of the variance matrix.  You can run out of precision 
astonishingly quickly.

-- 
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com



#7842

From: Mark Borgerson <mborgerson@comcast.net>
Date: 2012-03-29 21:19 -0700
Message-ID: <MPG.29ded4142f5487ab9898ca@news.eternal-september.org>
In reply to: #7840

In article <rZydncI478wfROnSnZ2dnUVZ_ridnZ2d@web-ster.com>, 
tim@seemywebsite.com says...
> 
> On Thu, 29 Mar 2012 07:56:50 -0700, Mark Borgerson wrote:
> 
> > In article <oKCdnSXl7OPQPe7SnZ2dnUVZ_gednZ2d@web-ster.com>,
> > tim@seemywebsite.com says...
> >> 
> >> On Wed, 28 Mar 2012 22:44:32 +0000, Andrew Reilly wrote:
> >> 
> > <<SNIP>>
> >> > Fast isn't always the only consideration, though.  Floating point is
> >> > *always* going to be more power-hungry than fixed point, simply
> >> > because it is doing a bunch of extra work at run-time that
> >> > fixed-point forces you to hoist to compile-time.
> >> 
> >> It'll be power hungry twice if you select a chip that has floating
> >> point hardware.  I never seem to have the budget -- either dollars or
> >> watts -- to use such processors.
> > 
> > Cortex M4 chips, like the STM32F405, have lowered the bar quite a bit for
> > FPU availability.  The STM32F405 is about $11.50 qty 1 at DigiKey.  The
> > STM32F205 Cortex M3 is about the same price.
> > 
> > I've got one of the chips, and it's compatible with the F205 board I
> > designed, so I'll be trying it out soon.  More RAM, more Flash, faster
> > clock----everything we look forward to in a new generation of chips.
> > (since I'm not using an OS or big USB or ethernet stacks, I'll have LOTS
> > of flash left over for things like lookup tables, etc.)
> > 
> > Right now, I'm just happy to read an SD card and send bit-banged data to
> > an FT232H at about  6MB/second.   I can even use the same drivers and
> > host I use with the FT245 chips which do the same thing at about
> > 200KB/s.  The 4-bit SD interface on the STM chips can do multi-block
> > reads at upwards of 10MB/s.   Hard to match that with SPI mode!
> > 
> > 
> >> > The advice to benchmark is excellent, of course.  Particularly
> >> > because the results won't necessarily be what you expect.
> >> 
> >> Yes.  Even when I expect anti-intuitive results, I can still be
> >> astonished by benchmarks.
> > 
> > I think the FPU availability will greatly simplify coding of things like
> > Extended Kalman Filters and digital signal processing apps.  You can
> > write and test code on a PC while specifying 32-bit floats and port
> > pretty easily to the MPU system.
> 
> Be careful of 32-bit floating point.  It is insufficient for a number of 
> real-world tasks for which 32-bit fixed point is well suited.  IEEE 
> single-precision floating point gives you (effectively) a 25- or 26-bit 
> mantissa (I can't remember how many bits it is, plus sign, plus implied 
> 1).  When integrator gains get low, that's not enough, where the extra 
> factor of 128 or 64 available from well-scaled fixed point will save the 
> day.

IIRC, IEEE-754 single precision is 8 bits exponent (offset by 127), one 
bit sign and 23-bit mantissa with an implied 1 bit as the 24th bit.
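
The layout is easy to check by pulling a value apart. A minimal sketch, assuming float on the platform is the IEEE 754 binary32 format (1 sign bit, 8 exponent bits with a bias of 127, 23 stored fraction bits). The struct and function names are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Raw fields of a single-precision value (hypothetical helper type). */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 stored bits; the leading 1 is implied for normal numbers */
} f32_fields;

f32_fields f32_unpack(float x)
{
    uint32_t bits;
    f32_fields f;

    memcpy(&bits, &x, sizeof bits);       /* well-defined way to read the bit pattern */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For 1.0f this gives sign 0, exponent 127 and fraction 0. For -1.5f it gives sign 1, exponent 127 and fraction 0x400000 (the 0.5 sits in the top stored fraction bit).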

That's probably OK for FIR filters working on the results of 16-bit ADCs
as long as the number of terms is reasonable (<30 or so).
OTOH, I handled those calculations nicely on an MSP430 with the onboard
16x16-bit hardware multiply and accumulate.  When I set up the 
coefficients properly, I didn't even have to do a divide of the sum.  I 
just picked the high 16-bit word----an effective divide by 65536.
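
The high-word trick can be sketched in portable C roughly as follows. This is not the MSP430 code described above: the function name is hypothetical, the hardware MAC is replaced by a plain 32-bit accumulator, and it assumes the coefficients are scaled to sum to about 65536 so that the accumulator cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * FIR output from 16-bit samples and 16-bit coefficients.
 * Assumption: the coefficients are pre-scaled by 65536, so the
 * filtered result is simply the upper 16 bits of the 32-bit sum.
 */
int16_t fir_high_word(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];   /* 16x16 -> 32-bit multiply-accumulate */

    /* Right-shifting a negative value is implementation-defined in C;
       this relies on the usual arithmetic shift. */
    return (int16_t)(acc >> 16);
}
```

With four coefficients of 16384 each (summing to 65536) and samples 1000, 2000, 3000, 4000, the function returns their average, 2500, without any division.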

Matlab allows me to generate filters with 16 and 32 bit integers and 32
and 64-bit FP.  If I translate from MSP430 to Cortex,  I would probably 
just translate the filters to 32-bit integer and save the FPU for 
things that might exceed the dynamic range of the 32-bit integers.

> 
> Be _very_ careful of 32-bit floating point in an Extended Kalman filter.  
> Particularly if you're not using a square-root algorithm for the 
> evolution of the variance matrix.  You can run out of precision 
> astonishingly quickly.

Thanks for the notes.  I looked up the last time I ported someone else's
code to a StrongArm processor.  They did use doubles (64-bit FP).  The
chip didn't have an FPU and was running Linux.  The standard FP library 
implementation did all the floating point calculations with software
interrupts  and performance truly sucked.  We ended up revising all the
code to use a special library that didn't use SWIs.   It was still
not as fast as we wanted.  I'm not sure how much a 32-bit FPU will help
with 64-bit FP calculations.  One of these days I'll take a closer look
at the IAR and STM signal processing libraries.

Mark Borgerson



#7844

From: Tim Wescott <tim@seemywebsite.please>
Date: 2012-03-30 00:42 -0500
Message-ID: <88SdnUtdIY4v2ujSnZ2dnUVZ_qudnZ2d@web-ster.com>
In reply to: #7842
On Thu, 29 Mar 2012 21:19:03 -0700, Mark Borgerson wrote:

> In article <rZydncI478wfROnSnZ2dnUVZ_ridnZ2d@web-ster.com>,
> tim@seemywebsite.com says...
>> 
>> On Thu, 29 Mar 2012 07:56:50 -0700, Mark Borgerson wrote:
>> 
>> > In article <oKCdnSXl7OPQPe7SnZ2dnUVZ_gednZ2d@web-ster.com>,
>> > tim@seemywebsite.com says...
>> >> 
>> >> On Wed, 28 Mar 2012 22:44:32 +0000, Andrew Reilly wrote:
>> >> 
>> > <<SNIP>>
>> >> > Fast isn't always the only consideration, though.  Floating point
>> >> > is *always* going to be more power-hungry than fixed point, simply
>> >> > because it is doing a bunch of extra work at run-time that
>> >> > fixed-point forces you to hoist to compile-time.
>> >> 
>> >> It'll be power hungry twice if you select a chip that has floating
>> >> point hardware.  I never seem to have the budget -- either dollars
>> >> or watts -- to use such processors.
>> > 
>> > Cortex M4 chips, like the STM32F405, have lowered the bar quite a bit
>> > for FPU availability.  STM32F405 is about $11.5 qty 1 at DigiKey. 
>> > The STM32F205 Cortex M3 is about the same price.
>> > 
>> > I've got one of the chips, and it's compatible with the F205 board I
>> > designed, so I'll be trying it out soon.  More RAM, more Flash,
>> > faster clock----everything we look forward to in a new generation of
>> > chips. (since I'm not using an OS or big USB or ethernet stacks, I'll
>> > have LOTS of flash left over for things like lookup tables, etc.)
>> > 
>> > Right now, I'm just happy to read an SD card and send bit-banged data
>> > to an FT232H at about  6MB/second.   I can even use the same drivers
>> > and host I use with the FT245 chips which do the same thing at about
>> > 200KB/s.  The 4-bit SD interface on the STM chips can do multi-block
>> > reads at upwards of 10MB/s.   Hard to match that with SPI mode!
>> > 
>> > 
>> >> > The advice to benchmark is excellent, of course.  Particularly
>> >> > because the results won't necessarily be what you expect.
>> >> 
>> >> Yes.  Even when I expect anti-intuitive results, I can still be
>> >> astonished by benchmarks.
>> > 
>> > I think the FPU availability will greatly simplify coding of things
>> > like Extended Kalman Filters and digital signal processing apps.  You
>> > can write and test code on a PC while specifying 32-bit floats and
>> > port pretty easily to the MPU system.
>> 
>> Be careful of 32-bit floating point.  It is insufficient for a number
>> of real-world tasks for which 32-bit fixed point is well suited.  IEEE
>> single-precision floating point gives you (effectively) a 25- or 26-bit
>> mantissa (I can't remember how many bits it is, plus sign, plus implied
>> 1).  When integrator gains get low, that's not enough, where the extra
>> factor of 128 or 64 available from well-scaled fixed point will save
>> the day.
> 
> IIRC, IEEE-754 single precision is 8 bits exponent (offset by 127), one
> bit sign and 23-bit mantissa with an implied 1 bit as the 24th bit.
> 
> That's probably OK for FIR filters working on the results of 16-bit ADCs
> as long as the number of terms is reasonable (<30 or so). OTOH, I
> handled those calculations nicely on an MSP430 with the onboard 16x16-bit
> hardware multiply and accumulate.  When I set up the coefficients
> properly, I didn't even have to do a divide of the sum.  I just picked
> the high 16-bit word -- an effective divide by 65536.
> 
> Matlab allows me to generate filters with 16 and 32 bit integers and 32
> and 64-bit FP.  If I translate from MSP430 to Cortex,  I would probably
> just translate the filters to 32-bit integer and save the FPU for things
> that might exceed the dynamic range of the 32-bit integers.
> 
It gets to be an issue when you're implementing IIR filters or PID 
controllers where the bandwidth of the filter or loop is much smaller 
than the sampling rate: in those circumstances, the difference between 
the maximum size of an accumulator and the size of an increment that 
needs to affect it can get to be a healthy portion of -- or more than -- 
2^25, and then you're screwed.
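
The accumulator problem is easy to reproduce. The sketch below (function names are hypothetical) adds the same small increment repeatedly to a 32-bit float and to a 32-bit integer accumulator. It assumes `float` is IEEE 754 single precision and that the compiler evaluates float arithmetic in single precision, as on SSE or ARM VFP targets.

```c
#include <stdint.h>

/* Add `inc` to a 32-bit float accumulator n times. */
float accumulate_float(float acc, float inc, int n)
{
    for (int i = 0; i < n; i++)
        acc += inc;      /* lost entirely once inc is below half an ulp of acc */
    return acc;
}

/* The same loop with a 32-bit integer (fixed-point) accumulator. */
int32_t accumulate_fixed(int32_t acc, int32_t inc, int n)
{
    for (int i = 0; i < n; i++)
        acc += inc;      /* every increment is kept, until overflow */
    return acc;
}
```

Starting from 16777216 (2^24) with an increment of 1, the float version is still at 16777216 after 1000 additions, because 2^24 + 1 is not representable in 24 bits of mantissa. The integer version reaches 16778216. An integrator with a low gain behaves the same way: the float state stops moving while the fixed-point state keeps integrating.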

> 
>> Be _very_ careful of 32-bit floating point in an Extended Kalman
>> filter. Particularly if you're not using a square-root algorithm for
>> the evolution of the variance matrix.  You can run out of precision
>> astonishingly quickly.
> 
> Thanks for the notes.  I looked up the last time I ported someone else's
> code to a StrongArm processor.  They did use doubles (64-bit FP).  The
> chip didn't have an FPU and was running Linux.  The standard FP library
> implementation did all the floating point calculations with software
> interrupts  and performance truly sucked.  We ended up revising all the
> code to use a special library that didn't use SWIs.   It was still not
> as fast as we wanted.  I'm not sure how much a 32-bit FPU will help with
> 64-bit FP calculations.  One of these days I'll take a closer look at
> the IAR and STM signal processing libraries.

If I needed to implement a Kalman filter on a processor that would take a 
significant speed hit going to 64-bit floating point I'd take a close 
look at the square root algorithms.  The basic idea is that you have to 
do more computation to carry the square root of the variance, but because 
it's a square root you pretty much cut your needed precision in half.
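
A toy illustration of why carrying the square root helps, not a square-root Kalman filter. Suppose a variance matrix holds entries of about 1 and about 1e-8; the corresponding standard deviations are 1 and 1e-4. The hypothetical helper below checks whether the smaller quantity survives being added to the larger one in 32-bit float (assuming IEEE 754 single precision).

```c
/* Returns 1 if adding `small` to `big` changes the 32-bit float result,
   0 if `small` is rounded away completely. */
int survives_float_add(float big, float small)
{
    float sum = big + small;
    return sum != big;
}
```

`survives_float_add(1.0f, 1e-8f)` returns 0: at variance scale the small term vanishes. `survives_float_add(1.0f, 1e-4f)` returns 1: at square-root scale it is still visible. Halving the exponent range of the stored quantities is roughly what halves the precision needed.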

On a PC I rather suspect that using a square root algorithm would be a 
stupid waste of time -- but if brand B can do 32-bit floating point 50 
times faster than 64-bit, the square root algorithm would probably win 
hands down.

-- 
Tim Wescott
Control system and signal processing consulting
www.wescottdesign.com



#7822

From: upsidedown@downunder.com
Date: 2012-03-29 07:19 +0300
Message-ID: <qco7n71gb7jlejacpqjo44toghpu263lrf@4ax.com>
In reply to: #7820
On 28 Mar 2012 22:44:32 GMT, Andrew Reilly <areilly---@bigpond.net.au>
wrote:

>Weren't you the one that said that your (tuned) ARM C code was generally 
>only a factor of 1.2 worse than the best hand-tweaked assembly code?  
>Maybe not, but I've seen it said in these parts.  Certainly, my 
>experience is that that is quite good rule of thumb, and it is very 
>difficult to get more than a factor of two between assembler and C unless 
>the platform in question has a very poor C compiler or the assembly code 
>is actually implementing a different algorithm (which is sometimes 
>possible, but much rarer in these days of well-supplied intrinsic 
>function libraries.)

The main problem trying to write _low_level_ math routines in C is
that you do not have access to the carry bit or use any rotate
instruction. The C-compiler would have to be very clever to convert a
sequence of C-statement into a single rotate instruction or shifting
multiple bits into two registers.
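
For context, the usual portable workaround for the missing carry flag is to detect unsigned wrap-around with a comparison. The sketch below adds two 64-bit values held as 32-bit halves; the type and function names are illustrative, and whether a given compiler turns this into an add/add-with-carry pair depends on the target and toolchain.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves (illustrative type). */
struct u64_pair {
    uint32_t lo;
    uint32_t hi;
};

/* Multi-word add without access to the carry flag. */
struct u64_pair add64_from_32(struct u64_pair a, struct u64_pair b)
{
    struct u64_pair r;
    uint32_t carry;

    r.lo  = a.lo + b.lo;          /* unsigned add wraps modulo 2^32 */
    carry = (r.lo < a.lo);        /* 1 exactly when the low add wrapped */
    r.hi  = a.hi + b.hi + carry;
    return r;
}
```

Adding {lo = 0xFFFFFFFF, hi = 0} and {lo = 1, hi = 0} yields {lo = 0, hi = 1}.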



#7828

From: Andrew Reilly <areilly---@bigpond.net.au>
Date: 2012-03-29 11:53 +0000
Message-ID: <9tj0pvFmp3U1@mid.individual.net>
In reply to: #7828
On Thu, 29 Mar 2012 07:19:02 +0300, upsidedown wrote:

> On 28 Mar 2012 22:44:32 GMT, Andrew Reilly <areilly---@bigpond.net.au>
> wrote:
> 
>>Weren't you the one that said that your (tuned) ARM C code was generally
>>only a factor of 1.2 worse than the best hand-tweaked assembly code?
>>Maybe not, but I've seen it said in these parts.  Certainly, my
>>experience is that that is quite good rule of thumb, and it is very
>>difficult to get more than a factor of two between assembler and C
>>unless the platform in question has a very poor C compiler or the
>>assembly code is actually implementing a different algorithm (which is
>>sometimes possible, but much rarer in these days of well-supplied
>>intrinsic function libraries.)
> 
> The main problem trying to write _low_level_ math routines in C is that
> you do not have access to the carry bit or use any rotate instruction.
> The C-compiler would have to be very clever to convert a sequence of
> C-statement into a single rotate instruction or shifting multiple bits
> into two registers.

It's a funny old world.  I've seen several compilers recognise the pair 
of shifts and an or combination as a rotate, and emit that instruction.  
I've also replaced carefully asm-"optimised" maths routines (on x86) that 
used the carry flag with "vanilla" C equivalents, and the overall effect 
was a fairly dramatic performance improvement.  Not sure whether it was a 
side effect of the assembly code pinning registers that could otherwise 
have been reassigned, or some subtle consequence of reduced dependency, 
but the result was clear.  Guessing performance on massively superscalar, 
out-of-order processors like modern x86-64 is very difficult, IMO.
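
The "pair of shifts and an or" idiom looks like this in C. The masking keeps both shift counts below 32, so the function is well defined for any `n`. Whether a particular compiler emits a single rotate instruction for it is toolchain-dependent and worth checking in the generated assembly; the function name is just an example.

```c
#include <stdint.h>

/* Rotate x left by n bits, with no shift count ever reaching 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

For example, `rotl32(0x12345678u, 8)` gives 0x34567812.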

Intrinsic functions (to get access to things like clz and similar) also 
help a lot.

Benchmarking is important.

Mileage will definitely vary with target and toolchain...

Cheers,

-- 
Andrew



#7831

From: Walter Banks <walter@bytecraft.com>
Date: 2012-03-29 09:40 -0500
Message-ID: <4F74745B.D07F58B1@bytecraft.com>
In reply to: #7828

Andrew Reilly wrote:

> Benchmarking is important.
>
> Mileage will definitely vary with target and toolchain...
>

Nothing wakes me up faster than last night's benchmark results, not
even strong coffee. Benchmarking code fragments is important, but
benchmarking applications can be a real eye-opener.

There is nothing more humbling than adding a clever
optimization to a compiler and discovering that 75%
of the regression applications just got slower and
larger as a result.



Walter..






#7838

From: upsidedown@downunder.com
Date: 2012-03-29 23:46 +0300
Message-ID: <02i9n79nq8un0dlh1ukmh0446ea2sijv10@4ax.com>
In reply to: #7828
On 29 Mar 2012 11:53:36 GMT, Andrew Reilly <areilly---@bigpond.net.au>
wrote:

>> The main problem trying to write _low_level_ math routines in C is that
>> you do not have access to the carry bit or use any rotate instruction.
>> The C-compiler would have to be very clever to convert a sequence of
>> C-statement into a single rotate instruction or shifting multiple bits
>> into two registers.
>
>It's a funny old world.  I've seen several compilers recognise the pair 
>of shifts and an or combination as a rotate, and emit that instruction.  
>I've also replaced carefully asm-"optimised" maths routines (on x86) that 
>used the carry flag with "vanilla" C equivalents, and the overall effect 
>was a fairly dramatic performance improvement.  Not sure whether it was a 
>side effect of the assembly code pinning registers that could otherwise 
>have been reassigned, or some subtle consequence of reduced dependency, 
>but the result was clear.  Guessing performance on massively superscalar, 
>out-of-order processors like modern x86-64 is very difficult, IMO.

The x86 family is a bit of a strange case. The number of cycles required by
trivial integer operations (adds, shifts) compared to more complex
instructions like integer mul/div is nearly 1:1 and the floating point
variants are not much worse. Even some complex cases such as floating
point sin/cos are handled quite quickly.

One might even argue that the relative performance of primitive
operations like shifts and adds is quite poor on x86 processors,
compared to computationally intensive operations like sin/cos
(requiring a 3rd- to 8th-order polynomial).
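
To give a feel for the amount of work such a polynomial involves, here is a sketch of a 7th-order sine approximation evaluated with Horner's rule. The coefficients are plain Taylor terms chosen for illustration; they are an assumption of this example and not what any x86 FPU or math library uses (real implementations add range reduction and tuned coefficients). The function name is hypothetical.

```c
/* sin(x) ~ x - x^3/6 + x^5/120 - x^7/5040, usable only for small |x|. */
double sin_poly7(double x)
{
    const double c3 = -1.0 / 6.0;
    const double c5 =  1.0 / 120.0;
    const double c7 = -1.0 / 5040.0;
    double x2 = x * x;

    /* Horner form: four multiplies by x2 or x and three adds. */
    return x * (1.0 + x2 * (c3 + x2 * (c5 + x2 * c7)));
}
```

That is a handful of dependent multiplies and adds, which is why a hardware sin/cos that costs only a few times more than a single add looks comparatively cheap.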



#7830

From: Walter Banks <walter@bytecraft.com>
Date: 2012-03-29 09:28 -0500
Message-ID: <4F747184.D809F80@bytecraft.com>
In reply to: #7822

upsidedown@downunder.com wrote:

> On 28 Mar 2012 22:44:32 GMT, Andrew Reilly <areilly---@bigpond.net.au>
> wrote:
>
> >Weren't you the one that said that your (tuned) ARM C code was generally
> >only a factor of 1.2 worse than the best hand-tweaked assembly code?
> >Maybe not, but I've seen it said in these parts.  Certainly, my
> >experience is that that is quite good rule of thumb, and it is very
> >difficult to get more than a factor of two between assembler and C unless
> >the platform in question has a very poor C compiler or the assembly code
> >is actually implementing a different algorithm (which is sometimes
> >possible, but much rarer in these days of well-supplied intrinsic
> >function libraries.)
>
> The main problem trying to write _low_level_ math routines in C is
> that you do not have access to the carry bit or use any rotate
> instruction. The C-compiler would have to be very clever to convert a
> sequence of C-statement into a single rotate instruction or shifting
> multiple bits into two registers.

C compilers have been gaining performance in part because compiler
designers work with both a specific target and a subset of applications
in mind.

Most compiler developers benchmark "real" applications, which tends to
steer the compiler toward optimizing those applications. The result is
that compilers used in the embedded systems market can often do some
very low-level optimization very well that would not be available, or
even considered, for compilers used in other applications.

For embedded systems specifically most if not all commercial compilers
have some mechanism to access the processor condition codes. Most
embedded system compilers do well at using the whole processor
instruction set.


Walter..
--
Walter Banks
Byte Craft Limited
http://www.bytecraft.com










#7833

From: David Brown <david@westcontrol.removethisbit.com>
Date: 2012-03-29 16:58 +0200
Message-ID: <foCdneZMYuYT5-nSnZ2dnUVZ8i6dnZ2d@lyse.net>
In reply to: #7830
On 29/03/2012 16:28, Walter Banks wrote:
>
>
> upsidedown@downunder.com wrote:
>
>> On 28 Mar 2012 22:44:32 GMT, Andrew Reilly<areilly---@bigpond.net.au>
>> wrote:
>>
>>> Weren't you the one that said that your (tuned) ARM C code was generally
>>> only a factor of 1.2 worse than the best hand-tweaked assembly code?
>>> Maybe not, but I've seen it said in these parts.  Certainly, my
>>> experience is that that is quite good rule of thumb, and it is very
>>> difficult to get more than a factor of two between assembler and C unless
>>> the platform in question has a very poor C compiler or the assembly code
>>> is actually implementing a different algorithm (which is sometimes
>>> possible, but much rarer in these days of well-supplied intrinsic
>>> function libraries.)
>>
>> The main problem trying to write _low_level_ math routines in C is
>> that you do not have access to the carry bit or use any rotate
>> instruction. The C-compiler would have to be very clever to convert a
>> sequence of C-statement into a single rotate instruction or shifting
>> multiple bits into two registers.
>
> C compilers have been gaining performance in part because compiler
> designers are targeting with both a target and a subset of applications
> in mind.
>
> Most compiler developers are benchmarking "real" applications that
> are tending to direct the compiler to optimize those applications. The
> result is compilers used in the embedded systems market can often
> do some very low level optimization very well that would not be
> available or even considered for compilers used in other applications.
>
> For embedded systems specifically most if not all commercial compilers
> have some mechanism to access the processor condition codes. Most
> embedded system compilers do well at using the whole processor
> instruction set.
>

I think it is a bit of an exaggeration to say this applies to "most" 
commercial compilers - and it is certainly not all.  I think it applies 
to a /few/ commercial compilers targeted at particularly small processors.

For larger processors, you don't get access to the condition codes from 
C - it would mess up the compiler code generation patterns too much. 
For big processors, a compiler needs to track condition codes over 
series of instructions - if the programmer can fiddle with condition 
codes in the middle of an instruction stream, the compiler would lose track.

Also for larger processors, there are often many instruction codes (or 
addressing modes) that are never generated by the compiler.  Some 
instructions are just too weird to map properly to C code, others cannot 
be expressed in C at all.  As a programmer, you access these using 
library calls, inline assembly, or "intrinsic" functions (which are 
simply ready-made inline assembly functions).

You write compilers targeted for small and limited processors, and have 
very fine-tuned optimisations and extensions to let developers squeeze 
maximum performance from such devices.  But don't judge "most if not 
all" commercial compilers by your own standards - most do not measure up 
in those areas.



#7823

From: David Brown <david@westcontrol.removethisbit.com>
Date: 2012-03-29 10:09 +0200
Message-ID: <-9SdnUO6as8qh-nSnZ2dnUVZ8l6dnZ2d@lyse.net>
In reply to: #7815
On 28/03/2012 19:20, Tim Wescott wrote:
> On Wed, 28 Mar 2012 09:17:14 +0200, David Brown wrote:
>
>> On 27/03/2012 19:02, David T. Ashley wrote:
>>> On Tue, 27 Mar 2012 18:52:09 +0300, upsidedown@downunder.com wrote:
>>>
>>>> On Tue, 27 Mar 2012 11:28:18 -0400, David T. Ashley
>>>> <dashley@gmail.com>   wrote:
>>>>
>>>>
>>>>> Without FPU support, assuming that the processor has basic integer
>>>>> multiplication instructions, integer operations are ALWAYS faster
>>>>> than floating-point operations.  Usually _far_ faster.  And always
>>>>> more precise.
>>>>
>>>> Floating point instructions MUL/DIV are trivial, just multiply/divide
>>>> the mantissa and add/sub the exponent.
>>>>
>>>> With FP add/sub you have to denormalize one operand and then normalize
>>>> the result, which can be quite time consuming, without sufficient HW
>>>> support.
>>>>
>>>> This can be really time consuming, if the HW is designed by an idiot.
>>>
>>> Your observations are valid.  But I have yet to see a practical example
>>> of something that can be done faster and with equal accuracy in
>>> floating point vs. using integer operations.
>>>
>>>
>> It depends on the chip, the type of floating point hardware it has, the
>> operations you need, the compiler, and the code quality.  For a lot of
>> heavy calculations done with integer arithmetic, you need a number of
>> "extra" instructions as well as the basic add, subtract, multiply and
>> divides.  You might need shifts for scaling, mask operations, extra code
>> to get the signs right, etc.  And the paths for these are likely to be
>> highly serialised, with each depending directly on the results of the
>> previous operation, which slows down pipelining.  With hardware floating
>> point, you have a much simpler instruction stream, which can result in
>> faster throughput even if the actual latency for the calculations is the
>> same.
>>
>> This effect increases with the size and complexity of the processor.
>> Obviously it is dependent on the processor having floating point
>> hardware for the precision needed (single or double), but once you have
>> any sort of hardware floating point you should re-check all your
>> assumptions about speed differences.  You could be wrong in either
>> direction.
>
> The key point is "it is dependent on the processor having floating point
> hardware for the precision needed".  And, I might add, on other things --
> see Walter Banks's comments in another sub-thread about 32-bit floating
> point vs. 32-bit integer math.
>

Yes (see my reply on that thread).

> In my experience with signal processing and control loops, having a
> library that implements fixed-point, fractional arithmetic with
> saturation on addition and shift-up is often faster than floating point
> _or_ "pure" integer math, and sidesteps a number of problems with both.
> It's at the cost of a learning curve with anyone using the package, but
> it works well.
>

When you add things like saturation into the mix, it gets more 
complicated.  That is going to be much less overhead for integer 
arithmetic than for floating point (unless you have a processor that has 
hardware support for floating point saturated instructions).
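
As a rough idea of what integer saturation costs when written in plain C, here is a minimal saturating 32-bit add. The function name is hypothetical. It widens to 64 bits for clarity; on cores with a saturating-add instruction (QADD on Cortex-M4, for example) an intrinsic would normally be used instead.

```c
#include <stdint.h>

/* Add two 32-bit values, clamping to the int32_t range instead of wrapping. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;   /* cannot overflow in 64 bits */

    if (sum > INT32_MAX)
        return INT32_MAX;
    if (sum < INT32_MIN)
        return INT32_MIN;
    return (int32_t)sum;
}
```

`sat_add32(INT32_MAX, 1)` returns INT32_MAX rather than wrapping to a large negative number, which is the behaviour a control loop usually wants from its integrator.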

But yes, a well-written library is normally going to be better than 
poorly written "direct" code, as well as saving you from having to get 
all the little details correct (you shouldn't worry about your code 
being fast until you are sure it is correct!).  A lot of ready-made 
libraries are not well written, however, or have other disadvantages. 
I've seen libraries that were compiled without optimisation - and were 
thus far slower than necessary.  And many libraries are full of 
hand-made assembly that is out of date, yet remains there for historic 
reasons even when it now does more harm than good.

Like everything in this field, there are no simple answers.

> On all the processors I've tried it except for x86 processors, there's
> been a 3-20x speedup once I've hand-written the assembly code to do the
> computation (and that's without understanding or trying to accommodate
> any pipelines that may exist).

While x86 typically means "desktop" rather than "embedded", there are 
steadily more powerful CPUs making their way into the embedded space. 
I've been using some PowerPC cores recently, and see there's a large 
number of factors that affect the real-world speed of the code.  Often 
floating point (when supported by hardware) will be faster than scaled 
integer code, and C code will often be much faster than hand-written 
assembly (because it is hard for the assembly programmer to track 
pipelines or to make full use of the core's weirder instructions).

>
> But on the x86 -- which is the _only_ processor that I've tried it that
> had floating point -- 32-bit fractional arithmetic is slower than 64-bit
> floating point.
>
> So, yes -- whether integer (or fixed point) arithmetic is going to be
> faster than floating point depends _a lot_ on the processor.  So instead
> of automatically deciding to do everything "the hard way" and feeling
> clever and virtuous thereby, you should _benchmark_ the performance of a
> code sample with floating point vs. whatever fixed-point poison you
> choose.

Absolutely.

>
> Then, even if fixed point is significantly faster, you should look at the
> time consumed by floating point and ask if it's really necessary to save
> that time: even cheapo 8-bit processors run pretty fast these days, and
> can implement fairly complex control laws at 10 or even 100Hz using
> double-precision floating point arithmetic.  If floating point will do,
> fixed point is a waste of effort.  And if floating point is _faster_,
> fixed point is just plain stupid.
>

It's always tempting to worry too much about speed, and work hard to get 
the fastest solution.  But if you've got code that works correctly, is 
high quality (clear, reliable, maintainable, etc.), and runs fast enough 
for the job - then you are finished.  It doesn't matter if you could run 
faster by switching to floating point or fixed point - good enough is 
good enough.

> So, benchmark, think, make an informed decision, and then that virtuous
> glow that surrounds you after you make your decision will be earned.
>

Yes.



#7875

From: Clifford Heath <cjh@no.spam.please.net>
Date: 2012-04-01 18:08 +1000
Message-ID: <m2Udr.26667$cd7.4320@newsfe06.iad>
In reply to: #7815
On 03/29/12 03:20, Tim Wescott wrote:
> But on the x86 -- which is the _only_ processor that I've tried it that
> had floating point -- 32-bit fractional arithmetic is slower than 64-bit
> floating point.

I think I recall that transition point occurring around 1994.

I was writing a scalable vector graphics subsystem, and carefully using
integer (sometimes fixed-point) math wherever possible, only to find that,
when I changed the basic type of the coordinate to float (or double, I
can't recall) the system actually rendered *faster*.

The integer unit was busy computing addresses and array offsets, and
being interrupted with *coordinate* math, while the FPU lay idle.

This was still in the Pentium days, before even the 686 and PII.

On a modern note, has anyone tried to use the TI OMAP ARM CPUs?
I haven't looked at the DSP instruction set, but the hardware FP is sweet.

Clifford Heath.



#7893

From: dp <dp@tgi-sci.com>
Date: 2012-03-28 02:38 -0700
Message-ID: <86a5f141-6e67-46e6-9f5a-0d0cab7b0903@l14g2000vbe.googlegroups.com>
In reply to: #7813
On Mar 28, 10:17 am, David Brown <da...@westcontrol.removethisbit.com>
wrote:
> ... And the paths for these are likely to be
> highly serialised, with each depending directly on the results of the
> previous operation, which slows down pipelining.  With hardware floating
> point, you have a much simpler instruction stream, which can result in
> faster throughput even if the actual latency for the calculations is the
> same.

Hi David,
this reminds me of something I went through not so long ago.
On the MPC5200B one gets a good FPU, and I was beginning to use it
for DSP purposes (using the FMADD.D opcode, 64 bit FP MAC, that is
64*64+64) .
It is specified at 2 cycles per FMADD opcode.
I did a loop with just one FMADD inside and guess what, I got 25
(or was it 35) cycles... Data dependencies, obviously. I had to
spread the loop over 24+ FP registers in order to eliminate the
data dependencies (well, and hid some of the data & coefficients
load/store as a bonus) and got an average of 5.5 cycles eventually
IIRC, well, somewhat <6 anyway (including memory accesses).
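
The same idea expressed in C, as a sketch only: this is not the PowerPC assembly described above, and the function name and the choice of four accumulators are assumptions for illustration. A single running sum makes every multiply-add wait for the previous one; several independent sums let the FPU pipeline overlap them.

```c
#include <stddef.h>

/* Dot product with four independent accumulators to break the
   dependency chain between consecutive multiply-adds. */
double dot4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)              /* leftover elements */
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the summation order differs from the straightforward loop, so results can differ in the last bits. How many accumulators are needed depends on the FPU latency, which is something to benchmark on the actual core.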

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/



#7819

From: upsidedown@downunder.com
Date: 2012-03-28 22:59 +0300
Message-ID: <oaq6n7dvoojjhkuslkt5jrrpfgookpd53b@4ax.com>
In reply to: #7796
On Tue, 27 Mar 2012 18:52:09 +0300, upsidedown@downunder.com wrote:

>On Tue, 27 Mar 2012 11:28:18 -0400, David T. Ashley
><dashley@gmail.com> wrote:
>
>>
>>Without FPU support, assuming that the processor has basic integer
>>multiplication instructions, integer operations are ALWAYS faster than
>>floating-point operations.  Usually _far_ faster.  And always more
>>precise.
>
>Floating point instructions MUL/DIV are trivial, just multiply/divide
>the mantissa and add/sub the exponent.

Assuming we are doing 64 bit double precision mul/div with an 8 bit
processor, the mantissa is 48-56 bits and hence a single cycle 8x8=16
bit multiply instruction helps a lot. In addition, the lowest part of
mantissa result (96-112 bits) is interesting only to see if this will
generate a carry to the most significant 48-56 bits. 
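
How the mantissa product is assembled from 8x8=16 multiplies can be shown at a smaller scale. The sketch below builds a 16x16=32 product from four 8-bit partial products; a 48- to 56-bit mantissa works the same way with more partial products. The function name is hypothetical, and the C is only a model of what the 8-bit assembly would do.

```c
#include <stdint.h>

/* 16x16 -> 32-bit multiply using only 8x8 -> 16-bit partial products. */
uint32_t mul16_from_8x8(uint16_t a, uint16_t b)
{
    uint8_t al = (uint8_t)(a & 0xFFu), ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)(b & 0xFFu), bh = (uint8_t)(b >> 8);

    uint32_t ll = (uint32_t)al * bl;   /* weight 2^0  */
    uint32_t lh = (uint32_t)al * bh;   /* weight 2^8  */
    uint32_t hl = (uint32_t)ah * bl;   /* weight 2^8  */
    uint32_t hh = (uint32_t)ah * bh;   /* weight 2^16 */

    return ll + ((lh + hl) << 8) + (hh << 16);
}
```

For a floating-point multiply only the upper half of the full product is kept, so the low partial products matter only through the carries they push upward.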

>With FP add/sub you have to denormalize one operand and then normalize
>the result, which can be quite time consuming, without sufficient HW
>support.

The denormalization of the smaller value can be done quite effectively
if the hardware supports shift right by N bits in a single
instruction. In fact it makes sense to first perform the right shift
by multiple of 8 bits by byte copy and then do the 1..7 bit right
shift by shift right instructions.

Unfortunately, the normalization after FP add/sub gets ugly. While
you can do the multiple of 8 shift with byte test and byte copying,
you still have to do the final left shift with a loop 1-7 times with
shift into carry and branch if carry set.

Again, if the hardware supports something like FindFirstBitSet
instruction in a single cycle, this will help the normalization a lot.
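
A sketch of that two-stage normalization on a 32-bit mantissa, assuming no find-first-set instruction is available: whole bytes first, then up to seven single-bit shifts. The function name is made up, and a real soft-float routine would also adjust the exponent by the returned count.

```c
#include <stdint.h>

/* Shift *m left until its top bit is set; return the number of bit
   positions shifted. A zero mantissa is left alone and returns 0. */
unsigned normalize32(uint32_t *m)
{
    unsigned shift = 0;

    if (*m == 0)
        return 0;

    while ((*m & 0xFF000000u) == 0) {   /* byte-wise: cheap on an 8-bit CPU */
        *m <<= 8;
        shift += 8;
    }
    while ((*m & 0x80000000u) == 0) {   /* at most 7 single-bit shifts */
        *m <<= 1;
        shift += 1;
    }
    return shift;
}
```

On hardware with a count-leading-zeros or find-first-set instruction, both loops collapse into one count and one shift.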

>This can be really time consuming, if the HW is designed by an idiot.

In the old days, I saw a lot of designs in which the design was
based on the available gates, not on the required functionality.
