
Re: NaN comparisons - Call For Anecdotes

Started by	"Anders J. Munch" <2014@jmunch.dk>
First post	2014-07-10 01:03 +0200
Last post	2014-07-19 05:49 +1000
Articles	13 — 6 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: NaN comparisons - Call For Anecdotes "Anders J. Munch" <2014@jmunch.dk> - 2014-07-10 01:03 +0200
    Re: NaN comparisons - Call For Anecdotes Johann Hibschman <jhibschman@gmail.com> - 2014-07-17 11:12 -0400
      Re: NaN comparisons - Call For Anecdotes Chris Angelico <rosuav@gmail.com> - 2014-07-18 01:36 +1000
        Re: NaN comparisons - Call For Anecdotes Johann Hibschman <jhibschman@gmail.com> - 2014-07-17 14:49 -0400
          Re: NaN comparisons - Call For Anecdotes Chris Angelico <rosuav@gmail.com> - 2014-07-18 04:55 +1000
            Re: NaN comparisons - Call For Anecdotes Marko Rauhamaa <marko@pacujo.net> - 2014-07-17 22:10 +0300
              Re: NaN comparisons - Call For Anecdotes Ian Kelly <ian.g.kelly@gmail.com> - 2014-07-17 14:39 -0600
                Re: NaN comparisons - Call For Anecdotes Marko Rauhamaa <marko@pacujo.net> - 2014-07-18 00:08 +0300
                  Re: NaN comparisons - Call For Anecdotes Ian Kelly <ian.g.kelly@gmail.com> - 2014-07-17 17:00 -0600
                  Re: NaN comparisons - Call For Anecdotes Ian Kelly <ian.g.kelly@gmail.com> - 2014-07-17 17:07 -0600
          Re: NaN comparisons - Call For Anecdotes Chris Angelico <rosuav@gmail.com> - 2014-07-18 04:59 +1000
        Re: NaN comparisons - Call For Anecdotes Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-07-18 17:57 +0000
          Re: NaN comparisons - Call For Anecdotes Chris Angelico <rosuav@gmail.com> - 2014-07-19 05:49 +1000

#74283 — Re: NaN comparisons - Call For Anecdotes

From	"Anders J. Munch" <2014@jmunch.dk>
Date	2014-07-10 01:03 +0200
Subject	Re: NaN comparisons - Call For Anecdotes
Message-ID	<mailman.11716.1404946997.18130.python-list@python.org>

Joel Goldstick wrote:
> I've been following along here, and it seems you haven't received the answer 
> you want or need.

So far I received exactly the answer I was expecting.  0 examples of NaN!=NaN 
being beneficial.
I wasn't asking for help, I was making a point.  Whether that will lead to 
improvement of Python, well, I'm not too optimistic, but I feel the point was 
worth making regardless.

regards, Anders


#74642

From	Johann Hibschman <jhibschman@gmail.com>
Date	2014-07-17 11:12 -0400
Message-ID	<osx38e010kj.fsf@gmail.com>
In reply to	#74283

"Anders J. Munch" <2014@jmunch.dk> writes:
> So far I received exactly the answer I was expecting.  0 examples of
> NaN!=NaN being beneficial.
> I wasn't asking for help, I was making a point.  Whether that will
> lead to improvement of Python, well, I'm not too optimistic, but I
> feel the point was worth making regardless.

Well, I just spotted this thread.  An easy example is, well, pretty much
any case where SQL NULL would be useful.  Say I have lists of borrowers,
the amount owed, and the amount they paid so far.

    nan = float("nan")
    borrowers = ["Alice", "Bob", "Clem", "Dan"]
    amount_owed = [100.0, nan, 200.0, 300.0]
    amount_paid = [100.0, nan, nan, 200.0]
    who_paid_off = [b for (b, ao, ap) in
                          zip(borrowers, amount_owed, amount_paid)
                      if ao == ap]

I want to just get Alice from that list, not Bob.  I don't know how much
Bob owes or how much he's paid, so I certainly don't know that he's paid
off his loan.
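
Running the snippet as posted gives the result described. The sketch below repeats it and, for contrast, adds a hypothetical variant (the `owed_none`, `paid_none` and `who_paid_off_none` names are not from the original post) that uses None as the missing marker. Because None == None is True, that variant would report Bob as paid off.

```python
nan = float("nan")
borrowers = ["Alice", "Bob", "Clem", "Dan"]
amount_owed = [100.0, nan, 200.0, 300.0]
amount_paid = [100.0, nan, nan, 200.0]

# NaN compares unequal to everything, itself included, so Bob is excluded.
who_paid_off = [b for (b, ao, ap) in
                zip(borrowers, amount_owed, amount_paid)
                if ao == ap]
print(who_paid_off)  # ['Alice']

# Hypothetical contrast: None == None is True, so Bob slips through.
owed_none = [100.0, None, 200.0, 300.0]
paid_none = [100.0, None, None, 200.0]
who_paid_off_none = [b for (b, ao, ap) in
                     zip(borrowers, owed_none, paid_none)
                     if ao == ap]
print(who_paid_off_none)  # ['Alice', 'Bob']
```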

Cheers,
Johann


#74644

From	Chris Angelico <rosuav@gmail.com>
Date	2014-07-18 01:36 +1000
Message-ID	<mailman.11929.1405611387.18130.python-list@python.org>
In reply to	#74642

On Fri, Jul 18, 2014 at 1:12 AM, Johann Hibschman <jhibschman@gmail.com> wrote:
> Well, I just spotted this thread.  An easy example is, well, pretty much
> any case where SQL NULL would be useful.  Say I have lists of borrowers,
> the amount owed, and the amount they paid so far.
>
>     nan = float("nan")
>     borrowers = ["Alice", "Bob", "Clem", "Dan"]
>     amount_owed = [100.0, nan, 200.0, 300.0]
>     amount_paid = [100.0, nan, nan, 200.0]
>     who_paid_off = [b for (b, ao, ap) in
>                           zip(borrowers, amount_owed, amount_paid)
>                       if ao == ap]
>
> I want to just get Alice from that list, not Bob.  I don't know how much
> Bob owes or how much he's paid, so I certainly don't know that he's paid
> off his loan.
>

But you also don't know that he hasn't. NaN doesn't mean "unknown", it
means "Not a Number". You need a more sophisticated system that allows
for uncertainty in your data. I would advise using either None or a
dedicated singleton (something like `unknown = object()` would work,
or you could make a custom type with a more useful repr), and probably
checking for it explicitly. It's entirely possible that you do
virtually identical (or virtually converse) checks but with different
handling of unknowns - for instance, you might have one check for "who
should be sent a loan reminder letter" in which you leave out all
unknowns, and another check for "which accounts should be flagged for
human checking" in which you keep the unknowns (and maybe ignore every
loan <100.0). You have a special business case here (the need to
record information with a "maybe" state), and you need to cope with
it, which means dedicated logic and planning and design and code.
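
As a rough illustration of the two checks described above, here is a minimal sketch using a dedicated singleton with a readable repr. The `Unknown` class, the `loans` data and the `is_known` helper are all made up for this example and are not from any real system.

```python
class Unknown:
    """Hypothetical marker for 'value not recorded'."""
    def __repr__(self):
        return "UNKNOWN"

UNKNOWN = Unknown()

# Invented sample data: name -> (amount owed, amount paid)
loans = {
    "Alice": (100.0, 100.0),
    "Bob": (UNKNOWN, UNKNOWN),
    "Clem": (200.0, UNKNOWN),
    "Dan": (300.0, 200.0),
}

def is_known(*values):
    """True only if none of the values is the UNKNOWN marker."""
    return all(v is not UNKNOWN for v in values)

# Reminder letters: only accounts we know to be underpaid.
send_reminder = [name for name, (owed, paid) in loans.items()
                 if is_known(owed, paid) and paid < owed]

# Human review: anything with a gap in the data.
needs_review = [name for name, (owed, paid) in loans.items()
                if not is_known(owed, paid)]

print(send_reminder)  # ['Dan']
print(needs_review)   # ['Bob', 'Clem']
```

The unknowns are checked for explicitly in both places, so each check decides for itself what a gap in the data should mean.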

ChrisA


#74673

From	Johann Hibschman <jhibschman@gmail.com>
Date	2014-07-17 14:49 -0400
Message-ID	<osxwqbbzuqc.fsf@gmail.com>
In reply to	#74644

Chris Angelico <rosuav@gmail.com> writes:

> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data.

Regardless of whether this is the right design, it's still an example of
use.

As to the design, using NaN to implement NA is a hack with a long
history, see

      http://www.numpy.org/NA-overview.html

for some color.  Using NaN gets us a hardware-accelerated implementation
with just about the right semantics.  In a real example, these lists are
numpy arrays with tens of millions of elements, so this isn't a trivial
benefit.  (Technically, that's what's in the database; a given analysis
may look at a sample of 100k or so.)

> You have a special business case here (the need to
> record information with a "maybe" state), and you need to cope with
> it, which means dedicated logic and planning and design and code.

Yes, in principle.  In practice, everyone is used to the semantics of
R-style missing data, which are reasonably well-matched by nan.  In
principle, (NA == 1.0) should be a NA (missing) truth value, as should
(NA == NA), but in practice having it be False is more useful.  As an
example, indexing R vectors by a boolean vector containing NA yields NA
results, which is a feature that I never want.

Cheers,
Johann


#74675

From	Chris Angelico <rosuav@gmail.com>
Date	2014-07-18 04:55 +1000
Message-ID	<mailman.11950.1405623342.18130.python-list@python.org>
In reply to	#74673

On Fri, Jul 18, 2014 at 4:49 AM, Johann Hibschman <jhibschman@gmail.com> wrote:
> Chris Angelico <rosuav@gmail.com> writes:
>
>> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
>> means "Not a Number". You need a more sophisticated system that allows
>> for uncertainty in your data.
>
> Regardless of whether this is the right design, it's still an example of
> use.

Sure it is. And you may well have earned yourself that beer. But I
don't put too much stock in hacks, at least as regards design
decisions elsewhere. It's a little dubious when you grant special
meaning to things and then use that meaning to justify the things'
semantics. I'd much rather find an example where, for instance,
numerical calculations might overflow to +inf or -inf, and then
further calculations can result in a nan, etc, etc. Those are the
sorts of examples that you'd find among SciPy users and such.
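
A small sketch of that kind of chain in plain Python floats (the variable names are only illustrative): a product overflows to an infinity, and a later subtraction of infinities produces a NaN that then propagates.

```python
import math

big = 1e308
pos = big * 10           # float multiplication overflows to inf
neg = -big * 10          # and to -inf in the other direction
diff = pos - pos         # inf - inf has no sensible value: nan
total = pos + neg        # inf + (-inf): nan as well
later = diff * 0 + 1     # once a nan appears, it carries through

print(pos, neg, diff, total)  # inf -inf nan nan
print(diff == diff)           # False
print(math.isnan(later))      # True
```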

ChrisA


#74679

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-07-17 22:10 +0300
Message-ID	<877g3b244v.fsf@elektro.pacujo.net>
In reply to	#74675

Chris Angelico <rosuav@gmail.com>:

> numerical calculations might overflow to +inf or -inf, and then
> further calculations can result in a nan, etc, etc. Those are the
> sorts of examples that you'd find among SciPy users and such.

There is some inconsistency.

Mathematically, there are undefined operations, for a good reason.
That's because the limits are not unambiguous and that's why 0/0, 0**0,
1/0 and inf-inf are undefined.

Why 0/0 results in an exception but inf-inf = nan, I don't see a
justification.


Marko


#74690

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2014-07-17 14:39 -0600
Message-ID	<mailman.11962.1405629625.18130.python-list@python.org>
In reply to	#74679

On Thu, Jul 17, 2014 at 1:10 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Mathematically, there are undefined operations, for a good reason.
> That's because the limits are not unambiguous and that's why 0/0, 0**0,
> 1/0 and inf-inf are undefined.

Well, 0**0 is usually defined as 1, despite the limits being
ambiguous.  Also, 1/0 in IEEE 754 is defined as inf.

> Why 0/0 results in an exception but inf-inf = nan, I don't see a
> justification.

I expect that float division by zero was made to raise an exception
for consistency with integer division by zero, where it might be
considered inappropriate to switch types and return inf or nan.
Granted that nowadays integer division returns a float anyway, but
there is still floor division to think about. Maybe this should have
been fixed in Python 3, but it wasn't.
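
For reference, this is how the cases discussed above behave in CPython 3. The `describe` helper is just a convenience invented for this sketch.

```python
inf = float("inf")

def describe(expr):
    """Evaluate a zero-argument callable, reporting any arithmetic error by name."""
    try:
        return repr(expr())
    except ArithmeticError as exc:
        return type(exc).__name__

print(describe(lambda: 0.0 / 0.0))   # ZeroDivisionError
print(describe(lambda: 1.0 / 0.0))   # ZeroDivisionError
print(describe(lambda: 1 // 0))      # ZeroDivisionError
print(describe(lambda: inf - inf))   # nan
print(describe(lambda: 0 ** 0))      # 1
print(describe(lambda: 0.0 ** 0.0))  # 1.0
```

So Python follows IEEE 754 for inf - inf and for 0.0 ** 0.0, but raises for every division by zero, float or integer.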


#74691

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-07-18 00:08 +0300
Message-ID	<87tx6fzob1.fsf@elektro.pacujo.net>
In reply to	#74690

Ian Kelly <ian.g.kelly@gmail.com>:

> Well, 0**0 is usually defined as 1, despite the limits being
> ambiguous.

<URL: https://www.math.hmc.edu/funfacts/ffiles/10005.3-5.shtml>

    But if it could be defined, what "should" it be? 0 or 1? 


Marko


#74699

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2014-07-17 17:00 -0600
Message-ID	<mailman.11968.1405638089.18130.python-list@python.org>
In reply to	#74691

On Thu, Jul 17, 2014 at 3:08 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Ian Kelly <ian.g.kelly@gmail.com>:
>
>> Well, 0**0 is usually defined as 1, despite the limits being
>> ambiguous.
>
> <URL: https://www.math.hmc.edu/funfacts/ffiles/10005.3-5.shtml>
>
>     But if it could be defined, what "should" it be? 0 or 1?

I did say "usually". There's not one single Holy Keeper of the
Definitions for mathematics. Wikipedia lists some differing opinions
on the subject:

http://en.wikipedia.org/wiki/Exponentiation#Zero_to_the_power_of_zero

I note that mathworld.wolfram.com also lists it as undefined, though.


#74700

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2014-07-17 17:07 -0600
Message-ID	<mailman.11969.1405638500.18130.python-list@python.org>
In reply to	#74691

On Thu, Jul 17, 2014 at 5:00 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Thu, Jul 17, 2014 at 3:08 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Ian Kelly <ian.g.kelly@gmail.com>:
>>
>>> Well, 0**0 is usually defined as 1, despite the limits being
>>> ambiguous.
>>
>> <URL: https://www.math.hmc.edu/funfacts/ffiles/10005.3-5.shtml>
>>
>>     But if it could be defined, what "should" it be? 0 or 1?
>
> I did say "usually". There's not one single Holy Keeper of the
> Definitions for mathematics. Wikipedia lists some differing opinions
> on the subject:
>
> http://en.wikipedia.org/wiki/Exponentiation#Zero_to_the_power_of_zero
>
> I note that mathworld.wolfram.com also lists it as undefined, though.

Incidentally, as noted in the Wikipedia article you left out some
options -- it's not just between 0 or 1. It's also possible to derive
a limit of positive infinity or any nonnegative real.


#74676

From	Chris Angelico <rosuav@gmail.com>
Date	2014-07-18 04:59 +1000
Message-ID	<mailman.11951.1405623598.18130.python-list@python.org>
In reply to	#74673

On Fri, Jul 18, 2014 at 4:49 AM, Johann Hibschman <jhibschman@gmail.com> wrote:
> In
> principle, (NA == 1.0) should be a NA (missing) truth value, as should
> (NA == NA), but in practice having it be False is more useful.

This is actually fairly easily implemented, if you ever want it.

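class NAType:
    """Missing-value marker: every comparison with it yields NA again."""
    def __repr__(self):
        return "NA"
    def __eq__(self, other):
        return self  # propagate "missing" instead of returning a bool
    # All other rich comparisons behave the same way.
    __lt__ = __gt__ = __le__ = __ge__ = __ne__ = __eq__

NA = NAType()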
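
A short usage sketch of the class above, repeated here so the block runs on its own. One caveat worth noting: a plain object is truthy by default, so `if NA == 1.0:` would take the branch. The `StrictNAType` subclass is a hypothetical addition, not part of the original suggestion, that refuses to be used as a truth value.

```python
class NAType:
    def __repr__(self):
        return "NA"
    def __eq__(self, other):
        return self
    __lt__ = __gt__ = __le__ = __ge__ = __ne__ = __eq__

NA = NAType()

print(NA == 1.0)        # NA
print(NA == NA)         # NA
print(1.0 < NA)         # NA (float defers to the reflected NA.__gt__)
print(bool(NA == 1.0))  # True: the default truthiness of any object

class StrictNAType(NAType):
    """Hypothetical variant: using NA in a boolean context is an error."""
    def __bool__(self):
        raise TypeError("truth value of NA is undefined")

strict_na = StrictNAType()
try:
    if strict_na == 1.0:
        pass
    strict_result = "no error"
except TypeError:
    strict_result = "TypeError"
print(strict_result)    # TypeError
```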

ChrisA


#74759

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-07-18 17:57 +0000
Message-ID	<53c95fef$0$9505$c3e8da3$5496439d@news.astraweb.com>
In reply to	#74644

On Fri, 18 Jul 2014 01:36:24 +1000, Chris Angelico wrote:

> On Fri, Jul 18, 2014 at 1:12 AM, Johann Hibschman <jhibschman@gmail.com>
> wrote:
>> Well, I just spotted this thread.  An easy example is, well, pretty
>> much any case where SQL NULL would be useful.  Say I have lists of
>> borrowers, the amount owed, and the amount they paid so far.
>>
>>     nan = float("nan")
>>     borrowers = ["Alice", "Bob", "Clem", "Dan"]
>>     amount_owed = [100.0, nan, 200.0, 300.0]
>>     amount_paid = [100.0, nan, nan, 200.0]
>>     who_paid_off = [b for (b, ao, ap) in
>>                           zip(borrowers, amount_owed, amount_paid)
>>                       if ao == ap]
>>
>> I want to just get Alice from that list, not Bob.  I don't know how
>> much Bob owes or how much he's paid, so I certainly don't know that
>> he's paid off his loan.
>>
>>
> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data. I would advise using either None or a
> dedicated singleton (something like `unknown = object()` would work, or
> you could make a custom type with a more useful repr)

Hmmm, there's something to what you say there, but IEEE-754 NANs seem to 
have been designed to do quadruple (at least!) duty with multiple 
meanings, including:

- Missing values ("I took a reading, but I can't read my handwriting").

- Data known only qualitatively, not quantitatively (e.g. windspeed =
  "fearsome").

- Inapplicable values, e.g. the average depth of the oceans on Mars.

- The result of calculations which are mathematically indeterminate,
  such as 0/0.

- The result of real-valued calculations which are invalid due to
  domain errors, such as sqrt(-1) or acos(2.5).

- The result of calculations which are conceptually valid, but are
  unknown due to limitations of floats, e.g. you have two finite
  quantities which have both overflowed to INF, the difference
  between them ought to be finite, but there's no way to tell what
  it should be.

It seems to me that the way you treat a NAN will often depend on which 
category it falls under. E.g. when taking the average of a set of values, 
missing values ought to be skipped over, while actual indeterminate NANs 
ought to carry through:

    average([1, 1, 1, Missing, 1]) => 1
    average([1, 1, 1, 0/0, 1]) => NAN
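
One possible reading of that notation, as a minimal sketch: `Missing` is a hypothetical sentinel object, and since 0/0 raises ZeroDivisionError in Python, a literal NaN stands in for the indeterminate result. Returning NaN for an empty input is an assumption of this sketch, not something stated above.

```python
class MissingType:
    """Hypothetical 'no reading available' marker, distinct from NaN."""
    def __repr__(self):
        return "Missing"

Missing = MissingType()
NAN = float("nan")

def average(values):
    """Mean of values, skipping Missing but letting NaN propagate."""
    present = [v for v in values if v is not Missing]
    if not present:
        return NAN  # assumption: the average of no values is NaN
    return sum(present) / len(present)

skipped = average([1, 1, 1, Missing, 1])
carried = average([1, 1, 1, NAN, 1])
print(skipped)  # 1.0
print(carried)  # nan
```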

I know that R distinguishes between NA and IEEE-754 NANs, although I'm 
not sure how complete its support for NANs is. But many (most?) R 
functions take an argument controlling whether or not to ignore NA values.

In principle, you can encode the different meanings into NANs using the 
payload. There are 9007199254740990 possible Python float NANs. About half of 
these are signalling NANs and the rest are quiet NANs. Ignoring the sign bit 
leaves us with 2251799813685247 distinct sNANs and 2251799813685248 qNANs. That's 
enough to encode a *lot* of different meanings.
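
As a rough sketch of what payload encoding could look like, the block below uses the `struct` module to build a quiet NaN with an integer payload and read it back. It assumes IEEE 754 binary64 floats with the usual convention that the top fraction bit marks a quiet NaN. The helper names and meaning codes are invented for illustration, and whether a payload survives arithmetic or a trip through other libraries is platform dependent.

```python
import math
import struct

QNAN_BITS = 0x7FF8000000000000   # exponent all ones, quiet bit set
PAYLOAD_MASK = (1 << 51) - 1     # the 51 fraction bits below the quiet bit

def make_nan(payload):
    """Build a quiet NaN carrying an integer payload (hypothetical helper)."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 51 bits")
    bits = QNAN_BITS | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def nan_payload(x):
    """Read the payload bits back out of a NaN (hypothetical helper)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK

MISSING, INAPPLICABLE = 1, 2     # hypothetical meaning codes

x = make_nan(INAPPLICABLE)
print(math.isnan(x))             # True
print(nan_payload(x) == INAPPLICABLE)
```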

[Aside: I find myself perplexed why IEEE-754 says that the sign bit of 
NANs should be ignored, but then specifies that another bit is to be used 
to distinguish signalling from quiet NANs. Why not just interpret NANs 
with the sign bit set as signalling, those with it clear as quiet?]

-- 
Steven


#74771

From	Chris Angelico <rosuav@gmail.com>
Date	2014-07-19 05:49 +1000
Message-ID	<mailman.12014.1405712948.18130.python-list@python.org>
In reply to	#74759

On Sat, Jul 19, 2014 at 3:57 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Hmmm, there's something to what you say there, but IEEE-754 NANs seem to
> have been designed to do quadruple (at least!) duty with multiple
> meanings, including:
>
> - Missing values ("I took a reading, but I can't read my handwriting").
>
> - Data known only qualitatively, not quantitatively (e.g. windspeed =
>   "fearsome").
>
> - Inapplicable values, e.g. the average depth of the oceans on Mars.
>
> - The result of calculations which are mathematically indeterminate,
>   such as 0/0.
>
> - The result of real-valued calculations which are invalid due to
>   domain errors, such as sqrt(-1) or acos(2.5).
>
> - The result of calculations which are conceptually valid, but are
>   unknown due to limitations of floats, e.g. you have two finite
>   quantities which have both overflowed to INF, the difference
>   between them ought to be finite, but there's no way to tell what
>   it should be.

Huh, okay. I thought the definition of NaN was based on the fourth one
(mathematically indeterminate) and then it logically accepted the
subsequent two (sqrt(-1) IMO is better handled by either a complex
number or a thrown error, but NaN does make some sense there;
definitely inf-inf => nan is as logical as 0/0 => nan). The first two
seem to be better handled by SQL's NULL value (or non-value, or
something, or maybe not something); the third is a bit trickier.
Although "the average of no values" is logically calculated as 0/0
(ergo NaN makes sense there), I would say NaN isn't really right for a
truly inapplicable value - for instance, recording the mass of a
non-physical object. In an inventory system, it's probably simplest to
use 0.0 to mean "non-physical item", but it might be worth
distinguishing between "physical item with sufficiently low mass that
it underflows our measurements" (like a single sheet of paper when
you're working with postal scales) and "non-physical item with no
meaningful mass" (like credit card fees). In that case, I'm not sure
that NaN is really appropriate to the situation, but would defer to
IEEE 754 on the subject.
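
On the sqrt(-1) point, Python's standard library already takes both of the alternatives mentioned above, depending on the module. A quick sketch (variable names are illustrative only):

```python
import cmath
import math

# The real-valued math module treats a domain error as an exception...
try:
    math.sqrt(-1)
    real_result = "no error"
except ValueError:
    real_result = "ValueError"
print(real_result)          # ValueError

# ...while cmath returns a complex number instead.
complex_root = cmath.sqrt(-1)
print(complex_root)         # 1j

# inf - inf, by contrast, quietly yields nan.
inf = float("inf")
indeterminate = inf - inf
print(indeterminate)        # nan
```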

Obviously it's possible to abuse anything to mean anything (I do
remember using nullable fields in DB2 to mean everything from "inherit
this value from parent" to "here be magic, code will work out the real
value on the fly"), but this is a question of intent and good design.

ChrisA

