| Started by | "Anders J. Munch" <2014@jmunch.dk> |
|---|---|
| First post | 2014-07-10 01:03 +0200 |
| Last post | 2014-07-19 05:49 +1000 |
| Articles | 13 — 6 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: NaN comparisons - Call For Anecdotes "Anders J. Munch" <2014@jmunch.dk> - 2014-07-10 01:03 +0200
Re: NaN comparisons - Call For Anecdotes Johann Hibschman <jhibschman@gmail.com> - 2014-07-17 11:12 -0400
Re: NaN comparisons - Call For Anecdotes Chris Angelico <rosuav@gmail.com> - 2014-07-18 01:36 +1000
Re: NaN comparisons - Call For Anecdotes Johann Hibschman <jhibschman@gmail.com> - 2014-07-17 14:49 -0400
Re: NaN comparisons - Call For Anecdotes Chris Angelico <rosuav@gmail.com> - 2014-07-18 04:55 +1000
Re: NaN comparisons - Call For Anecdotes Marko Rauhamaa <marko@pacujo.net> - 2014-07-17 22:10 +0300
Re: NaN comparisons - Call For Anecdotes Ian Kelly <ian.g.kelly@gmail.com> - 2014-07-17 14:39 -0600
Re: NaN comparisons - Call For Anecdotes Marko Rauhamaa <marko@pacujo.net> - 2014-07-18 00:08 +0300
Re: NaN comparisons - Call For Anecdotes Ian Kelly <ian.g.kelly@gmail.com> - 2014-07-17 17:00 -0600
Re: NaN comparisons - Call For Anecdotes Ian Kelly <ian.g.kelly@gmail.com> - 2014-07-17 17:07 -0600
Re: NaN comparisons - Call For Anecdotes Chris Angelico <rosuav@gmail.com> - 2014-07-18 04:59 +1000
Re: NaN comparisons - Call For Anecdotes Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-07-18 17:57 +0000
Re: NaN comparisons - Call For Anecdotes Chris Angelico <rosuav@gmail.com> - 2014-07-19 05:49 +1000
| From | "Anders J. Munch" <2014@jmunch.dk> |
|---|---|
| Date | 2014-07-10 01:03 +0200 |
| Subject | Re: NaN comparisons - Call For Anecdotes |
| Message-ID | <mailman.11716.1404946997.18130.python-list@python.org> |
Joel Goldstick wrote:
> I've been following along here, and it seems you haven't received the answer
> you want or need.
So far I received exactly the answer I was expecting. 0 examples of
NaN!=NaN being beneficial.
I wasn't asking for help, I was making a point. Whether that will
lead to improvement of Python, well, I'm not too optimistic, but I
feel the point was worth making regardless.
regards, Anders
| From | Johann Hibschman <jhibschman@gmail.com> |
|---|---|
| Date | 2014-07-17 11:12 -0400 |
| Message-ID | <osx38e010kj.fsf@gmail.com> |
| In reply to | #74283 |
"Anders J. Munch" <2014@jmunch.dk> writes:
> So far I received exactly the answer I was expecting. 0 examples of
> NaN!=NaN being beneficial.
> I wasn't asking for help, I was making a point. Whether that will
> lead to improvement of Python, well, I'm not too optimistic, but I
> feel the point was worth making regardless.
Well, I just spotted this thread. An easy example is, well, pretty much
any case where SQL NULL would be useful. Say I have lists of borrowers,
the amount owed, and the amount they paid so far.
  nan = float("nan")
  borrowers = ["Alice", "Bob", "Clem", "Dan"]
  amount_owed = [100.0, nan, 200.0, 300.0]
  amount_paid = [100.0, nan, nan, 200.0]
  who_paid_off = [b for (b, ao, ap) in
                  zip(borrowers, amount_owed, amount_paid)
                  if ao == ap]
I want to just get Alice from that list, not Bob. I don't know how much
Bob owes or how much he's paid, so I certainly don't know that he's paid
off his loan.
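As a quick check of the example above, here is a minimal runnable sketch. The `who_is_unknown` list and the use of `math.isnan` are illustrative additions, not part of the original post.

```python
import math

nan = float("nan")
borrowers = ["Alice", "Bob", "Clem", "Dan"]
amount_owed = [100.0, nan, 200.0, 300.0]
amount_paid = [100.0, nan, nan, 200.0]

# NaN compares unequal to everything, itself included, so Bob (nan, nan)
# and Clem (200.0, nan) both drop out of an equality filter.
who_paid_off = [b for b, ao, ap in zip(borrowers, amount_owed, amount_paid)
                if ao == ap]

# Hypothetical extra query: rows where either amount is missing.
who_is_unknown = [b for b, ao, ap in zip(borrowers, amount_owed, amount_paid)
                  if math.isnan(ao) or math.isnan(ap)]

print(who_paid_off)     # ['Alice']
print(who_is_unknown)   # ['Bob', 'Clem']
```

The equality filter silently excludes the unknown rows; finding them again takes an explicit `math.isnan` test, because `ao == nan` can never be true.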
Cheers,
Johann
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-07-18 01:36 +1000 |
| Message-ID | <mailman.11929.1405611387.18130.python-list@python.org> |
| In reply to | #74642 |
On Fri, Jul 18, 2014 at 1:12 AM, Johann Hibschman <jhibschman@gmail.com> wrote:
> Well, I just spotted this thread. An easy example is, well, pretty much
> any case where SQL NULL would be useful. Say I have lists of borrowers,
> the amount owed, and the amount they paid so far.
>
>   nan = float("nan")
>   borrowers = ["Alice", "Bob", "Clem", "Dan"]
>   amount_owed = [100.0, nan, 200.0, 300.0]
>   amount_paid = [100.0, nan, nan, 200.0]
>   who_paid_off = [b for (b, ao, ap) in
>                   zip(borrowers, amount_owed, amount_paid)
>                   if ao == ap]
>
> I want to just get Alice from that list, not Bob. I don't know how much
> Bob owes or how much he's paid, so I certainly don't know that he's paid
> off his loan.
>
But you also don't know that he hasn't. NaN doesn't mean "unknown", it
means "Not a Number". You need a more sophisticated system that allows
for uncertainty in your data. I would advise using either None or a
dedicated singleton (something like `unknown = object()` would work,
or you could make a custom type with a more useful repr), and probably
checking for it explicitly. It's entirely possible that you do
virtually identical (or virtually converse) checks but with different
handling of unknowns - for instance, you might have one check for "who
should be sent a loan reminder letter" in which you leave out all
unknowns, and another check for "which accounts should be flagged for
human checking" in which you keep the unknowns (and maybe ignore every
loan <100.0). You have a special business case here (the need to
record information with a "maybe" state), and you need to cope with
it, which means dedicated logic and planning and design and code.
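One way the sentinel approach might look, as a rough sketch only. The names `UNKNOWN`, `is_known`, `send_reminder`, and `needs_review` are hypothetical, and the data is the borrower example reused with the sentinel in place of NaN.

```python
class _Unknown:
    """Hypothetical singleton type with a readable repr."""
    def __repr__(self):
        return "UNKNOWN"

UNKNOWN = _Unknown()

# (name, amount owed, amount paid)
loans = [
    ("Alice", 100.0, 100.0),
    ("Bob", UNKNOWN, UNKNOWN),
    ("Clem", 200.0, UNKNOWN),
    ("Dan", 300.0, 200.0),
]

def is_known(*values):
    # Explicit identity check against the sentinel, as suggested in the post.
    return all(v is not UNKNOWN for v in values)

# Reminder letters: leave out every loan with an unknown amount.
send_reminder = [name for name, owed, paid in loans
                 if is_known(owed, paid) and paid < owed]

# Human review: keep exactly the loans with an unknown amount.
needs_review = [name for name, owed, paid in loans
                if not is_known(owed, paid)]

print(send_reminder)   # ['Dan']
print(needs_review)    # ['Bob', 'Clem']
```

The two queries share the same data but treat unknowns in opposite ways, which is the part a bare NaN comparison cannot express.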
ChrisA
| From | Johann Hibschman <jhibschman@gmail.com> |
|---|---|
| Date | 2014-07-17 14:49 -0400 |
| Message-ID | <osxwqbbzuqc.fsf@gmail.com> |
| In reply to | #74644 |
Chris Angelico <rosuav@gmail.com> writes:
> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data.
Regardless of whether this is the right design, it's still an example of
use.
As to the design, using NaN to implement NA is a hack with a long
history, see
http://www.numpy.org/NA-overview.html
for some color. Using NaN gets us a hardware-accelerated implementation
with just about the right semantics. In a real example, these lists are
numpy arrays with tens of millions of elements, so this isn't a trivial
benefit. (Technically, that's what's in the database; a given analysis
may look at a sample of 100k or so.)
> You have a special business case here (the need to
> record information with a "maybe" state), and you need to cope with
> it, which means dedicated logic and planning and design and code.
Yes, in principle. In practice, everyone is used to the semantics of
R-style missing data, which are reasonably well-matched by nan. In
principle, (NA == 1.0) should be a NA (missing) truth value, as should
(NA == NA), but in practice having it be False is more useful. As an
example, indexing R vectors by a boolean vector containing NA yields NA
results, which is a feature that I never want.
Cheers,
Johann
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-07-18 04:55 +1000 |
| Message-ID | <mailman.11950.1405623342.18130.python-list@python.org> |
| In reply to | #74673 |
On Fri, Jul 18, 2014 at 4:49 AM, Johann Hibschman <jhibschman@gmail.com> wrote:
> Chris Angelico <rosuav@gmail.com> writes:
>
>> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
>> means "Not a Number". You need a more sophisticated system that allows
>> for uncertainty in your data.
>
> Regardless of whether this is the right design, it's still an example of
> use.
Sure it is. And you may well have earned yourself that beer. But I
don't put too much stock in hacks, at least as regards design
decisions elsewhere. It's a little dubious when you grant special
meaning to things and then use that meaning to justify the things'
semantics. I'd much rather find an example where, for instance,
numerical calculations might overflow to +inf or -inf, and then
further calculations can result in a nan, etc, etc. Those are the
sorts of examples that you'd find among SciPy users and such.
ChrisA
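The overflow-then-NaN chain mentioned here can be reproduced with plain Python floats. This is a minimal sketch with arbitrary values; note that float multiplication overflows quietly to inf, whereas some other operations (such as `**`) raise `OverflowError` instead.

```python
import math

big = 1e308
overflowed = big * 10                   # exceeds the largest double: inf
difference = overflowed - overflowed    # inf - inf has no defined value: nan
downstream = difference * 0.0 + 1.0     # nan propagates through later arithmetic

print(overflowed, difference, downstream)   # inf nan nan
print(difference == difference)             # False
```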
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-07-17 22:10 +0300 |
| Message-ID | <877g3b244v.fsf@elektro.pacujo.net> |
| In reply to | #74675 |
Chris Angelico <rosuav@gmail.com>:
> numerical calculations might overflow to +inf or -inf, and then
> further calculations can result in a nan, etc, etc. Those are the
> sorts of examples that you'd find among SciPy users and such.
There is some inconsistency.
Mathematically, there are undefined operations, for a good reason.
That's because the limits are not unambiguous and that's why 0/0, 0**0,
1/0 and inf-inf are undefined.
Why 0/0 results in an exception but inf-inf = nan, I don't see a
justification.
Marko
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-07-17 14:39 -0600 |
| Message-ID | <mailman.11962.1405629625.18130.python-list@python.org> |
| In reply to | #74679 |
On Thu, Jul 17, 2014 at 1:10 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Mathematically, there are undefined operations, for a good reason.
> That's because the limits are not unambiguous and that's why 0/0, 0**0,
> 1/0 and inf-inf are undefined.
Well, 0**0 is usually defined as 1, despite the limits being ambiguous.
Also, 1/0 in IEEE 754 is defined as inf.
> Why 0/0 results in an exception but inf-inf = nan, I don't see a
> justification.
I expect that float division by zero was made to raise an exception
for consistency with integer division by zero, where it might be
considered inappropriate to switch types and return inf or nan.
Granted that nowadays integer division returns a float anyway, but
there is still floor division to think about. Maybe this should have
been fixed in Python 3, but it wasn't.
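The Python behaviours under discussion can be checked directly. A small sketch follows; `try_divide` is a hypothetical helper that only exists to capture the exception.

```python
import math

inf = float("inf")

def try_divide(a, b):
    """Return a / b, or the string 'ZeroDivisionError' if Python raises."""
    try:
        return a / b
    except ZeroDivisionError:
        return "ZeroDivisionError"

print(try_divide(0.0, 0.0))   # ZeroDivisionError (IEEE 754 would give nan)
print(try_divide(1.0, 0.0))   # ZeroDivisionError (IEEE 754 would give inf)
print(inf - inf)              # nan, with no exception
print(0 ** 0, 0.0 ** 0.0)     # 1 1.0
```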
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-07-18 00:08 +0300 |
| Message-ID | <87tx6fzob1.fsf@elektro.pacujo.net> |
| In reply to | #74690 |
Ian Kelly <ian.g.kelly@gmail.com>:
> Well, 0**0 is usually defined as 1, despite the limits being
> ambiguous.
<URL: https://www.math.hmc.edu/funfacts/ffiles/10005.3-5.shtml>
But if it could be defined, what "should" it be? 0 or 1?
Marko
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-07-17 17:00 -0600 |
| Message-ID | <mailman.11968.1405638089.18130.python-list@python.org> |
| In reply to | #74691 |
On Thu, Jul 17, 2014 at 3:08 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Ian Kelly <ian.g.kelly@gmail.com>:
>
>> Well, 0**0 is usually defined as 1, despite the limits being
>> ambiguous.
>
> <URL: https://www.math.hmc.edu/funfacts/ffiles/10005.3-5.shtml>
>
> But if it could be defined, what "should" it be? 0 or 1?
I did say "usually". There's not one single Holy Keeper of the
Definitions for mathematics. Wikipedia lists some differing opinions
on the subject:
http://en.wikipedia.org/wiki/Exponentiation#Zero_to_the_power_of_zero
I note that mathworld.wolfram.com also lists it as undefined, though.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-07-17 17:07 -0600 |
| Message-ID | <mailman.11969.1405638500.18130.python-list@python.org> |
| In reply to | #74691 |
On Thu, Jul 17, 2014 at 5:00 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Thu, Jul 17, 2014 at 3:08 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Ian Kelly <ian.g.kelly@gmail.com>:
>>
>>> Well, 0**0 is usually defined as 1, despite the limits being
>>> ambiguous.
>>
>> <URL: https://www.math.hmc.edu/funfacts/ffiles/10005.3-5.shtml>
>>
>> But if it could be defined, what "should" it be? 0 or 1?
>
> I did say "usually". There's not one single Holy Keeper of the
> Definitions for mathematics. Wikipedia lists some differing opinions
> on the subject:
>
> http://en.wikipedia.org/wiki/Exponentiation#Zero_to_the_power_of_zero
>
> I note that mathworld.wolfram.com also lists it as undefined, though.
Incidentally, as noted in the Wikipedia article you left out some
options -- it's not just between 0 or 1. It's also possible to derive
a limit of positive infinity or any nonnegative real.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-07-18 04:59 +1000 |
| Message-ID | <mailman.11951.1405623598.18130.python-list@python.org> |
| In reply to | #74673 |
On Fri, Jul 18, 2014 at 4:49 AM, Johann Hibschman <jhibschman@gmail.com> wrote:
> In
> principle, (NA == 1.0) should be a NA (missing) truth value, as should
> (NA == NA), but in practice having it be False is more useful.
This is actually fairly easily implemented, if you ever want it.
class NAType:
    def __repr__(self): return "NA"
    def __eq__(self, other): return self
    __lt__ = __gt__ = __le__ = __ge__ = __ne__ = __eq__
NA = NAType()
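A usage sketch of the class above. The `__bool__` method is an assumption added here for illustration and is not part of the original suggestion: without it, `NA` is truthy like any ordinary object, so `if x == NA:` would always take the branch.

```python
class NAType:
    def __repr__(self):
        return "NA"

    def __eq__(self, other):
        # Every comparison involving NA yields NA rather than True/False.
        return self

    __lt__ = __gt__ = __le__ = __ge__ = __ne__ = __eq__

    def __bool__(self):
        # Assumption (not in the post): refuse to act as a plain truth
        # value, so accidental `if x == NA:` fails loudly.
        raise TypeError("NA has no truth value")

NA = NAType()

print(NA == 1.0)   # NA
print(1.0 < NA)    # NA, via the reflected NA.__gt__(1.0)
print(NA == NA)    # NA
```

Comparisons with the float on the left still work because `float` returns `NotImplemented` for an unknown type and Python then tries the reflected method on `NA`.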
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-07-18 17:57 +0000 |
| Message-ID | <53c95fef$0$9505$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #74644 |
On Fri, 18 Jul 2014 01:36:24 +1000, Chris Angelico wrote:
> On Fri, Jul 18, 2014 at 1:12 AM, Johann Hibschman <jhibschman@gmail.com>
> wrote:
>> Well, I just spotted this thread. An easy example is, well, pretty
>> much any case where SQL NULL would be useful. Say I have lists of
>> borrowers, the amount owed, and the amount they paid so far.
>>
>>   nan = float("nan")
>>   borrowers = ["Alice", "Bob", "Clem", "Dan"]
>>   amount_owed = [100.0, nan, 200.0, 300.0]
>>   amount_paid = [100.0, nan, nan, 200.0]
>>   who_paid_off = [b for (b, ao, ap) in
>>                   zip(borrowers, amount_owed, amount_paid)
>>                   if ao == ap]
>>
>> I want to just get Alice from that list, not Bob. I don't know how
>> much Bob owes or how much he's paid, so I certainly don't know that
>> he's paid off his loan.
>>
>>
> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data. I would advise using either None or a
> dedicated singleton (something like `unknown = object()` would work, or
> you could make a custom type with a more useful repr)
Hmmm, there's something to what you say there, but IEEE-754 NANs seem to
have been designed to do quadruple (at least!) duty with multiple
meanings, including:
- Missing values ("I took a reading, but I can't read my handwriting").
- Data known only qualitatively, not quantitatively (e.g. windspeed =
"fearsome").
- Inapplicable values, e.g. the average depth of the oceans on Mars.
- The result of calculations which are mathematically indeterminate,
such as 0/0.
- The result of real-valued calculations which are invalid due to
domain errors, such as sqrt(-1) or acos(2.5).
- The result of calculations which are conceptually valid, but are
unknown due to limitations of floats, e.g. you have two finite
quantities which have both overflowed to INF, the difference
between them ought to be finite, but there's no way to tell what
it should be.
It seems to me that the way you treat a NAN will often depend on which
category it falls under. E.g. when taking the average of a set of values,
missing values ought to be skipped over, while actual indeterminate NANs
ought to carry through:
average([1, 1, 1, Missing, 1]) => 1
average([1, 1, 1, 0/0, 1]) => NAN
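A minimal sketch of those two behaviours, assuming a hypothetical `Missing` sentinel object. Since `0/0` raises in Python, `float("nan")` stands in for the indeterminate value.

```python
import math

class _Missing:
    def __repr__(self):
        return "Missing"

Missing = _Missing()   # hypothetical sentinel, distinct from NaN

def average(values):
    # Skip missing entries, but let genuine NaNs flow through the sum.
    present = [v for v in values if v is not Missing]
    if not present:
        return math.nan   # average of no values, conceptually 0/0
    return sum(present) / len(present)

nan = float("nan")
print(average([1, 1, 1, Missing, 1]))   # 1.0
print(average([1, 1, 1, nan, 1]))       # nan
```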
I know that R distinguishes between NA and IEEE-754 NANs, although I'm
not sure how complete its support for NANs is. But many (most?) R
functions take an argument controlling whether or not to ignore NA values.
In principle, you can encode the different meanings into NANs using the
payload. There are 9007199254740990 possible Python float NANs. About half of
these are signalling NANs, half are quiet NANs. Ignoring the sign bit
leaves us with 2251799813685247 distinct sNANs and 2251799813685248 qNANs. That's
enough to encode a *lot* of different meanings.
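As a rough illustration of the payload idea, the sketch below builds quiet NaNs with a chosen payload using `struct`. The helper names and the `MISSING`/`INAPPLICABLE` codes are hypothetical. It assumes IEEE 754 binary64 floats and a platform that leaves quiet-NaN bits untouched when moving values around; arithmetic on such NaNs is not guaranteed to preserve the payload.

```python
import math
import struct

QUIET_NAN_BITS = 0x7FF8_0000_0000_0000   # exponent all ones + quiet bit
PAYLOAD_MASK = (1 << 51) - 1             # remaining 51 mantissa bits

def make_nan(payload):
    """Build a quiet NaN carrying `payload` in its low mantissa bits."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 51 bits")
    bits = QUIET_NAN_BITS | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def nan_payload(x):
    """Extract the payload bits from a NaN."""
    if not math.isnan(x):
        raise ValueError("not a NaN")
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK

# Hypothetical meaning codes
MISSING, INAPPLICABLE = 1, 2

reading = make_nan(MISSING)
print(math.isnan(reading), nan_payload(reading))
```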
[Aside: I find myself perplexed why IEEE-754 says that the sign bit of
NANs should be ignored, but then specifies that another bit is to be used
to distinguish signalling from quiet NANs. Why not just interpret NANs
with the sign bit set as signalling, those with it clear as quiet?]
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-07-19 05:49 +1000 |
| Message-ID | <mailman.12014.1405712948.18130.python-list@python.org> |
| In reply to | #74759 |
On Sat, Jul 19, 2014 at 3:57 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Hmmm, there's something to what you say there, but IEEE-754 NANs seem to
> have been designed to do quadruple (at least!) duty with multiple
> meanings, including:
>
> - Missing values ("I took a reading, but I can't read my handwriting").
>
> - Data known only qualitatively, not quantitatively (e.g. windspeed =
> "fearsome").
>
> - Inapplicable values, e.g. the average depth of the oceans on Mars.
>
> - The result of calculations which are mathematically indeterminate,
> such as 0/0.
>
> - The result of real-valued calculations which are invalid due to
> domain errors, such as sqrt(-1) or acos(2.5).
>
> - The result of calculations which are conceptually valid, but are
> unknown due to limitations of floats, e.g. you have two finite
> quantities which have both overflowed to INF, the difference
> between them ought to be finite, but there's no way to tell what
> it should be.
Huh, okay. I thought the definition of NaN was based on the fourth one
(mathematically indeterminate) and then it logically accepted the
subsequent two (sqrt(-1) IMO is better handled by either a complex
number or a thrown error, but NaN does make some sense there;
definitely inf-inf => nan is as logical as 0/0 => nan). The first two
seem to be better handled by SQL's NULL value (or non-value, or
something, or maybe not something); the third is a bit trickier.
Although "the average of no values" is logically calculated as 0/0
(ergo NaN makes sense there), I would say NaN isn't really right for a
truly inapplicable value - for instance, recording the mass of a
non-physical object. In an inventory system, it's probably simplest to
use 0.0 to mean "non-physical item", but it might be worth
distinguishing between "physical item with sufficiently low mass that
it underflows our measurements" (like a single sheet of paper when
you're working with postal scales) and "non-physical item with no
meaningful mass" (like credit card fees). In that case, I'm not sure
that NaN is really appropriate to the situation, but would defer to
IEEE 754 on the subject.
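One possible way to keep the two inventory cases apart, sketched with `None` rather than NaN. The `Item` class and its field names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    name: str
    mass_kg: Optional[float]   # None: mass is not a meaningful property

inventory = [
    Item("sheet of paper", 0.0),     # physical, but below the scale's resolution
    Item("credit card fee", None),   # non-physical, no meaningful mass
    Item("textbook", 1.2),
]

# Only physical items take part in mass calculations.
physical = [item for item in inventory if item.mass_kg is not None]
total_mass = sum(item.mass_kg for item in physical)

print([item.name for item in physical])   # ['sheet of paper', 'textbook']
print(total_mass)                         # 1.2
```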
Obviously it's possible to abuse anything to mean anything (I do
remember using nullable fields in DB2 to mean everything from "inherit
this value from parent" to "here be magic, code will work out the real
value on the fly"), but this is a question of intent and good design.
ChrisA