


NULL as the empty string

Started by: jacobnavia <jacob@jacob.remcomp.fr>
First post: 2017-11-21 23:52 +0100
Last post: 2017-12-15 09:18 -0800
Articles: 20 on this page of 91; 21 participants



Contents

  NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-11-21 23:52 +0100
    Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-11-21 15:16 -0800
      Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-11-22 00:38 +0100
        Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-11-21 16:02 -0800
          Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-11-22 01:13 +0100
            Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-11-21 16:52 -0800
        Re: NULL as the empty string Robert Wessel <robertwessel2@yahoo.com> - 2017-11-21 18:09 -0600
        Re: NULL as the empty string Siri Cruise <chine.bleu@yahoo.com> - 2017-11-21 16:34 -0800
        Re: NULL as the empty string David Brown <david.brown@hesbynett.no> - 2017-11-22 12:12 +0100
      Re: NULL as the empty string supercat@casperkitty.com - 2017-11-21 15:57 -0800
        Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-11-22 01:06 +0100
          Re: NULL as the empty string supercat@casperkitty.com - 2017-11-22 15:42 -0800
            Re: NULL as the empty string Melzzzzz <Melzzzzz@zzzzz.com> - 2017-11-22 23:49 +0000
              Re: NULL as the empty string supercat@casperkitty.com - 2017-11-22 15:56 -0800
                Re: NULL as the empty string Melzzzzz <Melzzzzz@zzzzz.com> - 2017-11-23 00:06 +0000
                  Re: NULL as the empty string supercat@casperkitty.com - 2017-11-23 17:31 -0800
                    Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-11-24 09:42 +0100
                      Re: NULL as the empty string supercat@casperkitty.com - 2017-11-24 13:47 -0800
      Re: NULL as the empty string Jorgen Grahn <grahn+nntp@snipabacken.se> - 2017-11-22 06:46 +0000
      Re: NULL as the empty string John Bode <jfbode1029@gmail.com> - 2017-12-08 10:27 -0800
        Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-08 11:11 -0800
          Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-08 21:39 +0100
            Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-08 13:03 -0800
              Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-08 22:50 +0100
                Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-08 15:19 -0800
                  Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-09 00:35 +0100
                    Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-08 16:05 -0800
                      Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-09 01:22 +0100
                        Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-08 17:39 -0800
                        Re: NULL as the empty string John Bode <jfbode1029@gmail.com> - 2017-12-11 12:22 -0800
                      Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-09 01:29 +0100
                        Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-08 17:47 -0800
                          Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-09 07:05 +0100
                            Re: NULL as the empty string David Brown <david.brown@hesbynett.no> - 2017-12-09 18:37 +0100
                            Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-09 11:53 -0800
                              Re: NULL as the empty string supercat@casperkitty.com - 2017-12-12 10:49 -0800
                                Re: NULL as the empty string Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2017-12-12 13:39 -0800
                                  Re: NULL as the empty string supercat@casperkitty.com - 2017-12-12 16:05 -0800
                                    Re: NULL as the empty string Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2017-12-13 03:43 -0800
                                      Re: NULL as the empty string supercat@casperkitty.com - 2017-12-13 08:45 -0800
                                      Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-13 09:12 -0800
                                        Re: NULL as the empty string Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2017-12-13 13:27 -0800
                                          Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-13 14:02 -0800
                                            Re: NULL as the empty string asetofsymbols@gmail.com - 2017-12-13 14:58 -0800
                                              Re: NULL as the empty string asetofsymbols@gmail.com - 2017-12-13 15:11 -0800
                                            Re: NULL as the empty string Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2017-12-14 03:49 -0800
                                              Re: NULL as the empty string mark.bluemel@gmail.com - 2017-12-14 04:05 -0800
                                              Re: NULL as the empty string David Brown <david.brown@hesbynett.no> - 2017-12-14 13:09 +0100
                                                Re: NULL as the empty string Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2017-12-14 05:02 -0800
                                                  Re: NULL as the empty string David Brown <david.brown@hesbynett.no> - 2017-12-14 14:54 +0100
                                                Re: NULL as the empty string supercat@casperkitty.com - 2017-12-14 07:38 -0800
                                                  Re: NULL as the empty string Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2017-12-14 09:50 -0800
                                              Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-14 09:20 -0800
                                                Re: NULL as the empty string supercat@casperkitty.com - 2017-12-14 09:53 -0800
                                                  Re: NULL as the empty string Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2017-12-14 12:57 -0800
                                        Re: NULL as the empty string herrmannsfeldt@gmail.com - 2017-12-14 17:22 -0800
                                          Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-14 17:26 -0800
        Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-08 21:23 +0100
          Re: NULL as the empty string Malcolm McLean <malcolm.arthur.mclean@gmail.com> - 2017-12-08 13:41 -0800
            Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-08 22:54 +0100
    Re: NULL as the empty string supercat@casperkitty.com - 2017-11-21 15:17 -0800
      Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-11-22 00:26 +0100
        Re: NULL as the empty string supercat@casperkitty.com - 2017-11-21 16:03 -0800
    Re: NULL as the empty string "Pascal J. Bourguignon" <pjb@informatimago.com> - 2017-11-22 00:27 +0100
      Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-11-22 00:42 +0100
        Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-11-21 16:05 -0800
          Re: NULL as the empty string herrmannsfeldt@gmail.com - 2017-12-06 22:33 -0800
            Re: NULL as the empty string supercat@casperkitty.com - 2017-12-07 12:04 -0800
              Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-07 23:20 +0100
                Re: NULL as the empty string supercat@casperkitty.com - 2017-12-07 15:04 -0800
    Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-11-21 15:28 -0800
    Re: NULL as the empty string Thiago Adams <thiago.adams@gmail.com> - 2017-11-21 16:04 -0800
    Re: NULL as the empty string Siri Cruise <chine.bleu@yahoo.com> - 2017-11-21 16:25 -0800
      Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-11-22 01:34 +0100
    Re: NULL as the empty string bartc <bc@freeuk.com> - 2017-11-22 00:36 +0000
    Re: NULL as the empty string Öö Tiib <ootiib@hot.ee> - 2017-11-21 23:07 -0800
    NULL as the empty string asetofsymbols@gmail.com - 2017-11-23 22:23 -0800
    Re: NULL as the empty string Geoff <geoff@invalid.invalid> - 2017-12-09 09:05 -0800
      Re: NULL as the empty string Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2017-12-09 12:40 -0500
    Re: NULL as the empty string gordonb.yj0bc@burditt.org (Gordon Burditt) - 2017-12-09 13:50 -0600
    Re: NULL as the empty string Ian Collins <ian-news@hotmail.com> - 2017-12-10 08:59 +1300
      Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-09 12:22 -0800
        Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-11 01:42 +0100
          Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-10 19:20 -0800
            Re: NULL as the empty string jacobnavia <jacob@jacob.remcomp.fr> - 2017-12-11 18:56 +0100
              Re: NULL as the empty string Keith Thompson <kst-u@mib.org> - 2017-12-11 11:19 -0800
          Re: NULL as the empty string supercat@casperkitty.com - 2017-12-15 09:29 -0800
            Re: NULL as the empty string Thiago Adams <thiago.adams@gmail.com> - 2018-01-05 08:28 -0800
              Re: NULL as the empty string supercat@casperkitty.com - 2018-01-05 09:37 -0800
                Re: NULL as the empty string Thiago Adams <thiago.adams@gmail.com> - 2018-01-05 17:08 -0800
      Re: NULL as the empty string supercat@casperkitty.com - 2017-12-15 09:18 -0800



#124293

From: Keith Thompson <kst-u@mib.org>
Date: 2017-12-13 09:12 -0800
Message-ID: <lnh8suo8kl.fsf@kst-u.example.com>
In reply to: #124277
Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
[...]
> Assuming that NaN represents missing values, taking the mean makes sense.
> Simply ignore those values. Taking the sum is a slippier concept and
> you don't want it handled at hardware level or in the language definition.
> It belongs in the high-level code written by statistical people, who
> may not know much about optimising compilers.

The mean is by definition the sum divided by the count.  *If* the
specification calls for NaNs to be ignored, then they should be
ignored whether you're computing a sum or a mean.  (If all entries
are NaNs, then I suppose the sum would be 0 and the mean would
be undefined.)
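
A minimal sketch of that rule in C, assuming the specification really does say "ignore NaNs". The struct and function names (nan_stats, stats_ignoring_nan) are made up for illustration. The sum and the mean are taken over the same non-NaN entries, so an all-NaN input gives a sum of 0 and a mean reported as NaN to stand for "undefined".

```c
#include <math.h>
#include <stddef.h>

/* Result of summarising an array while skipping NaN entries. */
struct nan_stats {
    double sum;    /* sum of the non-NaN entries; 0.0 if there are none */
    double mean;   /* sum / count, or NaN when count == 0 (undefined) */
    size_t count;  /* number of non-NaN entries */
};

/* Hypothetical helper: ignores NaNs for both the sum and the mean. */
struct nan_stats stats_ignoring_nan(const double *v, size_t n)
{
    struct nan_stats r = { 0.0, 0.0, 0 };

    for (size_t i = 0; i < n; i++) {
        if (!isnan(v[i])) {
            r.sum += v[i];
            r.count++;
        }
    }
    /* With no usable entries the mean has no value; report that as NaN. */
    r.mean = r.count ? r.sum / (double)r.count : NAN;
    return r;
}
```

For (2, NaN, 4) this gives a sum of 6, a count of 2 and a mean of 3.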

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#124305

From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Date: 2017-12-13 13:27 -0800
Message-ID: <542d2a4c-15fb-4040-a766-4638b66c7ea4@googlegroups.com>
In reply to: #124293
On Wednesday, December 13, 2017 at 5:13:10 PM UTC, Keith Thompson wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> [...]
> > Assuming that NaN represents missing values, taking the mean makes sense.
> > Simply ignore those values. Taking the sum is a slippier concept and
> > you don't want it handled at hardware level or in the language definition.
> > It belongs in the high-level code written by statistical people, who
> > may not know much about optimising compilers.
> 
> The mean is by definition the sum divided by the count.  *If* the
> specification calls for NaNs to be ignored, then they should be
> ignored whether you're computing a sum or a mean.  (If all entries
> are NaNs, then I suppose the sum would be 0 and the mean would
> be undefined.)
> 
For "sum", it makes more sense to add the mean rather than 0 if NaNs
represent missing values. 



#124306

From: Keith Thompson <kst-u@mib.org>
Date: 2017-12-13 14:02 -0800
Message-ID: <ln8te6nv5z.fsf@kst-u.example.com>
In reply to: #124305
Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> On Wednesday, December 13, 2017 at 5:13:10 PM UTC, Keith Thompson wrote:
>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>> [...]
>> > Assuming that NaN represents missing values, taking the mean makes sense.
>> > Simply ignore those values. Taking the sum is a slippier concept and
>> > you don't want it handled at hardware level or in the language definition.
>> > It belongs in the high-level code written by statistical people, who
>> > may not know much about optimising compilers.
>> 
>> The mean is by definition the sum divided by the count.  *If* the
>> specification calls for NaNs to be ignored, then they should be
>> ignored whether you're computing a sum or a mean.  (If all entries
>> are NaNs, then I suppose the sum would be 0 and the mean would
>> be undefined.)
>> 
> For "sum", it makes more sense to add the mean rather than 0 if NaNs
> represent missing values. 

So the sum of (2, NaN, 4) should be 9?

If that's what the specification calls for, fine, but I can't imagine any
reason to assume it.  The whole idea of a NaN is that it propagates
through calculations.  2+NaN+4 should be NaN -- or *maybe* 6 if you
decide to ignore NaNs.

For that matter, the "+" operator computes the sum of (a sequence of)
two values.  Should 2+NaN yield 2?

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#124308

From: asetofsymbols@gmail.com
Date: 2017-12-13 14:58 -0800
Message-ID: <cdd2bd75-4ec4-4f88-af23-aecef44b77b8@googlegroups.com>
In reply to: #124306
b=2+sqrt(c*d/e)-3
If c*d/e is NAN or infinite (an error symbol),
b has to be NAN. The error has to propagate until the result is printed, so that one can see there was an error when printing b, or any number built from b.



#124309

From: asetofsymbols@gmail.com
Date: 2017-12-13 15:11 -0800
Message-ID: <6c435236-529b-4faa-bf25-5c417b56f240@googlegroups.com>
In reply to: #124308
This could also work for detecting overflow in unsigned arithmetic, if one value in the range of unsigned is reserved for errors (for example -1 == 0xFFFFFFFF). The error value 0xFFFFFFFF would then show up in the end result, because it propagates through the formulas.
For example:
x=b*c+d
and b*c overflows unsigned => x==-1

 unsigned x=NAN => x==-1
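
C's built-in unsigned arithmetic does not do this (it wraps modulo 2^N), so the idea needs helper functions. A rough sketch, where U_ERR, u_add and u_mul are made-up names and UINT_MAX plays the role of the reserved 0xFFFFFFFF (which assumes a 32-bit unsigned):

```c
#include <limits.h>

/* Reserved error marker; the largest valid value is U_ERR - 1. */
#define U_ERR UINT_MAX

/* Addition that yields U_ERR on overflow or if either input is U_ERR. */
unsigned u_add(unsigned a, unsigned b)
{
    if (a == U_ERR || b == U_ERR)
        return U_ERR;                 /* propagate an earlier error */
    if (a > (U_ERR - 1u) - b)
        return U_ERR;                 /* result would not fit below U_ERR */
    return a + b;
}

/* Multiplication with the same error rules as u_add. */
unsigned u_mul(unsigned a, unsigned b)
{
    if (a == U_ERR || b == U_ERR)
        return U_ERR;                 /* propagate an earlier error */
    if (a != 0u && b > (U_ERR - 1u) / a)
        return U_ERR;                 /* result would not fit below U_ERR */
    return a * b;
}
```

With these, x = u_add(u_mul(b, c), d) ends up as U_ERR whenever b*c overflows, which is the x==-1 case above.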



#124315

From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Date: 2017-12-14 03:49 -0800
Message-ID: <cf245e3f-5d19-468e-9941-39e28e0a8810@googlegroups.com>
In reply to: #124306
On Wednesday, December 13, 2017 at 10:02:52 PM UTC, Keith Thompson wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> > On Wednesday, December 13, 2017 at 5:13:10 PM UTC, Keith Thompson wrote:
> >> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> >> [...]
> >> > Assuming that NaN represents missing values, taking the mean makes sense.
> >> > Simply ignore those values. Taking the sum is a slippier concept and
> >> > you don't want it handled at hardware level or in the language definition.
> >> > It belongs in the high-level code written by statistical people, who
> >> > may not know much about optimising compilers.
> >> 
> >> The mean is by definition the sum divided by the count.  *If* the
> >> specification calls for NaNs to be ignored, then they should be
> >> ignored whether you're computing a sum or a mean.  (If all entries
> >> are NaNs, then I suppose the sum would be 0 and the mean would
> >> be undefined.)
> >> 
> > For "sum", it makes more sense to add the mean rather than 0 if NaNs
> > represent missing values. 
> 
> So the sum of (2, NaN, 4) should be 9?
> 
> If that's what the specification calls for fine, but I can't imagine any
> reason to assume it.  The whole idea of a NaN is that it propagates
> through calculations.  2+NaN+4 should be Nan -- or *maybe* 6 if you
> decide to ignore NaNs.
>
There's an argument for NaN because NaN propagates. But that's not very
helpful for most real datasets. If NaN represents missing data, and the
missing data is drawn at random from the same population as the 
present data, 9 is our best estimate of what the sum of three values
will be. 
> 
> For that matter, the "+" operator computes the some of (a sequence of)
> two values.  Should 2+NaN yield 2?
> 
It just depends. Sometimes yes, for example when we are counting things
and we want to know how many we have definitely identified. If we're
taxing apples, and we don't have any data for farmer Giles' apple
orchard, but we do for farmer Joe, we can send a tax demand for
farmer Joe, but farmer Giles will have to be flagged up as missing
and sent a demand later when the information comes in. However if we
know the area each farmer has under cultivation, and we know that
the harvest varies from year to year but is fairly constant per tree
per year, we can obtain a much better estimate than simply assigning
the missing farmers the mean. Then maybe farmers with big farms
take longer to count their apples than farmers with small farms, so
the missing data is not in fact a random sample of the population -
you have to be very alert to that sort of thing.

The point I'm making is that this shouldn't be handled at machine
or language level. It's a much higher-level consideration than that.



#124316

From: mark.bluemel@gmail.com
Date: 2017-12-14 04:05 -0800
Message-ID: <3408d343-aa5e-480f-a6f7-38f58cf2a57a@googlegroups.com>
In reply to: #124315
On Thursday, 14 December 2017 11:49:29 UTC, Malcolm McLean  wrote:
> On Wednesday, December 13, 2017 at 10:02:52 PM UTC, Keith Thompson wrote:

> > So the sum of (2, NaN, 4) should be 9?

> If NaN represents missing data, and the
> missing data is drawn at random from the same population as the 
> present data, 9 is our best estimate of what the sum of three values
> will be. 

Heh! Not in my case, as the set should be 2, 2.82842712, 4
(there is a reasonably coherent explanation, which I will leave
the reader to derive).



#124318

From: David Brown <david.brown@hesbynett.no>
Date: 2017-12-14 13:09 +0100
Message-ID: <p0tpkv$is3$1@dont-email.me>
In reply to: #124315
On 14/12/17 12:49, Malcolm McLean wrote:
> On Wednesday, December 13, 2017 at 10:02:52 PM UTC, Keith Thompson wrote:
>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>>> On Wednesday, December 13, 2017 at 5:13:10 PM UTC, Keith Thompson wrote:
>>>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>>>> [...]
>>>>> Assuming that NaN represents missing values, taking the mean makes sense.
>>>>> Simply ignore those values. Taking the sum is a slippier concept and
>>>>> you don't want it handled at hardware level or in the language definition.
>>>>> It belongs in the high-level code written by statistical people, who
>>>>> may not know much about optimising compilers.
>>>>
>>>> The mean is by definition the sum divided by the count.  *If* the
>>>> specification calls for NaNs to be ignored, then they should be
>>>> ignored whether you're computing a sum or a mean.  (If all entries
>>>> are NaNs, then I suppose the sum would be 0 and the mean would
>>>> be undefined.)
>>>>
>>> For "sum", it makes more sense to add the mean rather than 0 if NaNs
>>> represent missing values. 
>>
>> So the sum of (2, NaN, 4) should be 9?
>>
>> If that's what the specification calls for fine, but I can't imagine any
>> reason to assume it.  The whole idea of a NaN is that it propagates
>> through calculations.  2+NaN+4 should be Nan -- or *maybe* 6 if you
>> decide to ignore NaNs.
>>
> There's an argument for NaN because NaN propagates. But that's not very
> helpful for most real datasets. If NaN represents missing data, and the
> missing data is drawn at random from the same population as the 
> present data, 9 is our best estimate of what the sum of three values
> will be. 

NaN is used to indicate an error - that something has gone wrong in your
calculations, data was invalid, out of bounds, etc.  There are only two
sensible options that I can see - propagate it, or use it to cause a
trap, exception, error message, early exit, etc.

It would be crazy to try to define behaviour for it like you are
suggesting - because there is /no/ correct behaviour to define.  Are you
seriously trying to tell us that you want 2 + NaN + 4 to equal 9 when
you write it like that, because you want the NaN to be the average of 2
and 4, but in the expression 2 + NaN + 4 + 9 you would want the NaN to
be 5 as the average of 2, 4 and 9, and therefore 2 + NaN + 4 is now 11?

Sometimes you want to work with datasets with potentially missing, bad,
or inaccurate data.  A single floating point number is not sufficient
there - nor is normal floating point arithmetic.  You need more
sophisticated tracking of the metadata that depends on the task in hand.

>>
>> For that matter, the "+" operator computes the some of (a sequence of)
>> two values.  Should 2+NaN yield 2?
>>
> The point I'm making is that this shouldn't be handled at machine
> or language level. It's a much higher-level consideration than that.
> 

Exactly.  Thus it makes no sense whatsoever to give a definition of how
to treat NaNs (other than as signalling an error).  It has to be handled
at a higher level - not by using the mean, or a random sample, or 0, or
whatever.



#124319

From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Date: 2017-12-14 05:02 -0800
Message-ID: <ba6098b6-98ed-404e-81e2-60e83a2b2be3@googlegroups.com>
In reply to: #124318
On Thursday, December 14, 2017 at 12:09:11 PM UTC, David Brown wrote:
> On 14/12/17 12:49, Malcolm McLean wrote:
> >
> Sometimes you want to work with datasets with potentially missing, bad,
> or inaccurate data.  A single floating point number is not sufficient
> there - nor is normal floating point arithmetic.  You need more
> sophisticated tracking of the metadata that depends on the task in hand.
> 
For example, I've got a CSV file loader that contains a function

double csv_readfield(CSV *csv, int column, int row);

It returns NaN if the data is missing, which is allowed in CSV files.
Whilst at some level that's an error, quite likely the caller expects
missing values.
But NaN won't always mean "missing data"; it can also mean data
which is invalid in some way, e.g. someone tried to take the square root
of a negative number.
>
> > The point I'm making is that this shouldn't be handled at machine
> > or language level. It's a much higher-level consideration than that.
> > 
> 
> Exactly.  Thus it makes no sense whatsoever to give a definition of how
> to treat NaNs (other than as signalling an error).  It has to be handled
> at a higher level - not by using the mean, or a random sample, or 0, or
> whatever.
>
So "sum" is a poor name for our function. Really we want sum_finite
(the sum of all finite numbers in the list), sum_estimate (replacing
NaNs with the mean as best guess), sum_arithmetical (generating a
NaN if passed NaNs or both + and - infinity). Which one we choose
depends on what the data actually means; if we just look at it as
a list of context-free real numbers there is no answer.
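
As a rough sketch, assuming IEEE 754 doubles, the three could look like this. Only the names come from the text above. The signatures and the choice to treat infinities as ordinary values in sum_estimate are assumptions.

```c
#include <math.h>
#include <stddef.h>

/* Sum of the finite entries only; NaNs and infinities are skipped. */
double sum_finite(const double *v, size_t n)
{
    double s = 0.0;

    for (size_t i = 0; i < n; i++)
        if (isfinite(v[i]))
            s += v[i];
    return s;
}

/* Each NaN is replaced by the mean of the non-NaN entries, which is the
 * same as mean * n. Returns NaN if there is nothing to estimate from. */
double sum_estimate(const double *v, size_t n)
{
    double s = 0.0;
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {
        if (!isnan(v[i])) {
            s += v[i];
            k++;
        }
    }
    return k ? s / (double)k * (double)n : NAN;
}

/* Plain floating-point sum: any NaN, or +infinity together with
 * -infinity, makes the result NaN. */
double sum_arithmetical(const double *v, size_t n)
{
    double s = 0.0;

    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

On (2, NaN, 4) these return 6, 9 and NaN respectively, which is the whole disagreement in three function names.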



#124321

From: David Brown <david.brown@hesbynett.no>
Date: 2017-12-14 14:54 +0100
Message-ID: <p0tvr6$r5$1@dont-email.me>
In reply to: #124319
On 14/12/17 14:02, Malcolm McLean wrote:
> On Thursday, December 14, 2017 at 12:09:11 PM UTC, David Brown wrote:
>> On 14/12/17 12:49, Malcolm McLean wrote:
>>>
>> Sometimes you want to work with datasets with potentially missing, bad,
>> or inaccurate data.  A single floating point number is not sufficient
>> there - nor is normal floating point arithmetic.  You need more
>> sophisticated tracking of the metadata that depends on the task in hand.
>>
> For example I've got a csv file loader that contains a function
> 
> double csv_readfield(CSV *csv, int column, int row);
> 
> it returns NaN if the data is missing, which is allowed in CSV files.
> Whilst at some level that's an error, quite likely caller expects
> missing values.

If the caller expects missing data, then the function is badly specified
- because it has no way to return that information.  NaN is not the same
as missing data, just as missing data is not the same as 0.

> But NaN won't always mean "missing data", it can also mean data
> which is invalid in some way, e.g. someone tried to take the square root
> of a negative number

Exactly.  You clearly understand the problem - you are simply failing to
appreciate the consequences of it, or ways to avoid it.

When you have a function declared like the one above, there is no way to
indicate these different possibilities for a lack of valid data.  Thus
the higher level code that calls the function can't make any decision
on it.  If it receives a NaN from the function, all it can do is give up
with an error "bad data".  In some circumstances, that's fine of course.
But what it /cannot/ sensibly do is decide on some arbitrary
interpretation.  It does not know if the NaN is due to a missing number,
a syntax error, a real NaN in the CSV file, an invalid row or column, or
any one of a number of mistakes.

>>
>>> The point I'm making is that this shouldn't be handled at machine
>>> or language level. It's a much higher-level consideration than that.
>>>
>>
>> Exactly.  Thus it makes no sense whatsoever to give a definition of how
>> to treat NaNs (other than as signalling an error).  It has to be handled
>> at a higher level - not by using the mean, or a random sample, or 0, or
>> whatever.
>>
> So "sum" is a poor name for our function. Really we want sum_finite
> (the sum of all finite numbers in the list), sum_estimate (replacing
> NaNs with the mean as best guess), sum_arithmetical (generating a
> NaN if passed NaNs or both + and - infinity). Which one we choose
> depends on what the data actually means, if we just look at it as
> a list of context-free real numbers there is no answer.
> 

That would be one way to handle things.  You need to /specify/ your
functions appropriately.  You have to decide if you are going to return
generic errors, or pass back more information (like distinguishing
between forms of missing or incorrect data), or treat the missing data
in some specific way.  The one thing that never makes sense is to pick
an arbitrary method and fail to document it.
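
One possible shape for "pass back more information", as a sketch only. It works on the text of a single field rather than on the CSV type from the declaration above (whose details are not shown), and the names field_status and parse_field are made up. The status is returned separately from the value, so "missing", "malformed" and "a real NaN in the file" stay distinct. An invalid row or column would need a further status value and is not modelled here.

```c
#include <ctype.h>
#include <stdlib.h>

enum field_status {
    FIELD_OK,        /* a number was parsed (possibly a literal NaN) */
    FIELD_MISSING,   /* the field was empty or only whitespace */
    FIELD_MALFORMED  /* the text is not a valid number */
};

/* Parses one field; *out is written only when FIELD_OK is returned. */
enum field_status parse_field(const char *text, double *out)
{
    const char *p = text;
    char *end;
    double v;

    while (isspace((unsigned char)*p))
        p++;
    if (*p == '\0')
        return FIELD_MISSING;

    v = strtod(p, &end);
    if (end == p)
        return FIELD_MALFORMED;       /* no characters were converted */
    while (isspace((unsigned char)*end))
        end++;
    if (*end != '\0')
        return FIELD_MALFORMED;       /* trailing junk after the number */

    *out = v;
    return FIELD_OK;
}
```

The caller can then decide, per task, whether FIELD_MISSING means "skip", "estimate" or "give up", and that decision is visible in the calling code instead of hidden inside a NaN.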


As an example of how to get things wrong, try putting a table like this
into a spreadsheet and drawing a graph:

x    | y
=====|======
1    | 1000
2    | 1001
3    | 999
4    | =1/0
5    | 1002
6    | 1001
7    | 1000
8    | 1001

Do that in LibreOffice Calc, and the graph will scale the y axis to
around 1000, and show a pair of broken lines in full detail.  Do it in
MS Excel, and the program will treat the bad value as 0 and give you a
useless graph scaled 0 to around 1000 on the y axis.



#124324

From: supercat@casperkitty.com
Date: 2017-12-14 07:38 -0800
Message-ID: <2f3ef77f-7dcf-497b-8137-22e4e32838d2@googlegroups.com>
In reply to: #124318
On Thursday, December 14, 2017 at 6:09:11 AM UTC-6, David Brown wrote:
> On 14/12/17 12:49, Malcolm McLean wrote:
> > There's an argument for NaN because NaN propagates. But that's not very
> > helpful for most real datasets. If NaN represents missing data, and the
> > missing data is drawn at random from the same population as the 
> > present data, 9 is our best estimate of what the sum of three values
> > will be. 
> 
> NaN is used to indicate an error - that something has gone wrong in your
> calculations, data was invalid, out of bounds, etc.  There are only two
> sensible options that I can see - propagate it, or use it to cause a
> trap, exception, error message, early exit, etc.

If one is expecting that data will be complete, the fact that part of it
isn't shows a problem.  If, however, one expects that data might not be
complete but wants to process what one can, being able to handle those
parts that are complete and effectively ignore the ones that aren't may be
useful.

If one has asked everyone to send in a report of how many apples they will
be able to supply, some people may report that they can't send in any, while
others might simply not answer.  Distinguishing "replied zero" from "didn't
reply" will be useful for some purposes, but if one wants a lower bound on
the number of apples to expect, a total that assumes people who didn't reply
won't send any may be more useful than simply saying NaN unless or until all
replies are received.



#124329

From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Date: 2017-12-14 09:50 -0800
Message-ID: <94bef1d8-c7ea-4068-a176-3955dddc0619@googlegroups.com>
In reply to: #124324
On Thursday, December 14, 2017 at 3:38:32 PM UTC, supe...@casperkitty.com wrote:
> On Thursday, December 14, 2017 at 6:09:11 AM UTC-6, David Brown wrote:
> > On 14/12/17 12:49, Malcolm McLean wrote:
> > > There's an argument for NaN because NaN propagates. But that's not very
> > > helpful for most real datasets. If NaN represents missing data, and the
> > > missing data is drawn at random from the same population as the 
> > > present data, 9 is our best estimate of what the sum of three values
> > > will be. 
> > 
> > NaN is used to indicate an error - that something has gone wrong in your
> > calculations, data was invalid, out of bounds, etc.  There are only two
> > sensible options that I can see - propagate it, or use it to cause a
> > trap, exception, error message, early exit, etc.
> 
> If one is expecting that data will be complete, the fact that part of it
> isn't shows a problem.  If, however, one expects that data might not be
> complete but wants to process what one can, being able to handle those
> parts that are complete and effectively ignore the ones that aren't may be
> useful.
> 
> If one has asked everyone to send in a report of how many apples they will
> be able to supply, some people may report that they can't send in any, while
> others might simply not answer.  Distinguishing "replied zero" from "didn't
> reply" will be useful for some purposes, but if one wants a lower bound on
> the number of apples to expect, a total that assumes people who didn't reply
> won't send any may be more useful than simply saying NaN unless or until all
> replies are received.
>
Yes, "we didn't harvest any apples at all this year due to a fungal
disease" is quite different from "I don't know how many apples we will
harvest because the season is late this year".



#124328

From: Keith Thompson <kst-u@mib.org>
Date: 2017-12-14 09:20 -0800
Message-ID: <ln4lotns58.fsf@kst-u.example.com>
In reply to: #124315
Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> On Wednesday, December 13, 2017 at 10:02:52 PM UTC, Keith Thompson wrote:
>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
[...]
>> > For "sum", it makes more sense to add the mean rather than 0 if NaNs
>> > represent missing values. 
>> 
>> So the sum of (2, NaN, 4) should be 9?
>> 
>> If that's what the specification calls for fine, but I can't imagine any
>> reason to assume it.  The whole idea of a NaN is that it propagates
>> through calculations.  2+NaN+4 should be Nan -- or *maybe* 6 if you
>> decide to ignore NaNs.
>>
> There's an argument for NaN because NaN propagates. But that's not very
> helpful for most real datasets. If NaN represents missing data, and the
> missing data is drawn at random from the same population as the 
> present data, 9 is our best estimate of what the sum of three values
> will be. 

As I said, that's fine if that's what the specification calls for.
I'd never consider returning a "sum" of 9 (which is a guess relying
on speculation about what the NaN entry means) without an explicit
specification.

[...]

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#124330

From: supercat@casperkitty.com
Date: 2017-12-14 09:53 -0800
Message-ID: <1ba8a8bf-208e-4fb6-9638-6d450a4681e3@googlegroups.com>
In reply to: #124328
On Thursday, December 14, 2017 at 11:20:17 AM UTC-6, Keith Thompson wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> > On Wednesday, December 13, 2017 at 10:02:52 PM UTC, Keith Thompson wrote:
> >> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> [...]
> >> > For "sum", it makes more sense to add the mean rather than 0 if NaNs
> >> > represent missing values. 
> >> 
> >> So the sum of (2, NaN, 4) should be 9?
> >> 
> >> If that's what the specification calls for, fine, but I can't imagine any
> >> reason to assume it.  The whole idea of a NaN is that it propagates
> >> through calculations.  2+NaN+4 should be NaN -- or *maybe* 6 if you
> >> decide to ignore NaNs.
> >>
> > There's an argument for NaN because NaN propagates. But that's not very
> > helpful for most real datasets. If NaN represents missing data, and the
> > missing data is drawn at random from the same population as the 
> > present data, 9 is our best estimate of what the sum of three values
> > will be. 
> 
> As I said, that's fine if that's what the specification calls for.
> I'd never consider returning a "sum" of 9 (which is a guess relying
> on speculation about what the NaN entry means) without an explicit
> specification.

Naturally, a function to compute a sum while ignoring NaN values should
document that behavior.  The choice to use such a function rather than one
which would indicate the lack of complete data should be made by the high-
level application, but the choice of what functions a language or framework
should provide must be made at the language or framework level.  The decision
to include or omit a function with given semantics should be based on how
often such semantics would be useful, and not be affected by how many
situations would exist where they would be inappropriate.  If there are
many cases where other semantics would be more helpful, that would imply
that a function should also exist with those semantics.  If two different
kinds of semantics would each be useful in many cases, a good language or
framework should provide both.



#124337

From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Date: 2017-12-14 12:57 -0800
Message-ID: <a397bb27-36a7-4bc1-bb8a-70f4a131b74d@googlegroups.com>
In reply to: #124330
On Thursday, December 14, 2017 at 5:53:26 PM UTC, supe...@casperkitty.com wrote:

> Naturally, a function to compute a sum while ignoring NaN values should
> document that behavior.  The choice to use such a function rather than one
> which would indicate the lack of complete data should be made by the high-
> level application, but the choice of what functions a language or framework
> should provide must be made at the language or framework level.  The decision
> to include or omit a function with given semantics should be based on how
> often such semantics would be useful, and not be affected by how many
> situations would exist where they would be inappropriate.  If there are
> many cases where other semantics would be more helpful, that would imply
> that a function should also exist with those semantics.  If two different
> kinds of semantics would each be useful in many cases, a good language or
> framework should provide both.
> 
A naive "sum" function is simply an addition loop.
So what happens when passed a NaN? To estimate from the mean implies
that the function takes two passes over the data, which is too hard
for a compiler to implement. So the choices are to treat NaN as
zero, or to propagate NaNs. Neither is ideal but propagation of NaNs
at least signals an error condition.
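
The three policies discussed in this subthread can be sketched in C as follows. This is only an illustration: the function names are hypothetical, and the third function assumes that each NaN stands for a missing value to be replaced by the mean of the values that are present. Under that assumption the estimate equals the partial sum scaled by n/present, so this sketch gets by with a single pass over the data.

```c
#include <math.h>
#include <stddef.h>

/* Policy 1: plain addition loop; any NaN input makes the result NaN. */
double sum_propagate(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Policy 2: skip NaN entries, i.e. treat them as contributing zero. */
double sum_skip_nan(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        if (!isnan(v[i]))
            total += v[i];
    return total;
}

/* Policy 3 (assumption: NaN means "missing, drawn from the same
   population"): replace each NaN by the mean of the present entries.
   Returns 0 for an empty array and NaN if every entry is NaN. */
double sum_impute_mean(const double *v, size_t n)
{
    double total = 0.0;
    size_t present = 0;
    for (size_t i = 0; i < n; i++) {
        if (!isnan(v[i])) {
            total += v[i];
            present++;
        }
    }
    if (present == 0)
        return n == 0 ? 0.0 : NAN;
    return total * (double)n / (double)present;
}
```

For the (2, NaN, 4) example from earlier in the thread, these return NaN, 6, and 9 respectively. Which one is appropriate remains a question for the specification.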



#124343

From: herrmannsfeldt@gmail.com
Date: 2017-12-14 17:22 -0800
Message-ID: <c34a8ea0-661e-4834-996c-8b109d09b6c7@googlegroups.com>
In reply to: #124293
On Wednesday, December 13, 2017 at 9:13:10 AM UTC-8, Keith Thompson wrote:

(snip)

> The mean is by definition the sum divided by the count.  *If* the
> specification calls for NaNs to be ignored, then they should be
> ignored whether you're computing a sum or a mean.  (If all entries
> are NaNs, then I suppose the sum would be 0 and the mean would
> be undefined.)

And the mean of zero values is?



#124344

From: Keith Thompson <kst-u@mib.org>
Date: 2017-12-14 17:26 -0800
Message-ID: <lnshccn5na.fsf@kst-u.example.com>
In reply to: #124343
herrmannsfeldt@gmail.com writes:
> On Wednesday, December 13, 2017 at 9:13:10 AM UTC-8, Keith Thompson wrote:
>
> (snip)
>
>> The mean is by definition the sum divided by the count.  *If* the
>> specification calls for NaNs to be ignored, then they should be
>> ignored whether you're computing a sum or a mean.  (If all entries
>> are NaNs, then I suppose the sum would be 0 and the mean would
>> be undefined.)
>
> And the mean of zero values is?

Undefined, unless you're working with a specification that says
otherwise.  (Were you expecting something else?)
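
One way to make the "undefined" case explicit in code is to report it through the return value rather than pick a number. The sketch below assumes a specification that says NaNs are ignored; the name mean_skip_nan and its interface are hypothetical.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: mean of the non-NaN entries of v[0..n-1].
   On success stores the mean in *out and returns true.
   If no entry is present (n == 0 or all NaN) the mean is undefined:
   *out is left untouched and the function returns false. */
bool mean_skip_nan(const double *v, size_t n, double *out)
{
    double total = 0.0;
    size_t present = 0;
    for (size_t i = 0; i < n; i++) {
        if (!isnan(v[i])) {
            total += v[i];
            present++;
        }
    }
    if (present == 0)
        return false;
    *out = total / (double)present;
    return true;
}
```

With this interface the caller has to decide what an empty or all-NaN input means, which keeps the guess out of the library function.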

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#124022

From: jacobnavia <jacob@jacob.remcomp.fr>
Date: 2017-12-08 21:23 +0100
Message-ID: <p0esck$n2o$1@dont-email.me>
In reply to: #124016
On 08/12/2017 at 19:27, John Bode wrote:
> On Tuesday, November 21, 2017 at 5:16:29 PM UTC-6, Keith Thompson wrote:
>> jacobnavia <jacob@jacob.remcomp.fr> writes:
>>> What would happen if we decide to give meaning to NULL?
>>
>> NULL (more precisely a null pointer) has a meaning.  It's a pointer
>> value that doesn't point to anything.
>>
> 
> I think I get what Jacob is getting at - instead of NULL being a macro that expands
> to a 0, have it be a special keyword with special, context-dependent semantics.  That is,
> a call against strlen or strcmp with NULL would be interpreted differently by the compiler,
> which would emit code that immediately returned a 0 or false result, without having to
> actually evaluate anything, or store an actual empty string or pointer to an empty
> string.
> 
> If that's what Jacob actually means, well, it feels to me like a solution to a problem that
> isn't really a problem.  It's a micro-optimization.  Yes, if you have thousands of distinct
> pointers to thousands of distinct empty strings, it will add up.  But, having thousands of
> distinct pointers to thousands of distinct empty strings sounds like a fairly esoteric use
> case to begin with.
> 

Yes, it was rather a question of principle. NULL has many uses, and one 
of them is "absence of data". I.e. an empty string. It is empty, no data 
is there.

NULL is different from any other string since it is empty.

This is obviously a micro-optimization, and with gigabytes of RAM available, 
it would take a gargantuan number of "\0" strings to make any difference.
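
For comparison, the "NULL reads as the empty string" semantics can be approximated today with ordinary wrapper functions. This is only a sketch: the names strlen_or_empty and strcmp_or_empty are hypothetical, and it is a library-level workaround, not the compiler-level keyword described in the quoted text. Passing a null pointer to the standard strlen or strcmp is undefined behavior, so the wrappers test for it first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: length of s, with a null pointer read as "". */
size_t strlen_or_empty(const char *s)
{
    return s ? strlen(s) : 0;
}

/* Hypothetical wrapper: compare a and b, with null pointers read as "". */
int strcmp_or_empty(const char *a, const char *b)
{
    return strcmp(a ? a : "", b ? b : "");
}
```

Here strlen_or_empty(NULL) yields 0 and strcmp_or_empty(NULL, "") yields 0, at the cost of one extra comparison per call.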



#124025

From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Date: 2017-12-08 13:41 -0800
Message-ID: <a47d8a8f-b6e7-428c-8866-084e4c8f98d8@googlegroups.com>
In reply to: #124022
On Friday, December 8, 2017 at 8:24:06 PM UTC, jacobnavia wrote:
> 
> Yes, it was rather a question of principle. NULL has many uses, and one 
> of them is "absence of data". I.e. an empty string. It is empty, no data 
> is there.
> 
Another meaning is "invalid pointer". You lose that meaning for null char *
values if you say that a null char * is the same as the empty string. The callee
is of course still free to interpret an invalid pointer as the empty string
if it makes sense, but it needs a code patch.

Formations such as strcpy(NULL, NULL) should probably be allowed and
be defined as No-ops. But that's a strcpy interface question.



#124027

From: jacobnavia <jacob@jacob.remcomp.fr>
Date: 2017-12-08 22:54 +0100
Message-ID: <p0f1mt$ubd$1@dont-email.me>
In reply to: #124025
On 08/12/2017 at 22:41, Malcolm McLean wrote:
> Formations such as strcpy(NULL, NULL) should probably be allowed and
> be defined as No-ops. But that's a strcpy interface question.

Truth table:

strcpy(str,str1) --> the same as now

strcpy(NULL,str) --> Crash. The rest of the code assumes it is working 
on a copy but no space can be obtained by strcpy, so a crash is needed.

strcpy(str,NULL) --> Sets the first byte of str to zero.

strcpy(NULL,NULL) --> No-Op.
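
The truth table above could be written as a wrapper around the standard strcpy, roughly as follows. This is a sketch of the proposal, not existing library behavior: the name strcpy_nullaware is hypothetical, abort() is one possible choice for the "crash" row, and returning NULL in the no-op case is an assumption, since the table does not say what is returned.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper implementing the proposed truth table. */
char *strcpy_nullaware(char *dst, const char *src)
{
    if (dst == NULL) {
        if (src == NULL)
            return NULL;     /* strcpy(NULL, NULL): no-op */
        abort();             /* strcpy(NULL, str): deliberate crash */
    }
    if (src == NULL) {
        dst[0] = '\0';       /* strcpy(str, NULL): dst becomes empty */
        return dst;
    }
    return strcpy(dst, src); /* both non-null: the same as now */
}
```

As with the standard function, the caller is still responsible for dst being large enough to hold the copy.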


