

Groups > comp.lang.python > #36279 > unrolled thread

Numpy outlier removal

Started by: "Joseph L. Casale" <jcasale@activenetwerx.com>
First post: 2013-01-06 19:44 +0000
Last post: 2013-01-07 02:12 +0000
Articles 20 on this page of 28 — 11 participants



Contents

  Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-06 19:44 +0000
    Re: Numpy outlier removal Hans Mulder <hansmu@xs4all.nl> - 2013-01-06 23:33 +0100
      RE: Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-06 22:50 +0000
      Re: Numpy outlier removal MRAB <python@mrabarnett.plus.com> - 2013-01-06 23:18 +0000
    Re: Numpy outlier removal Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 01:46 +0000
      Re: Numpy outlier removal "Paul Simon" <psimon@sonic.net> - 2013-01-06 18:21 -0800
      Re: Numpy outlier removal Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 02:29 +0000
        Re: Numpy outlier removal Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 05:11 +0000
          Re: Numpy outlier removal Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 15:20 +0000
            [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 17:58 +0000
              Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-08 06:43 +1100
                Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-08 02:06 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-08 17:35 +1100
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Robert Kern <robert.kern@gmail.com> - 2013-01-08 15:55 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-09 07:14 +1100
                    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-09 07:50 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Robert Kern <robert.kern@gmail.com> - 2013-01-08 22:59 +0000
              Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 22:32 +0000
                Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-08 01:23 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Terry Reedy <tjreedy@udel.edu> - 2013-01-08 04:07 -0500
                    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Maarten <maarten.sneep@knmi.nl> - 2013-01-08 08:47 -0800
                    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Maarten <maarten.sneep@knmi.nl> - 2013-01-08 08:47 -0800
                    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-09 00:02 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-08 13:50 +0000
              Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Jason Friedman <jason@powerpull.net> - 2013-01-08 19:22 -0700
              Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Jason Friedman <jason@powerpull.net> - 2013-01-08 19:23 -0700
          Re: Numpy outlier removal Robert Kern <robert.kern@gmail.com> - 2013-01-07 15:35 +0000
      RE: Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-07 02:12 +0000



#36279 — Numpy outlier removal

From: "Joseph L. Casale" <jcasale@activenetwerx.com>
Date: 2013-01-06 19:44 +0000
Subject: Numpy outlier removal
Message-ID: <mailman.179.1357501521.2939.python-list@python.org>
I have a dataset that consists of a dict with text descriptions and values that are integers. If
required, I collect the values into a list and create a numpy array running it through a simple
routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
to include.


The problem is I lose track of which were removed so the original display of the dataset is
misleading when the processed average is returned as it includes the removed key/values.


Anyone know how I can maintain the relationship and when I exclude a value, remove it from
the dict?

Thanks!
jlc



#36296

From: Hans Mulder <hansmu@xs4all.nl>
Date: 2013-01-06 23:33 +0100
Message-ID: <50e9fbd5$0$6848$e4fe514c@news2.news.xs4all.nl>
In reply to: #36279
On 6/01/13 20:44:08, Joseph L. Casale wrote:
> I have a dataset that consists of a dict with text descriptions and values that are integers. If
> required, I collect the values into a list and create a numpy array running it through a simple
> routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
> to include.
> 
> 
> The problem is I lose track of which were removed so the original display of the dataset is
> misleading when the processed average is returned as it includes the removed key/values.
> 
> 
> Anyone know how I can maintain the relationship and when I exclude a value, remove it from
> the dict?

Assuming your data and the dictionary are keyed by a common set of keys:

for key in descriptions:
    if abs(data[key] - mean(data)) >= m * std(data):
        del data[key]
        del descriptions[key]


Hope this helps,

-- HansM



#36300

From: "Joseph L. Casale" <jcasale@activenetwerx.com>
Date: 2013-01-06 22:50 +0000
Message-ID: <mailman.194.1357512697.2939.python-list@python.org>
In reply to: #36296
>Assuming your data and the dictionary are keyed by a common set of keys: 

>
>for key in descriptions:
>    if abs(data[key] - mean(data)) >= m * std(data):
>        del data[key]
>        del descriptions[key]


Heh, yeah sometimes the obvious is too simple to see. I used a dict comp to rebuild
the results with the comparison.
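
A dict comprehension along those lines might look like the sketch below. The function name `remove_outliers`, the variable `dataset`, and the sample values are assumptions made up for illustration, not the code that was actually used. The mean and standard deviation are computed once from the original values, and `values.std()` uses numpy's default population form (`ddof=0`).

```python
import numpy as np


def remove_outliers(dataset, m=2.0):
    """Return a new dict keeping entries within m std deviations of the mean."""
    values = np.array(list(dataset.values()), dtype=float)
    center = values.mean()
    spread = values.std()  # population std (ddof=0), numpy's default
    # Rebuild the dict so each key stays paired with its surviving value.
    return {key: value for key, value in dataset.items()
            if abs(value - center) < m * spread}


# Hypothetical sample data: descriptions mapped to integer values.
dataset = {"a": 10, "b": 12, "c": 11, "d": 13, "e": 9, "f": 11, "g": 250}
cleaned = remove_outliers(dataset, m=2.0)
print(sorted(cleaned))
```

Because a new dict is returned, the original `dataset` is left untouched and both versions can be displayed side by side.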


Thanks!
jlc



#36303

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-01-06 23:18 +0000
Message-ID: <mailman.196.1357514503.2939.python-list@python.org>
In reply to: #36296
On 2013-01-06 22:33, Hans Mulder wrote:
> On 6/01/13 20:44:08, Joseph L. Casale wrote:
>> I have a dataset that consists of a dict with text descriptions and values that are integers. If
>> required, I collect the values into a list and create a numpy array running it through a simple
>> routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
>> to include.
>>
>>
>> The problem is I lose track of which were removed so the original display of the dataset is
>> misleading when the processed average is returned as it includes the removed key/values.
>>
>>
>> Anyone know how I can maintain the relationship and when I exclude a value, remove it from
>> the dict?
>
> Assuming your data and the dictionary are keyed by a common set of keys:
>
> for key in descriptions:
>      if abs(data[key] - mean(data)) >= m * std(data):
>          del data[key]
>          del descriptions[key]
>
It's generally a bad idea to modify a collection over which you're
iterating. It's better to, say, make a list of what you're going to
delete and then iterate over that list to make the deletions:

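deletions = []

# First pass: collect the keys to remove without modifying the dicts.
for key in descriptions:
    if abs(data[key] - mean(data)) >= m * std(data):
        deletions.append(key)

# Second pass: delete now that iteration has finished.
for key in deletions:
    del data[key]
    del descriptions[key]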
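
To illustrate the point, here is a small self-contained sketch. The dict `descriptions` and its contents are invented for the example. Deleting from a dict while iterating over it raises `RuntimeError`, whereas iterating over a snapshot of the keys is safe.

```python
descriptions = {"a": "first", "b": "second", "c": "third"}

# Deleting during iteration over the dict itself fails.
error = None
try:
    for key in descriptions:
        if key == "a":
            del descriptions[key]
except RuntimeError as exc:
    error = str(exc)

# Iterating over a snapshot of the keys is safe.
for key in list(descriptions):
    if key == "b":
        del descriptions[key]

remaining = sorted(descriptions)
print(error)
print(remaining)
```

Taking `list(descriptions)` is a shorter alternative to building a separate deletions list when the test can be evaluated inside the loop.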



#36314

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-01-07 01:46 +0000
Message-ID: <50ea28e7$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to: #36279
On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:

> I have a dataset that consists of a dict with text descriptions and 
> values that are integers. If required, I collect the values into a list 
> and create a numpy array running it through a simple routine: 
> 
> data[abs(data - mean(data)) < m * std(data)] 
>
> where m is the number of std deviations to include.

I'm not sure that this approach is statistically robust. No, let me be 
even more assertive: I'm sure that this approach is NOT statistically 
robust, and may be scientifically dubious.

The above assumes your data is normally distributed. How sure are you 
that this is actually the case?

For normally distributed data:

Since both the mean and std calculations are affected by the presence of 
outliers, your test for what counts as an outlier will miss outliers for 
data from a normal distribution. For small N (sample size), it may be 
mathematically impossible for any data point to be greater than m*SD from 
the mean. For example, with N=5, no data point can be more than 1.789*SD 
from the mean. So for N=5, m=1 may throw away good data, and m=2 will 
fail to find any outliers no matter how outrageous they are.

For large N, you will expect to find significant numbers of data points 
more than m*SD from the mean. With N=100000, and m=3, you will expect to 
throw away 270 perfectly good data points simply because they are out on 
the tails of the distribution.
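
Both figures above can be checked with a short calculation. The sketch below assumes the sample standard deviation (`ddof=1`), for which the largest possible deviation of a single point is (N-1)/sqrt(N) standard deviations. Note that `numpy.std` defaults to `ddof=0`, where the bound is sqrt(N-1) instead. The function names are chosen for this example only.

```python
import math


def max_z_score(n, ddof=1):
    """Largest possible |x - mean| / SD for any single point among n values."""
    if ddof == 1:
        return (n - 1) / math.sqrt(n)  # sample SD
    return math.sqrt(n - 1)  # population SD (ddof=0)


def expected_beyond(n, m):
    """Expected count of normal samples lying more than m SD from the mean."""
    return n * math.erfc(m / math.sqrt(2))


print(round(max_z_score(5), 3))            # bound for N=5 with sample SD
print(round(expected_beyond(100000, 3)))   # expected count beyond 3 SD
```

With numpy's default `ddof=0` the N=5 bound is exactly 2, so a test using m=2 sits right on the boundary and is at the mercy of floating-point rounding.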

Worse, if the data is not in fact from a normal distribution, all bets 
are off. You may be keeping obvious outliers; or more often, your test 
will be throwing away perfectly good data that it misidentifies as 
outliers.

In other words: this approach for detecting outliers is nothing more than 
a very rough, and very bad, heuristic, and should be avoided.

Identifying outliers is fraught with problems even for experts. For 
example, the ozone hole over the Antarctic was ignored for many years 
because the software being used to analyse it misidentified the data as 
outliers.

The best general advice I have seen is:

Never automatically remove outliers except for values that are physically 
impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"), 
unless you have good, solid, physical reasons for justifying removal of 
outliers. Other than that, manually remove outliers with care, or not at 
all, and if you do so, always report your results twice, once with all 
the data, and once with supposed outliers removed.

You can read up more about outlier detection, and the difficulties 
thereof, here:

http://www.medcalc.org/manual/outliers.php

https://secure.graphpad.com/guides/prism/6/statistics/index.htm

http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html

http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations



-- 
Steven



#36315

From: "Paul Simon" <psimon@sonic.net>
Date: 2013-01-06 18:21 -0800
Message-ID: <50ea3199$0$80136$742ec2ed@news.sonic.net>
In reply to: #36314
"Steven D'Aprano" <steve+comp.lang.python@pearwood.info> wrote in message 
news:50ea28e7$0$30003$c3e8da3$5496439d@news.astraweb.com...
> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.
>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.
>
> The above assumes your data is normally distributed. How sure are you
> that this is actually the case?
>
> For normally distributed data:
>
> Since both the mean and std calculations are affected by the presence of
> outliers, your test for what counts as an outlier will miss outliers for
> data from a normal distribution. For small N (sample size), it may be
> mathematically impossible for any data point to be greater than m*SD from
> the mean. For example, with N=5, no data point can be more than 1.789*SD
> from the mean. So for N=5, m=1 may throw away good data, and m=2 will
> fail to find any outliers no matter how outrageous they are.
>
> For large N, you will expect to find significant numbers of data points
> more than m*SD from the mean. With N=100000, and m=3, you will expect to
> throw away 270 perfectly good data points simply because they are out on
> the tails of the distribution.
>
> Worse, if the data is not in fact from a normal distribution, all bets
> are off. You may be keeping obvious outliers; or more often, your test
> will be throwing away perfectly good data that it misidentifies as
> outliers.
>
> In other words: this approach for detecting outliers is nothing more than
> a very rough, and very bad, heuristic, and should be avoided.
>
> Identifying outliers is fraught with problems even for experts. For
> example, the ozone hole over the Antarctic was ignored for many years
> because the software being used to analyse it misidentified the data as
> outliers.
>
> The best general advice I have seen is:
>
> Never automatically remove outliers except for values that are physically
> impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
> unless you have good, solid, physical reasons for justifying removal of
> outliers. Other than that, manually remove outliers with care, or not at
> all, and if you do so, always report your results twice, once with all
> the data, and once with supposed outliers removed.
>
> You can read up more about outlier detection, and the difficulties
> thereof, here:
>
> http://www.medcalc.org/manual/outliers.php
>
> https://secure.graphpad.com/guides/prism/6/statistics/index.htm
>
> http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html
>
> http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations
>
>
>
> -- 
> Steven

If you suspect that the data may not be normal, you might look at exploratory 
data analysis; see Tukey.  It's descriptive rather than analytic, treats 
outliers respectfully, uses the median rather than the mean, and is very visual. 
Whenever I analyzed data with both Gaussian methods and EDA, EDA always won.
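
One well-known EDA-style rule is Tukey's fences, which flag points lying more than 1.5 interquartile ranges outside the quartiles. A minimal sketch, assuming the conventional multiplier k=1.5 and made-up data. It only flags candidates for inspection and does not delete anything.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences based on the quartiles and the IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


values = np.array([9, 10, 11, 11, 12, 13, 250])  # hypothetical data
lower, upper = tukey_fences(values)
flagged = values[(values < lower) | (values > upper)]
print(lower, upper, flagged)
```

Since the quartiles barely react to a single extreme value, the fences stay close to the bulk of the data even when the mean and standard deviation do not.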

Paul 



#36316

From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: 2013-01-07 02:29 +0000
Message-ID: <mailman.205.1357525775.2939.python-list@python.org>
In reply to: #36314
On 7 January 2013 01:46, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.
>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.

Whether or not this is "statistically robust" requires more
explanation about the OP's intention. Thus far, the OP has not given
any reason/motivation for excluding data or even for having any data
in the first place! It's hard to say whether any technique applied is
really accurate/robust without knowing *anything* about the purpose of
the operation.


Oscar



#36321

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-01-07 05:11 +0000
Message-ID: <50ea58f0$0$21851$c3e8da3$76491128@news.astraweb.com>
In reply to: #36316
On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:

> On 7 January 2013 01:46, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>
>>> I have a dataset that consists of a dict with text descriptions and
>>> values that are integers. If required, I collect the values into a
>>> list and create a numpy array running it through a simple routine:
>>>
>>> data[abs(data - mean(data)) < m * std(data)]
>>>
>>> where m is the number of std deviations to include.
>>
>> I'm not sure that this approach is statistically robust. No, let me be
>> even more assertive: I'm sure that this approach is NOT statistically
>> robust, and may be scientifically dubious.
> 
> Whether or not this is "statistically robust" requires more explanation
> about the OP's intention. 

Not really. Statistical robustness is objectively defined, and the user's 
intention doesn't come into it. The mean is not a robust measure of 
central tendency, the median is, regardless of why you pick one or the 
other.
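
The difference is easy to see numerically. In this sketch (the data is invented for illustration) one wild value drags the mean far from the bulk of the data while the median does not move.

```python
import statistics

clean = [9, 10, 11, 11, 12, 13]
contaminated = clean + [250]  # one wild value added

# How far each statistic moves when the wild value is included.
mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
median_shift = statistics.median(contaminated) - statistics.median(clean)
print(mean_shift, median_shift)
```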

There are sometimes good reasons for choosing non-robust statistics or 
techniques over robust ones, but some techniques are so dodgy that there 
is *never* a good reason for doing so. E.g. finding the line of best fit 
by eye, or taking more and more samples until you get a statistically 
significant result. Such techniques are not just non-robust in the 
statistical sense, but non-robust in the general sense, if not outright 
deceitful.



-- 
Steven



#36355

From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: 2013-01-07 15:20 +0000
Message-ID: <mailman.223.1357572059.2939.python-list@python.org>
In reply to: #36321
On 7 January 2013 05:11, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
>
>> On 7 January 2013 01:46, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>>
>>> I'm not sure that this approach is statistically robust. No, let me be
>>> even more assertive: I'm sure that this approach is NOT statistically
>>> robust, and may be scientifically dubious.
>>
>> Whether or not this is "statistically robust" requires more explanation
>> about the OP's intention.
>
> Not really. Statistical robustness is objectively defined, and the user's
> intention doesn't come into it. The mean is not a robust measure of
> central tendency, the median is, regardless of why you pick one or the
> other.

Okay, I see what you mean. I wasn't thinking of robustness as a
technical term but now I see that you are correct.

Perhaps what I should have said is that whether or not this matters
depends on the problem at hand (hopefully this isn't an important
medical trial) and the particular type of data that you have; assuming
normality is fine in many cases even if the data is not "really"
normal.

>
> There are sometimes good reasons for choosing non-robust statistics or
> techniques over robust ones, but some techniques are so dodgy that there
> is *never* a good reason for doing so. E.g. finding the line of best fit
> by eye, or taking more and more samples until you get a statistically
> significant result. Such techniques are not just non-robust in the
> statistical sense, but non-robust in the general sense, if not outright
> deceitful.

There are sometimes good reasons to get a line of best fit by eye. In
particular if your data contains clusters that are hard to separate,
sometimes it's useful to just pick out roughly where you think a line
through a subset of the data is.


Oscar



#36368 — [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-01-07 17:58 +0000
Subject: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <50eb0cd2$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to: #36355
On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:

> There are sometimes good reasons to get a line of best fit by eye. In
> particular if your data contains clusters that are hard to separate,
> sometimes it's useful to just pick out roughly where you think a line
> through a subset of the data is.

Cherry picking subsets of your data as well as line fitting by eye? Two 
wrongs do not make a right.

If you're going to just invent a line based on where you think it should 
be, what do you need the data for? Just declare "this is the line I wish 
to believe in" and save yourself the time and energy of collecting the 
data in the first place. Your conclusion will be no less valid.

How do you distinguish between "data contains clusters that are hard to 
separate" from "data doesn't fit a line at all"?

Even if the data actually is linear, on what basis could we distinguish 
between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit 
by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely 
subjective judgement can be equally denied on the basis of subjective 
judgement.
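
By contrast, a least-squares fit is reproducible: anyone running it on the same data gets the same line. A minimal sketch using `numpy.polyfit`, with points invented for the example. They lie exactly on y = 2.5x + 3.7, so the fit should recover those coefficients up to rounding.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.5 * x + 3.7  # hypothetical noise-free data

# Degree-1 polynomial fit; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 3), round(intercept, 3))
```

Real data would be noisy, and the residuals of such a fit give an objective basis for asking whether a line is a sensible model at all.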

Anyone can fool themselves into placing a line through a subset of non-
linear data. Or, sadly more often, *deliberately* cherry picking fake 
clusters in order to fool others. Here is a real world example of what 
happens when people pick out the data clusters that they like based on 
visual inspection:

http://www.skepticalscience.com/images/TempEscalator.gif

And not linear by any means, but related to the cherry picking theme:

http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif


To put it another way, when we fit patterns to data by eye, we can easily 
fool ourselves into seeing patterns that aren't there, or missing the 
patterns which are there. At best line fitting by eye is prone to honest 
errors; at worst, it is open to the most deliberate abuse. We have eyes 
and brains that evolved to spot the ripe fruit in trees, not to spot 
linear trends in noisy data, and fitting by eye is not safe or 
appropriate.


-- 
Steven



#36373 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-01-08 06:43 +1100
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <mailman.237.1357587833.2939.python-list@python.org>
In reply to: #36368
On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Anyone can fool themselves into placing a line through a subset of non-
> linear data. Or, sadly more often, *deliberately* cherry picking fake
> clusters in order to fool others. Here is a real world example of what
> happens when people pick out the data clusters that they like based on
> visual inspection:
>
> http://www.skepticalscience.com/images/TempEscalator.gif

And sensible people will notice that, even drawn like that, it's only
a ~0.6 deg increase across ~30 years. Hardly statistically
significant, given that weather patterns have been known to follow
cycles at least that long. But that's nothing to do with drawing lines
through points, and more to do with how much data you collect before
you announce a conclusion, and how easily a graph can prove any point
you like.

Statistical analysis is a huge science. So is lying. And I'm not sure
most people can pick one from the other.

ChrisA



#36403 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-01-08 02:06 +0000
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <50eb7f2a$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to: #36373
On Tue, 08 Jan 2013 06:43:46 +1100, Chris Angelico wrote:

> On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> Anyone can fool themselves into placing a line through a subset of non-
>> linear data. Or, sadly more often, *deliberately* cherry picking fake
>> clusters in order to fool others. Here is a real world example of what
>> happens when people pick out the data clusters that they like based on
>> visual inspection:
>>
>> http://www.skepticalscience.com/images/TempEscalator.gif
> 
> And sensible people will notice that, even drawn like that, it's only a
> ~0.6 deg increase across ~30 years. Hardly statistically significant,

Well, I don't know about "sensible people", but the magnitude of an effect 
has little to do with whether or not it is statistically significant. 
Given noisy data, statistical significance relates to whether or not we 
can be confident that the effect is *real*, not whether it is a big 
effect or a small effect.

Here's an example: assume that you are on a fixed salary with a constant 
weekly income. If you happen to win the lottery one day, and consequently 
your income for that week quadruples, that is a large effect that fails 
to have any statistical significance -- it's a blip, not part of any long-
term change in income. You can't conclude that you'll win the lottery 
every week from now on.

On the other hand, if the government changes the rules relating to tax, 
deductions, etc., even by a small amount, your weekly income might go 
down, or up, by a single dollar. Even though that is a tiny effect, it is 
*not* a blip, and will be statistically significant. In practice, it 
takes a certain number of data points to reach that confidence level. 
Your accountant, who knows the tax laws, will conclude that the change is 
real immediately, but a statistician who sees only the pay slips may take 
some months before she is convinced that the change is signal rather than 
noise. With only three weeks pay slips in hand, the statistician cannot 
be sure that the difference is not just some accounting error or other 
fluke, but each additional data point increases the confidence that the 
difference is real and not just some temporary aberration.
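
The way confidence grows with the number of pay slips can be sketched with a two-sample z statistic. The numbers below (a $1 shift, week-to-week noise with a standard deviation of $2) are assumptions made up for the illustration, the function names are hypothetical, and 1.96 is the usual two-sided 5% threshold.

```python
import math


def z_statistic(shift, sigma, n):
    """z for a difference in means between two groups of n observations each."""
    return shift / (sigma * math.sqrt(2.0 / n))


def weeks_needed(shift, sigma, threshold=1.96):
    """Smallest n per group at which the expected z reaches the threshold."""
    if shift <= 0 or sigma <= 0:
        raise ValueError("shift and sigma must be positive")
    n = 1
    while z_statistic(shift, sigma, n) < threshold:
        n += 1
    return n


few = z_statistic(1.0, 2.0, 3)      # three pay slips before and after
needed = weeks_needed(1.0, 2.0)     # pay slips needed on each side
print(few, needed)
```

Under these assumed numbers, three pay slips on each side fall well short of the threshold, and it takes months of data for the same $1 change to stand out from the noise.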

The other meaning of "significant" has nothing to do with statistics, and 
everything to do with "a difference is only a difference if it makes a 
difference". 0.2° per decade doesn't sound like much, not when we 
consider daily or yearly temperatures that typically have a range of tens 
of degrees between night and day, or winter and summer. But that is 
misunderstanding the nature of long-term climate versus daily weather and 
glossing over the fact that we're only talking about an average and 
ignoring changes to the variability of the climate: a small increase in 
average can lead to a large increase in extreme events.


> given that weather patterns have been known to follow cycles at least
> that long.

That is not a given. "Weather patterns" don't last for thirty years. 
Perhaps you are talking about climate patterns? In which case, well, yes, 
we can see a very strong climate pattern of warming on a time scale of 
decades, with no evidence that it is a cycle.

There are, of course, many climate cycles that take place on a time frame 
of years or decades, such as the North Atlantic Oscillation and the El 
Niño Southern Oscillation. None of them are global, and as far as I know 
none of them are exactly periodic. They are noise in the system, and 
certainly not responsible for linear trends.



-- 
Steven



#36416 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-01-08 17:35 +1100
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <mailman.263.1357626941.2939.python-list@python.org>
In reply to: #36403
On Tue, Jan 8, 2013 at 1:06 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> given that weather patterns have been known to follow cycles at least
>> that long.
>
> That is not a given. "Weather patterns" don't last for thirty years.
> Perhaps you are talking about climate patterns?

Yes, that's what I meant. In any case, debate about global warming is
quite tangential to the point about statistical validity; it looks
quite significant to show a line going from the bottom of the graph to
the top, but sounds a lot less noteworthy when you see it as a
half-degree increase on about (I think?) 30 degrees, and even less
when you measure temperatures in absolute scale (Kelvin) and it's half
a degree in three hundred. Those are principles worth considering,
regardless of the subject matter. If your railway tracks have widened
by a full eight millimeters due to increased pounding from heavier
vehicles travelling over it, that's significant and dangerous on
HO-scale model trains, but utterly insignificant on 5'3" gauge.

ChrisA



#36437 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Robert Kern <robert.kern@gmail.com>
Date: 2013-01-08 15:55 +0000
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <mailman.281.1357660546.2939.python-list@python.org>
In reply to: #36403
On 08/01/2013 06:35, Chris Angelico wrote:
> On Tue, Jan 8, 2013 at 1:06 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>>> given that weather patterns have been known to follow cycles at least
>>> that long.
>>
>> That is not a given. "Weather patterns" don't last for thirty years.
>> Perhaps you are talking about climate patterns?
>
> Yes, that's what I meant. In any case, debate about global warming is
> quite tangential to the point about statistical validity; it looks
> quite significant to show a line going from the bottom of the graph to
> the top, but sounds a lot less noteworthy when you see it as a
> half-degree increase on about (I think?) 30 degrees, and even less
> when you measure temperatures in absolute scale (Kelvin) and it's half
> a degree in three hundred.

Why on Earth do you think that the distance from nominal surface temperatures to 
freezing, much less absolute 0, is the right scale to compare global warming 
changes against? You need to compare against the size of global mean temperature 
changes that would cause large amounts of human suffering, and that scale is on 
the order of a *few* degrees, not hundreds. A change of half a degree over a few 
decades with no signs of slowing down *should* be alarming.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco



#36449 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-01-09 07:14 +1100
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <mailman.289.1357676101.2939.python-list@python.org>
In reply to: #36403
On Wed, Jan 9, 2013 at 2:55 AM, Robert Kern <robert.kern@gmail.com> wrote:
> On 08/01/2013 06:35, Chris Angelico wrote:
>> ... it looks
>> quite significant to show a line going from the bottom of the graph to
>> the top, but sounds a lot less noteworthy when you see it as a
>> half-degree increase on about (I think?) 30 degrees, and even less
>> when you measure temperatures in absolute scale (Kelvin) and it's half
>> a degree in three hundred.
>
> Why on Earth do you think that the distance from nominal surface
> temperatures to freezing, much less absolute 0, is the right scale to compare
> global warming changes against? You need to compare against the size of
> global mean temperature changes that would cause large amounts of human
> suffering, and that scale is on the order of a *few* degrees, not hundreds.
> A change of half a degree over a few decades with no signs of slowing down
> *should* be alarming.

I didn't say what it should be; I gave three examples. And as I said,
this is not the forum to debate climate change; I was just using it as
an example of statistical reporting.

Three types of lies.

ChrisA


#36474 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-01-09 07:50 +0000
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <50ed2154$0$29898$c3e8da3$5496439d@news.astraweb.com>
In reply to: #36449

On Wed, 09 Jan 2013 07:14:51 +1100, Chris Angelico wrote:

> Three types of lies.

Oh, surely more than that.

White lies. 

Regular or garden variety lies.

Malicious lies.

Accidental or innocent lies.

FUD -- "fear, uncertainty, doubt".

Half-truths.

Lying by omission.

Exaggeration and understatement.

Propaganda.

Misinformation.

Disinformation.

Deceit by emphasis.

And manufactured doubt.

E.g. the decades-long campaign by the tobacco companies to deny that 
tobacco products caused cancer, when their own scientists were telling 
them that they did. Having learnt how valuable falsehoods are, those same 
manufacturers of doubt went on to sell their services to those who wanted 
to deny that CFCs destroyed ozone, and that CO2 causes warming. 


The old saw about "lies, damned lies and statistics" reminds me very much 
of a quote from Homer Simpson:

"Pfff, facts, you can prove anything that's even remotely true with 
facts!"



-- 
Steven


#36457 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Robert Kern <robert.kern@gmail.com>
Date: 2013-01-08 22:59 +0000
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <mailman.296.1357685961.2939.python-list@python.org>
In reply to: #36403

On 08/01/2013 20:14, Chris Angelico wrote:
> On Wed, Jan 9, 2013 at 2:55 AM, Robert Kern <robert.kern@gmail.com> wrote:
>> On 08/01/2013 06:35, Chris Angelico wrote:
>>> ... it looks
>>> quite significant to show a line going from the bottom of the graph to
>>> the top, but sounds a lot less noteworthy when you see it as a
>>> half-degree increase on about (I think?) 30 degrees, and even less
>>> when you measure temperatures in absolute scale (Kelvin) and it's half
>>> a degree in three hundred.
>>
>> Why on Earth do you think that the distance from nominal surface
>> temperatures to freezing much less absolute 0 is the right scale to compare
>> global warming changes against? You need to compare against the size of
>> global mean temperature changes that would cause large amounts of human
>> suffering, and that scale is on the order of a *few* degrees, not hundreds.
>> A change of half a degree over a few decades with no signs of slowing down
>> *should* be alarming.
>
> I didn't say what it should be;

Actually, you did. You stated that "a ~0.6 deg increase across ~30 years [is 
h]ardly statistically significant". Ignoring the confusion between statistical 
significance and practical significance (as external criteria like the 
difference between the nominal temp and absolute 0 or the right criteria that I 
mentioned have nothing to do with statistical significance), you made a positive 
claim that it wasn't significant.
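
To illustrate the distinction, here is a minimal sketch of one way to test a trend for statistical significance: a permutation test on the least-squares slope. The series is synthetic, not real measurements. The 0.02 per year trend matches the "~0.6 deg across ~30 years" figure quoted above, but the noise level of 0.1 and all function names are assumptions made for the example.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Fraction of shuffled datasets whose |slope| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials


# Synthetic series, NOT real data: assumed trend of 0.02/year plus assumed noise.
rng = random.Random(42)
years = list(range(30))
celsius = [15.0 + 0.02 * t + rng.gauss(0.0, 0.1) for t in years]
kelvin = [c + 273.15 for c in celsius]  # same series on an absolute scale

p_celsius = permutation_p_value(years, celsius)
p_kelvin = permutation_p_value(years, kelvin)
print("slope (C):", slope(years, celsius), "p:", p_celsius)
print("slope (K):", slope(years, kelvin), "p:", p_kelvin)
```

Adding a constant offset changes neither the slope nor the test result, so the distance to absolute zero cannot enter into statistical significance; only the trend relative to the scatter does. A serious analysis would also have to use real data and account for correlation between successive years, which this sketch ignores.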

> I gave three examples.

You gave negligently incorrect ones. Whether your comments were on topic or not, 
you deserve to be called on them when they are wrong.

> And as I said,
> this is not the forum to debate climate change; I was just using it as
> an example of statistical reporting.
>
> Three types of lies.

FUD is a fourth.

-- 
Robert Kern



#36377 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: 2013-01-07 22:32 +0000
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <mailman.239.1357597977.2939.python-list@python.org>
In reply to: #36368

On 7 January 2013 17:58, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:
>
>> There are sometimes good reasons to get a line of best fit by eye. In
>> particular if your data contains clusters that are hard to separate,
>> sometimes it's useful to just pick out roughly where you think a line
>> through a subset of the data is.
>
> Cherry picking subsets of your data as well as line fitting by eye? Two
> wrongs do not make a right.

It depends on what you're doing, though. I wouldn't use an eyeball fit
to get numbers that were an important part of the conclusion of some
or other study. I would very often use it while I'm just in the
process of trying to understand something.

> If you're going to just invent a line based on where you think it should
> be, what do you need the data for? Just declare "this is the line I wish
> to believe in" and save yourself the time and energy of collecting the
> data in the first place. Your conclusion will be no less valid.

An example: Earlier today I was looking at some experimental data. A
simple model of the process underlying the experiment suggests that
two variables x and y will vary in direct proportion to one another
and the data broadly reflects this. However, at this stage there is
some non-normal variability in the data, caused by experimental
difficulties. A subset of the data appears to closely follow a well
defined linear pattern but there are outliers and the pattern breaks
down in an asymmetric way at larger x and y values. At some later time
either the sources of experimental variation will be reduced, or they
will be better understood but for now it is still useful to estimate
the constant of proportionality in order to check whether it seems
consistent with the observed values of z. With this particular dataset
I would have wasted a lot of time if I had tried to find a
computational method to match the line that to me was very visible so
I chose the line visually.

>
> How do you distinguish between "data contains clusters that are hard to
> separate" from "data doesn't fit a line at all"?
>

In the example I gave it isn't possible to make that distinction with
the currently available data. That doesn't make it meaningless to try
and estimate the parameters of the relationship between the variables
using the preliminary data.

> Even if the data actually is linear, on what basis could we distinguish
> between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
> by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
> subjective judgement can be equally denied on the basis of subjective
> judgement.

It gets a bit easier if the line is constrained to go through the
origin. You seem to be thinking that the important thing is proving
that the line is "real", rather than identifying where it is. Both
things are important but not necessarily in the same problem. In my
example, the "real line" may not be straight and may not go through
the origin, but it is definitely there and if there were no
experimental problems then the data would all be very close to it.

> Anyone can fool themselves into placing a line through a subset of non-
> linear data. Or, sadly more often, *deliberately* cherry picking fake
> clusters in order to fool others. Here is a real world example of what
> happens when people pick out the data clusters that they like based on
> visual inspection:
>
> http://www.skepticalscience.com/images/TempEscalator.gif
>
> And not linear by any means, but related to the cherry picking theme:
>
> http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif
>
>
> To put it another way, when we fit patterns to data by eye, we can easily
> fool ourselves into seeing patterns that aren't there, or missing the
> patterns which are there. At best line fitting by eye is prone to honest
> errors; at worst, it is open to the most deliberate abuse. We have eyes
> and brains that evolved to spot the ripe fruit in trees, not to spot
> linear trends in noisy data, and fitting by eye is not safe or
> appropriate.

This is all true. But the human brain is also in many ways much better
than a typical computer program at recognising patterns in data when
the data can be depicted visually. I would very rarely attempt to
analyse data without representing it in some visual form. I also think
it would be highly foolish to go so far with refusing to eyeball data
that you would accept the output of some regression algorithm even
when it clearly looks wrong.


Oscar


#36397 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-01-08 01:23 +0000
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <50eb7513$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to: #36377

On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:

> An example: Earlier today I was looking at some experimental data. A
> simple model of the process underlying the experiment suggests that two
> variables x and y will vary in direct proportion to one another and the
> data broadly reflects this. However, at this stage there is some
> non-normal variability in the data, caused by experimental difficulties.
> A subset of the data appears to closely follow a well defined linear
> pattern but there are outliers and the pattern breaks down in an
> asymmetric way at larger x and y values. At some later time either the
> sources of experimental variation will be reduced, or they will be
> better understood but for now it is still useful to estimate the
> constant of proportionality in order to check whether it seems
> consistent with the observed values of z. With this particular dataset I
> would have wasted a lot of time if I had tried to find a computational
> method to match the line that to me was very visible so I chose the line
> visually.


If you mean:

"I looked at the data, identified that the range a < x < b looks linear 
and the range x > b does not, then used least squares (or some other 
recognised, objective technique for fitting a line) to the data in that 
linear range"

then I'm completely cool with that. That's fine, with the understanding 
that this is the first step in either fixing your measurement problems, 
fixing your model, or at least avoiding extrapolation into the non-linear 
range.

But that is not fitting a line by eye, which is what I am talking about.
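
A minimal sketch of the first approach, assuming the linear range has already been chosen by inspection: keep only the points with a < x < b and hand them to ordinary least squares. The data, the bounds and the function names are all invented for the example.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, my - k * mx


def fit_in_range(xs, ys, a, b):
    """Fit only the points with a < x < b, the range judged linear by eye."""
    kept = [(x, y) for x, y in zip(xs, ys) if a < x < b]
    return least_squares_line([x for x, _ in kept], [y for _, y in kept])


# Made-up data: exactly y = 2x + 1 up to x = 5, then flattening off.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 11.5, 11.8, 12.0]

k_range, c_range = fit_in_range(xs, ys, 0, 5.5)  # restricted to the linear range
k_all, c_all = least_squares_line(xs, ys)        # everything, for comparison

print("linear range only:", k_range, c_range)
print("all points:       ", k_all, c_all)
```

The eye chooses the range; the arithmetic chooses the line. The fit over all points is dragged down by the flattened tail, which is the extrapolation hazard mentioned above.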

If on the other hand you mean:

"I looked at the data, identified that the range a < x < b looked linear, 
so I laid a ruler down over the graph and pushed it around until I was 
satisfied that the ruler looked more or less like it fitted the data 
points, according to my guess of what counts as a close fit"

that *is* fitting a line by eye, and it is entirely subjective and 
extremely dodgy for anything beyond quick and dirty back of the envelope 
calculations[1]. That's okay if all you want is to get something within 
an order of magnitude or so, or a line roughly pointing in the right 
direction, but that's all.


[...]
> I also think it would
> be highly foolish to go so far with refusing to eyeball data that you
> would accept the output of some regression algorithm even when it
> clearly looks wrong.

I never said anything of the sort.

I said, don't fit lines to data by eye. I didn't say not to sanity-check 
that your straight-line fit is reasonable by eyeballing it.



[1] Or if your data is so accurate and noise-free that you hardly have to 
care about errors, since there clearly is one and only one straight line 
that passes through all the points.


-- 
Steven


#36421 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From: Terry Reedy <tjreedy@udel.edu>
Date: 2013-01-08 04:07 -0500
Subject: Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID: <mailman.268.1357636078.2939.python-list@python.org>
In reply to: #36397

On 1/7/2013 8:23 PM, Steven D'Aprano wrote:
> On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:
>
>> An example: Earlier today I was looking at some experimental data. A
>> simple model of the process underlying the experiment suggests that two
>> variables x and y will vary in direct proportion to one another and the
>> data broadly reflects this. However, at this stage there is some
>> non-normal variability in the data, caused by experimental difficulties.
>> A subset of the data appears to closely follow a well defined linear
>> pattern but there are outliers and the pattern breaks down in an
>> asymmetric way at larger x and y values. At some later time either the
>> sources of experimental variation will be reduced, or they will be
>> better understood but for now it is still useful to estimate the
>> constant of proportionality in order to check whether it seems
>> consistent with the observed values of z. With this particular dataset I
>> would have wasted a lot of time if I had tried to find a computational
>> method to match the line that to me was very visible so I chose the line
>> visually.
>
>
> If you mean:
>
> "I looked at the data, identified that the range a < x < b looks linear
> and the range x > b does not, then used least squares (or some other
> recognised, objective technique for fitting a line) to the data in that
> linear range"
>
> then I'm completely cool with that.

If both x and y are measured values, then regressing x on y and y on x 
will give different answers and both will be wrong in that *neither* 
will be the best answer for the relationship between them. Oscar did not 
specify whether either was an experimentally set input variable.
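
A small sketch of the asymmetry, using made-up numbers with scatter in both variables (the data and function name are assumptions for illustration). Regressing y on x minimises vertical errors; regressing x on y minimises horizontal errors, and inverting that slope gives a different line through the same points.

```python
def ols_slope(us, vs):
    """Slope of the least-squares regression of vs on us."""
    n = len(us)
    mu = sum(us) / n
    mv = sum(vs) / n
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    return suv / suu


# Made-up measurements with scatter in both variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.4, 1.7, 3.5, 3.6, 5.4, 5.7]

slope_y_on_x = ols_slope(xs, ys)        # dy/dx from minimising vertical errors
slope_x_on_y = 1.0 / ols_slope(ys, xs)  # dy/dx implied by minimising horizontal errors

print("y on x:          ", slope_y_on_x)
print("x on y, inverted:", slope_x_on_y)
```

The two slopes coincide only when the points lie exactly on a line; otherwise the y-on-x slope is the shallower of the two.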

> But that is not fitting a line by eye, which is what I am talking about.

With the line constrained to go through 0,0, a line eyeballed with a 
clear ruler could easily be better than either regression line, as a 
human will tend to minimize the deviations *perpendicular to the line*, 
which is the proper thing to do (assuming both variables are measured in 
the same units).
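
What the eye approximates here can also be computed. The sketch below fits a line through the origin by minimising the squared perpendicular distances, using the standard closed form for the angle of the best direction. The data and names are made up for illustration, and the method is only meaningful under the same-units assumption stated above.

```python
import math


def perpendicular_slope_through_origin(xs, ys):
    """Slope of the line y = k * x minimising squared perpendicular distances.

    Assumes the best-fitting line is not vertical.
    """
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Angle of the direction that minimises the perpendicular scatter.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.tan(theta)


def perpendicular_ss(k, xs, ys):
    """Sum of squared perpendicular distances from the points to y = k * x."""
    return sum((y - k * x) ** 2 for x, y in zip(xs, ys)) / (1.0 + k * k)


# Made-up measurements with scatter in both variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.4, 1.7, 3.5, 3.6, 5.4, 5.7]

k_perp = perpendicular_slope_through_origin(xs, ys)

# The two through-origin regression slopes, for comparison.
k_y_on_x = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
k_x_on_y = sum(y * y for y in ys) / sum(x * y for x, y in zip(xs, ys))

for label, k in [("perpendicular", k_perp), ("y on x", k_y_on_x), ("x on y", k_x_on_y)]:
    print(f"{label}: k = {k:.5f}, perpendicular SS = {perpendicular_ss(k, xs, ys):.5f}")
```

For this data the perpendicular fit lands between the two regression slopes, which is consistent with the claim that an eyeballed line can beat either of them.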

-- 
Terry Jan Reedy
