Numpy outlier removal

Started by	"Joseph L. Casale" <jcasale@activenetwerx.com>
First post	2013-01-06 19:44 +0000
Last post	2013-01-07 02:12 +0000
Articles	8 on this page of 28 — 11 participants

  Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-06 19:44 +0000
    Re: Numpy outlier removal Hans Mulder <hansmu@xs4all.nl> - 2013-01-06 23:33 +0100
      RE: Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-06 22:50 +0000
      Re: Numpy outlier removal MRAB <python@mrabarnett.plus.com> - 2013-01-06 23:18 +0000
    Re: Numpy outlier removal Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 01:46 +0000
      Re: Numpy outlier removal "Paul Simon" <psimon@sonic.net> - 2013-01-06 18:21 -0800
      Re: Numpy outlier removal Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 02:29 +0000
        Re: Numpy outlier removal Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 05:11 +0000
          Re: Numpy outlier removal Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 15:20 +0000
            [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 17:58 +0000
              Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-08 06:43 +1100
                Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-08 02:06 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-08 17:35 +1100
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Robert Kern <robert.kern@gmail.com> - 2013-01-08 15:55 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-09 07:14 +1100
                    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-09 07:50 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Robert Kern <robert.kern@gmail.com> - 2013-01-08 22:59 +0000
              Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 22:32 +0000
                Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-08 01:23 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Terry Reedy <tjreedy@udel.edu> - 2013-01-08 04:07 -0500
                    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Maarten <maarten.sneep@knmi.nl> - 2013-01-08 08:47 -0800
                    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Maarten <maarten.sneep@knmi.nl> - 2013-01-08 08:47 -0800
                    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-09 00:02 +0000
                  Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-08 13:50 +0000
              Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Jason Friedman <jason@powerpull.net> - 2013-01-08 19:22 -0700
              Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Jason Friedman <jason@powerpull.net> - 2013-01-08 19:23 -0700
          Re: Numpy outlier removal Robert Kern <robert.kern@gmail.com> - 2013-01-07 15:35 +0000
      RE: Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-07 02:12 +0000

#36441 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From	Maarten <maarten.sneep@knmi.nl>
Date	2013-01-08 08:47 -0800
Subject	Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID	<7dafc98b-99c9-4727-bdb5-087dc846546c@googlegroups.com>
In reply to	#36421

On Tuesday, January 8, 2013 10:07:08 AM UTC+1, Terry Reedy wrote:

> With the line constrained to go through 0,0, a line eyeballed with a 
> clear ruler could easily be better than either regression line, as a 
> human will tend to minimize the deviations *perpendicular to the line*,
> which is the proper thing to do (assuming both variables are measured in 
> the same units).

In that case use an appropriate algorithm to perform the fit. ODR comes to mind. http://docs.scipy.org/doc/scipy/reference/odr.html

Maarten
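
A minimal sketch of what the ODR suggestion above might look like for the case under discussion, a line constrained to pass through (0, 0). It uses `scipy.odr` (`Data`, `Model`, `ODR`); the sample data, the helper name `fit_odr_slope`, and the starting guess are illustrative assumptions, not anything taken from the thread.

```python
import numpy as np
from scipy import odr


def line_through_origin(beta, x):
    """Model y = beta[0] * x: a straight line forced through (0, 0)."""
    return beta[0] * x


def fit_odr_slope(x, y, slope_guess=1.0):
    """Fit the slope by orthogonal distance regression (hypothetical helper).

    With no weights supplied, errors in x and y are treated equally, which
    matches the "both variables in the same units" assumption in the quote.
    """
    data = odr.Data(np.asarray(x, dtype=float), np.asarray(y, dtype=float))
    model = odr.Model(line_through_origin)
    result = odr.ODR(data, model, beta0=[slope_guess]).run()
    return result.beta[0]


# Made-up sample data lying roughly along y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope = fit_odr_slope(x, y)
print("ODR slope through the origin:", slope)
```

If the uncertainties in x and y differ, `odr.RealData` accepts `sx` and `sy` standard deviations instead of assuming equal weights.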


#36460 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-01-09 00:02 +0000
Subject	Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID	<50ecb382$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to	#36421

On Tue, 08 Jan 2013 04:07:08 -0500, Terry Reedy wrote:

>> But that is not fitting a line by eye, which is what I am talking
>> about.
> 
> With the line constrained to go through 0,0 a line eyeballed with a 
> clear ruler could easily be better than either regression line, as a
> human will tend to minimize the deviations *perpendicular to the line*,
> which is the proper thing to do (assuming both variables are measured
> in the same units).

It is conventional to talk about "residuals" rather than deviations.

And it could even more easily be worse than a regression line. And since
eyeballing is entirely subjective and impossible to objectively verify,
the line that you claim minimizes the residuals might be very different
from the line that I claim minimizes the residuals, with no way to decide
between the two claims.

In any case, there is a technique for working out a least squares linear
fit using perpendicular offsets rather than the vertical offsets of
ordinary least squares (OLS):

http://mathworld.wolfram.com/LeastSquaresFittingPerpendicularOffsets.html

but in general, if you have to care about errors in the independent
variable, you're better off using a more powerful technique than just OLS.
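
As a rough illustration of the difference between vertical and perpendicular offsets, here is a sketch for the through-origin case discussed in this subthread. It is a special case, not the general formula on the MathWorld page, which also fits an intercept. The function names and sample data are made up for the example. The perpendicular-offset slope comes from the direction that maximizes the projected sum of squares, which for a line through the origin has the closed form theta = 0.5 * atan2(2*Sxy, Sxx - Syy).

```python
import math


def ols_slope_through_origin(xs, ys):
    """Slope minimizing squared *vertical* offsets for y = m * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx


def perpendicular_slope_through_origin(xs, ys):
    """Slope minimizing squared *perpendicular* offsets for y = m * x.

    Assumes the best line is not vertical; a vertical line would give an
    angle of pi/2 and an effectively infinite slope.
    """
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.tan(theta)


def perpendicular_cost(m, xs, ys):
    """Sum of squared perpendicular distances from the points to y = m * x."""
    return sum((y - m * x) ** 2 for x, y in zip(xs, ys)) / (1.0 + m * m)


# Made-up noisy data scattered around y = x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.3]

print("vertical-offset slope:     ", ols_slope_through_origin(xs, ys))
print("perpendicular-offset slope:", perpendicular_slope_through_origin(xs, ys))
```

Either slope is reproducible from the data alone, which is the property an eyeballed line lacks.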

The point I keep making, that everybody seems to be ignoring, is that 
eyeballing a line of best fit is subjective, unreliable and impossible to 
verify. How could I check that the line you say is the "best fit" 
actually *is* the *best fit* for the given data, given that you picked 
that line by eye? Chances are good that if you came back to the data a 
month later, you'd pick a different line!

As I have said, eyeballing a line is fine for rough back-of-the-envelope
calculations, where you only care that you have a line pointing more
or less in the right direction. But for anything where accuracy is 
required, line fitting by eye is down in the pits of things not to do, 
right next to "making up the answers you prefer".

-- 
Steven


#36430 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-01-08 13:50 +0000
Subject	Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID	<mailman.275.1357653482.2939.python-list@python.org>
In reply to	#36397

On 8 January 2013 01:23, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:
>
> [...]
>> I also think it would
>> be highly foolish to go so far with refusing to eyeball data that you
>> would accept the output of some regression algorithm even when it
>> clearly looks wrong.
>
> I never said anything of the sort.
>
> I said, don't fit lines to data by eye. I didn't say not to sanity-check
> that your straight line fit is reasonable by eyeballing it.

I should have been a little clearer. That was the situation when I
decided to just use a (digital) ruler - although really it was more of
a visual bisection (1, 2, 1.5, 1.25...). The regression result was
clearly wrong (and also invalid for the reasons Terry has described).
Some of the problems were easily fixable and others were not. I could
have spent an hour getting the code to make the line go where I wanted
it to, or I could just fit the line visually in about 2 minutes.

Oscar


#36464 — Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

From	Jason Friedman <jason@powerpull.net>
Date	2013-01-08 19:22 -0700
Subject	Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]
Message-ID	<mailman.301.1357698180.2939.python-list@python.org>
In reply to	#36368

> Statistical analysis is a huge science. So is lying. And I'm not sure
> most people can pick one from the other.

Chris, your sentence causes me to think of Mr. Twain's sentence, or at
least the one he popularized:
http://www.twainquotes.com/Statistics.html.


#36357

From	Robert Kern <robert.kern@gmail.com>
Date	2013-01-07 15:35 +0000
Message-ID	<mailman.224.1357572924.2939.python-list@python.org>
In reply to	#36321

On 07/01/2013 15:20, Oscar Benjamin wrote:
> On 7 January 2013 05:11, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
>>
>>> On 7 January 2013 01:46, Steven D'Aprano
>>> <steve+comp.lang.python@pearwood.info> wrote:
>>>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>>>
>>>> I'm not sure that this approach is statistically robust. No, let me be
>>>> even more assertive: I'm sure that this approach is NOT statistically
>>>> robust, and may be scientifically dubious.
>>>
>>> Whether or not this is "statistically robust" requires more explanation
>>> about the OP's intention.
>>
>> Not really. Statistical robustness is objectively defined, and the user's
>> intention doesn't come into it. The mean is not a robust measure of
>> central tendency, the median is, regardless of why you pick one or the
>> other.
>
> Okay, I see what you mean. I wasn't thinking of robustness as a
> technical term but now I see that you are correct.
>
> Perhaps what I should have said is that whether or not this matters
> depends on the problem at hand (hopefully this isn't an important
> medical trial) and the particular type of data that you have; assuming
> normality is fine in many cases even if the data is not "really"
> normal.

"Having outliers" literally means that assuming normality is not fine. If 
assuming normality were fine, then you wouldn't need to remove outliers.
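
The quoted point about robustness (the mean is not a robust measure of central tendency, the median is) can be illustrated with a small sketch using only the standard library. The readings below are made up, and `mad` is a hypothetical helper for the median absolute deviation, used here as a robust counterpart to the standard deviation.

```python
import statistics


def mad(values):
    """Median absolute deviation from the median (hypothetical helper)."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


# Made-up readings that should all be close to 10, plus one wild value.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1]
contaminated = clean + [100.0]

for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(label)
    print("  mean   =", statistics.mean(data))
    print("  stdev  =", statistics.stdev(data))
    print("  median =", statistics.median(data))
    print("  MAD    =", mad(data))
```

A single bad reading drags the mean and standard deviation far from the bulk of the data, while the median and MAD barely move. This is why a "mean plus or minus k standard deviations" cutoff can be distorted by the very outliers it is meant to detect.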

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco


#36339

From	"Joseph L. Casale" <jcasale@activenetwerx.com>
Date	2013-01-07 02:12 +0000
Message-ID	<mailman.212.1357552594.2939.python-list@python.org>
In reply to	#36314

> In other words: this approach for detecting outliers is nothing more than
> a very rough, and very bad, heuristic, and should be avoided.

Heh, very true but the results will only be used for conversational purposes.
I am making an assumption that the data is normally distributed and I do expect
valid results to all be very nearly the same.

> You can read up more about outlier detection, and the difficulties 
> thereof, here:


I much appreciate the links and the thought in the post. I'll admit I didn't
realize outlier detection was as involved.


Again, thanks!
jlc
