Groups > comp.lang.python > #36279 > unrolled thread
| Started by | "Joseph L. Casale" <jcasale@activenetwerx.com> |
|---|---|
| First post | 2013-01-06 19:44 +0000 |
| Last post | 2013-01-07 02:12 +0000 |
| Articles | 20 on this page of 28 — 11 participants |
Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-06 19:44 +0000
Re: Numpy outlier removal Hans Mulder <hansmu@xs4all.nl> - 2013-01-06 23:33 +0100
RE: Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-06 22:50 +0000
Re: Numpy outlier removal MRAB <python@mrabarnett.plus.com> - 2013-01-06 23:18 +0000
Re: Numpy outlier removal Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 01:46 +0000
Re: Numpy outlier removal "Paul Simon" <psimon@sonic.net> - 2013-01-06 18:21 -0800
Re: Numpy outlier removal Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 02:29 +0000
Re: Numpy outlier removal Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 05:11 +0000
Re: Numpy outlier removal Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 15:20 +0000
[Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-07 17:58 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-08 06:43 +1100
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-08 02:06 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-08 17:35 +1100
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Robert Kern <robert.kern@gmail.com> - 2013-01-08 15:55 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Chris Angelico <rosuav@gmail.com> - 2013-01-09 07:14 +1100
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-09 07:50 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Robert Kern <robert.kern@gmail.com> - 2013-01-08 22:59 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-07 22:32 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-08 01:23 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Terry Reedy <tjreedy@udel.edu> - 2013-01-08 04:07 -0500
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Maarten <maarten.sneep@knmi.nl> - 2013-01-08 08:47 -0800
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Maarten <maarten.sneep@knmi.nl> - 2013-01-08 08:47 -0800
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-09 00:02 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-01-08 13:50 +0000
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Jason Friedman <jason@powerpull.net> - 2013-01-08 19:22 -0700
Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] Jason Friedman <jason@powerpull.net> - 2013-01-08 19:23 -0700
Re: Numpy outlier removal Robert Kern <robert.kern@gmail.com> - 2013-01-07 15:35 +0000
RE: Numpy outlier removal "Joseph L. Casale" <jcasale@activenetwerx.com> - 2013-01-07 02:12 +0000
| From | "Joseph L. Casale" <jcasale@activenetwerx.com> |
|---|---|
| Date | 2013-01-06 19:44 +0000 |
| Subject | Numpy outlier removal |
| Message-ID | <mailman.179.1357501521.2939.python-list@python.org> |
I have a dataset that consists of a dict with text descriptions and values that are integers. If required, I collect the values into a list and create a numpy array running it through a simple routine:

    data[abs(data - mean(data)) < m * std(data)]

where m is the number of std deviations to include.

The problem is I lose track of which were removed so the original display of the dataset is misleading when the processed average is returned as it includes the removed key/values.

Anyone know how I can maintain the relationship and when I exclude a value, remove it from the dict?

Thanks!
jlc
| From | Hans Mulder <hansmu@xs4all.nl> |
|---|---|
| Date | 2013-01-06 23:33 +0100 |
| Message-ID | <50e9fbd5$0$6848$e4fe514c@news2.news.xs4all.nl> |
| In reply to | #36279 |
On 6/01/13 20:44:08, Joseph L. Casale wrote:
> I have a dataset that consists of a dict with text descriptions and values that are integers. If
> required, I collect the values into a list and create a numpy array running it through a simple
> routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
> to include.
>
>
> The problem is I lose track of which were removed so the original display of the dataset is
> misleading when the processed average is returned as it includes the removed key/values.
>
>
> Anyone know how I can maintain the relationship and when I exclude a value, remove it from
> the dict?
Assuming your data and the dictionary are keyed by a common set of keys:
    for key in descriptions:
        if abs(data[key] - mean(data)) >= m * std(data):
            del data[key]
            del descriptions[key]
Hope this helps,
-- HansM
| From | "Joseph L. Casale" <jcasale@activenetwerx.com> |
|---|---|
| Date | 2013-01-06 22:50 +0000 |
| Message-ID | <mailman.194.1357512697.2939.python-list@python.org> |
| In reply to | #36296 |
> Assuming your data and the dictionary are keyed by a common set of keys:
>
> for key in descriptions:
>     if abs(data[key] - mean(data)) >= m * std(data):
>         del data[key]
>         del descriptions[key]

Heh, yeah sometimes the obvious is too simple to see. I used a dict comp to rebuild the results with the comparison.

Thanks!
jlc
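The dict comprehension mentioned here was not posted, so the following is only a guess at its shape. It is a minimal sketch that assumes a single dict mapping descriptions to integer values; the function name `keep_inliers` and the sample data are made up for illustration. It uses the standard-library `statistics` module rather than numpy, and `statistics.pstdev` is chosen because it matches the default behaviour of `numpy.std` (population standard deviation, `ddof=0`).

```python
import statistics

def keep_inliers(dataset, m):
    """Return a new dict with only the entries within m standard deviations of the mean."""
    values = list(dataset.values())
    center = statistics.mean(values)
    # Population SD, the same quantity numpy.std returns by default (ddof=0).
    spread = statistics.pstdev(values)
    # Keys travel with their values, so nothing loses its description.
    return {key: value for key, value in dataset.items()
            if abs(value - center) < m * spread}

# Hypothetical data: four similar readings and one extreme one.
dataset = {"alpha": 10, "beta": 12, "gamma": 11, "delta": 13, "epsilon": 100}
cleaned = keep_inliers(dataset, m=1.5)
removed = sorted(set(dataset) - set(cleaned))
```

Because the result is a new dict, the original stays available for reporting both the full and the filtered figures. Note that if every value is identical the spread is zero and the strict `<` test rejects everything, a property inherited from the original expression.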
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-01-06 23:18 +0000 |
| Message-ID | <mailman.196.1357514503.2939.python-list@python.org> |
| In reply to | #36296 |
On 2013-01-06 22:33, Hans Mulder wrote:
> On 6/01/13 20:44:08, Joseph L. Casale wrote:
>> I have a dataset that consists of a dict with text descriptions and values that are integers. If
>> required, I collect the values into a list and create a numpy array running it through a simple
>> routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
>> to include.
>>
>>
>> The problem is I lose track of which were removed so the original display of the dataset is
>> misleading when the processed average is returned as it includes the removed key/values.
>>
>>
>> Anyone know how I can maintain the relationship and when I exclude a value, remove it from
>> the dict?
>
> Assuming your data and the dictionary are keyed by a common set of keys:
>
> for key in descriptions:
>     if abs(data[key] - mean(data)) >= m * std(data):
>         del data[key]
>         del descriptions[key]
>
It's generally a bad idea to modify a collection over which you're
iterating. It's better to, say, make a list of what you're going to
delete and then iterate over that list to make the deletions:
    deletions = []
    for key in descriptions:
        if abs(data[key] - mean(data)) >= m * std(data):
            deletions.append(key)
    for key in deletions:
        del data[key]
        del descriptions[key]
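For readers who want to run the two-pass idea, here is a self-contained sketch. It assumes `data` and `descriptions` are two dicts sharing the same keys, as in the posts above; the function name and sample values are invented. One deliberate difference from the posted loops: the mean and standard deviation are computed once, before any deletion, so every key is judged against the same thresholds. In Python 3, deleting from a dict while iterating over it raises `RuntimeError: dictionary changed size during iteration`, which is why the keys to delete are collected first.

```python
import statistics

def drop_outliers_in_place(data, descriptions, m):
    """Delete outlying keys from both dicts; return the list of deleted keys."""
    values = list(data.values())
    center = statistics.mean(values)     # computed once, before any deletion
    spread = statistics.pstdev(values)   # population SD (ddof=0)
    # First pass: only decide what to delete, do not touch the dicts.
    deletions = [key for key in descriptions
                 if abs(data[key] - center) >= m * spread]
    # Second pass: iterate over the separate list, so mutating the dicts is safe.
    for key in deletions:
        del data[key]
        del descriptions[key]
    return deletions

# Hypothetical parallel dicts keyed by the same identifiers.
data = {"a": 10, "b": 12, "c": 11, "d": 13, "e": 100}
descriptions = {"a": "first", "b": "second", "c": "third", "d": "fourth", "e": "fifth"}
deleted = drop_outliers_in_place(data, descriptions, m=1.5)
```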
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-01-07 01:46 +0000 |
| Message-ID | <50ea28e7$0$30003$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #36279 |
On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:

> I have a dataset that consists of a dict with text descriptions and
> values that are integers. If required, I collect the values into a list
> and create a numpy array running it through a simple routine:
>
> data[abs(data - mean(data)) < m * std(data)]
>
> where m is the number of std deviations to include.

I'm not sure that this approach is statistically robust. No, let me be even more assertive: I'm sure that this approach is NOT statistically robust, and may be scientifically dubious.

The above assumes your data is normally distributed. How sure are you that this is actually the case?

For normally distributed data:

Since both the mean and std calculations are affected by the presence of outliers, your test for what counts as an outlier will miss outliers for data from a normal distribution. For small N (sample size), it may be mathematically impossible for any data point to be greater than m*SD from the mean. For example, with N=5, no data point can be more than 1.789*SD from the mean. So for N=5, m=1 may throw away good data, and m=2 will fail to find any outliers no matter how outrageous they are.

For large N, you will expect to find significant numbers of data points more than m*SD from the mean. With N=100000, and m=3, you will expect to throw away 270 perfectly good data points simply because they are out on the tails of the distribution.

Worse, if the data is not in fact from a normal distribution, all bets are off. You may be keeping obvious outliers; or more often, your test will be throwing away perfectly good data that it misidentifies as outliers.

In other words: this approach for detecting outliers is nothing more than a very rough, and very bad, heuristic, and should be avoided.

Identifying outliers is fraught with problems even for experts. For example, the ozone hole over the Antarctic was ignored for many years because the software being used to analyse it misidentified the data as outliers.

The best general advice I have seen is:

Never automatically remove outliers except for values that are physically impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"), unless you have good, solid, physical reasons for justifying removal of outliers. Other than that, manually remove outliers with care, or not at all, and if you do so, always report your results twice, once with all the data, and once with supposed outliers removed.

You can read up more about outlier detection, and the difficulties thereof, here:

http://www.medcalc.org/manual/outliers.php

https://secure.graphpad.com/guides/prism/6/statistics/index.htm

http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html

http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations

-- Steven
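The two numbers in this post (1.789 for N=5, and 270 for N=100000 with m=3) can be checked with a few lines of standard-library Python. This is a sketch added for illustration, not part of the original post. It assumes the 1.789 figure refers to the sample standard deviation (divisor N-1), for which the largest possible deviation is (N-1)/sqrt(N); with the population standard deviation (divisor N, which is what `numpy.std` computes by default) the corresponding limit is sqrt(N-1), i.e. exactly 2 for N=5. The helper name `max_abs_z` and the sample data are hypothetical.

```python
import math
import statistics

def max_abs_z(sample, sd):
    """Largest distance from the mean, measured in units of the given SD function."""
    center = statistics.mean(sample)
    return max(abs(x - center) for x in sample) / sd(sample)

# Worst case for N=5: four identical values and one arbitrarily extreme value.
sample = [0, 0, 0, 0, 1_000_000]
n = len(sample)

z_sample_sd = max_abs_z(sample, statistics.stdev)    # divisor N-1 (ddof=1)
z_population_sd = max_abs_z(sample, statistics.pstdev)  # divisor N (ddof=0)

bound_sample_sd = (n - 1) / math.sqrt(n)   # about 1.789 for N=5
bound_population_sd = math.sqrt(n - 1)     # exactly 2 for N=5

# Two-sided tail probability of a normal variable lying beyond 3 SD,
# and the expected number of such points among 100000 normal draws.
tail_beyond_3sd = math.erfc(3 / math.sqrt(2))
expected_flagged = 100_000 * tail_beyond_3sd   # about 270
```

However large the fifth value is made, its z-score cannot exceed these limits, which is why a cutoff of m=2 is unable to do useful work on a sample of five.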
| From | "Paul Simon" <psimon@sonic.net> |
|---|---|
| Date | 2013-01-06 18:21 -0800 |
| Message-ID | <50ea3199$0$80136$742ec2ed@news.sonic.net> |
| In reply to | #36314 |
"Steven D'Aprano" <steve+comp.lang.python@pearwood.info> wrote in message news:50ea28e7$0$30003$c3e8da3$5496439d@news.astraweb.com...

> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.
>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.
>
> The above assumes your data is normally distributed. How sure are you
> that this is actually the case?
>
> For normally distributed data:
>
> Since both the mean and std calculations are affected by the presence of
> outliers, your test for what counts as an outlier will miss outliers for
> data from a normal distribution. For small N (sample size), it may be
> mathematically impossible for any data point to be greater than m*SD from
> the mean. For example, with N=5, no data point can be more than 1.789*SD
> from the mean. So for N=5, m=1 may throw away good data, and m=2 will
> fail to find any outliers no matter how outrageous they are.
>
> For large N, you will expect to find significant numbers of data points
> more than m*SD from the mean. With N=100000, and m=3, you will expect to
> throw away 270 perfectly good data points simply because they are out on
> the tails of the distribution.
>
> Worse, if the data is not in fact from a normal distribution, all bets
> are off. You may be keeping obvious outliers; or more often, your test
> will be throwing away perfectly good data that it misidentifies as
> outliers.
>
> In other words: this approach for detecting outliers is nothing more than
> a very rough, and very bad, heuristic, and should be avoided.
>
> Identifying outliers is fraught with problems even for experts. For
> example, the ozone hole over the Antarctic was ignored for many years
> because the software being used to analyse it misidentified the data as
> outliers.
>
> The best general advice I have seen is:
>
> Never automatically remove outliers except for values that are physically
> impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
> unless you have good, solid, physical reasons for justifying removal of
> outliers. Other than that, manually remove outliers with care, or not at
> all, and if you do so, always report your results twice, once with all
> the data, and once with supposed outliers removed.
>
> You can read up more about outlier detection, and the difficulties
> thereof, here:
>
> http://www.medcalc.org/manual/outliers.php
>
> https://secure.graphpad.com/guides/prism/6/statistics/index.htm
>
> http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html
>
> http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations
>
> --
> Steven

If you suspect that the data may not be normal you might look at exploratory data analysis, see Tukey. It's descriptive rather than analytic, treats outliers respectfully, uses median rather than mean, and is very visual. Wherever I analyzed data both gaussian and with EDA, EDA always won.

Paul
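One rule commonly associated with Tukey's exploratory approach is the box-plot "fences" rule: values further than 1.5 times the interquartile range outside the quartiles are marked for a closer look. The post does not name a specific technique, so treating this rule as what was meant is an assumption, and the sketch below flags values rather than deleting them. The function name `tukey_fences` and the readings are hypothetical; `statistics.quantiles` requires Python 3.8 or later, and different quartile conventions give slightly different fences.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (low, high) fences at k interquartile ranges beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical readings keyed by description.
readings = {"r1": 12, "r2": 10, "r3": 13, "r4": 11, "r5": 100,
            "r6": 14, "r7": 12, "r8": 15, "r9": 13}

low, high = tukey_fences(list(readings.values()))
# Flag values outside the fences for inspection; nothing is deleted here.
flagged = {key: value for key, value in readings.items()
           if not low <= value <= high}
```

Because quartiles barely move when one value is extreme, the fences stay near the bulk of the data, which is the practical benefit of a median-based description.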
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-01-07 02:29 +0000 |
| Message-ID | <mailman.205.1357525775.2939.python-list@python.org> |
| In reply to | #36314 |
On 7 January 2013 01:46, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.
>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.

Whether or not this is "statistically robust" requires more explanation about the OP's intention. Thus far, the OP has not given any reason/motivation for excluding data or even for having any data in the first place! It's hard to say whether any technique applied is really accurate/robust without knowing *anything* about the purpose of the operation.

Oscar
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-01-07 05:11 +0000 |
| Message-ID | <50ea58f0$0$21851$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #36316 |
On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:

> On 7 January 2013 01:46, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>
>>> I have a dataset that consists of a dict with text descriptions and
>>> values that are integers. If required, I collect the values into a
>>> list and create a numpy array running it through a simple routine:
>>>
>>> data[abs(data - mean(data)) < m * std(data)]
>>>
>>> where m is the number of std deviations to include.
>>
>> I'm not sure that this approach is statistically robust. No, let me be
>> even more assertive: I'm sure that this approach is NOT statistically
>> robust, and may be scientifically dubious.
>
> Whether or not this is "statistically robust" requires more explanation
> about the OP's intention.

Not really. Statistical robustness is objectively defined, and the user's intention doesn't come into it. The mean is not a robust measure of central tendency, the median is, regardless of why you pick one or the other.

There are sometimes good reasons for choosing non-robust statistics or techniques over robust ones, but some techniques are so dodgy that there is *never* a good reason for doing so. E.g. finding the line of best fit by eye, or taking more and more samples until you get a statistically significant result. Such techniques are not just non-robust in the statistical sense, but non-robust in the general sense, if not outright deceitful.

-- Steven
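The claim that the mean is not robust while the median is can be seen with a tiny made-up example: corrupt a single value and compare how far each statistic moves. The numbers below are invented purely for illustration.

```python
import statistics

clean = [10, 11, 12, 13, 14]
spoiled = [10, 11, 12, 13, 1400]   # same data with one wildly wrong entry

# How far each summary moves when a single value is corrupted.
mean_shift = statistics.mean(spoiled) - statistics.mean(clean)
median_shift = statistics.median(spoiled) - statistics.median(clean)
```

A single bad value drags the mean by hundreds here while the median does not move at all, and making the bad value larger moves the mean without limit.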
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-01-07 15:20 +0000 |
| Message-ID | <mailman.223.1357572059.2939.python-list@python.org> |
| In reply to | #36321 |
On 7 January 2013 05:11, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote: > >> On 7 January 2013 01:46, Steven D'Aprano >> <steve+comp.lang.python@pearwood.info> wrote: >>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote: >>> >>> I'm not sure that this approach is statistically robust. No, let me be >>> even more assertive: I'm sure that this approach is NOT statistically >>> robust, and may be scientifically dubious. >> >> Whether or not this is "statistically robust" requires more explanation >> about the OP's intention. > > Not really. Statistics robustness is objectively defined, and the user's > intention doesn't come into it. The mean is not a robust measure of > central tendency, the median is, regardless of why you pick one or the > other. Okay, I see what you mean. I wasn't thinking of robustness as a technical term but now I see that you are correct. Perhaps what I should have said is that whether or not this matters depends on the problem at hand (hopefully this isn't an important medical trial) and the particular type of data that you have; assuming normality is fine in many cases even if the data is not "really" normal. > > There are sometimes good reasons for choosing non-robust statistics or > techniques over robust ones, but some techniques are so dodgy that there > is *never* a good reason for doing so. E.g. finding the line of best fit > by eye, or taking more and more samples until you get a statistically > significant result. Such techniques are not just non-robust in the > statistical sense, but non-robust in the general sense, if not outright > deceitful. There are sometimes good reasons to get a line of best fit by eye. In particular if your data contains clusters that are hard to separate, sometimes it's useful to just pick out roughly where you think a line through a subset of the data is. Oscar
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-01-07 17:58 +0000 |
| Subject | [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <50eb0cd2$0$30003$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #36355 |
On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote: > There are sometimes good reasons to get a line of best fit by eye. In > particular if your data contains clusters that are hard to separate, > sometimes it's useful to just pick out roughly where you think a line > through a subset of the data is. Cherry picking subsets of your data as well as line fitting by eye? Two wrongs do not make a right. If you're going to just invent a line based on where you think it should be, what do you need the data for? Just declare "this is the line I wish to believe in" and save yourself the time and energy of collecting the data in the first place. Your conclusion will be no less valid. How do you distinguish between "data contains clusters that are hard to separate" from "data doesn't fit a line at all"? Even if the data actually is linear, on what basis could we distinguish between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely subjective judgement can be equally denied on the basis of subjective judgement. Anyone can fool themselves into placing a line through a subset of non- linear data. Or, sadly more often, *deliberately* cherry picking fake clusters in order to fool others. Here is a real world example of what happens when people pick out the data clusters that they like based on visual inspection: http://www.skepticalscience.com/images/TempEscalator.gif And not linear by any means, but related to the cherry picking theme: http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif To put it another way, when we fit patterns to data by eye, we can easily fool ourselves into seeing patterns that aren't there, or missing the patterns which are there. At best line fitting by eye is prone to honest errors; at worst, it is open to the most deliberate abuse. 
We have eyes and brains that evolved to spot the ripe fruit in trees, not to spot linear trends in noisy data, and fitting by eye is not safe or appropriate. -- Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-01-08 06:43 +1100 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <mailman.237.1357587833.2939.python-list@python.org> |
| In reply to | #36368 |
On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > Anyone can fool themselves into placing a line through a subset of non- > linear data. Or, sadly more often, *deliberately* cherry picking fake > clusters in order to fool others. Here is a real world example of what > happens when people pick out the data clusters that they like based on > visual inspection: > > http://www.skepticalscience.com/images/TempEscalator.gif And sensible people will notice that, even drawn like that, it's only a ~0.6 deg increase across ~30 years. Hardly statistically significant, given that weather patterns have been known to follow cycles at least that long. But that's nothing to do with drawing lines through points, and more to do with how much data you collect before you announce a conclusion, and how easily a graph can prove any point you like. Statistical analysis is a huge science. So is lying. And I'm not sure most people can pick one from the other. ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-01-08 02:06 +0000 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <50eb7f2a$0$30003$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #36373 |
On Tue, 08 Jan 2013 06:43:46 +1100, Chris Angelico wrote: > On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano > <steve+comp.lang.python@pearwood.info> wrote: >> Anyone can fool themselves into placing a line through a subset of non- >> linear data. Or, sadly more often, *deliberately* cherry picking fake >> clusters in order to fool others. Here is a real world example of what >> happens when people pick out the data clusters that they like based on >> visual inspection: >> >> http://www.skepticalscience.com/images/TempEscalator.gif > > And sensible people will notice that, even drawn like that, it's only a > ~0.6 deg increase across ~30 years. Hardly statistically significant, Well, I don't know about "sensible people", but magnitude of an effect has little to do with whether or not something is statistically significant or not. Given noisy data, statistical significance relates to whether or not we can be confident that the effect is *real*, not whether it is a big effect or a small effect. Here's an example: assume that you are on a fixed salary with a constant weekly income. If you happen to win the lottery one day, and consequently your income for that week quadruples, that is a large effect that fails to have any statistical significance -- it's a blip, not part of any long- term change in income. You can't conclude that you'll win the lottery every week from now on. On the other hand, if the government changes the rules relating to tax, deductions, etc., even by a small amount, your weekly income might go down, or up, by a single dollar. Even though that is a tiny effect, it is *not* a blip, and will be statistically significant. In practice, it takes a certain number of data points to reach that confidence level. Your accountant, who knows the tax laws, will conclude that the change is real immediately, but a statistician who sees only the pay slips may take some months before she is convinced that the change is signal rather than noise. 
With only three weeks pay slips in hand, the statistician cannot be sure that the difference is not just some accounting error or other fluke, but each additional data point increases the confidence that the difference is real and not just some temporary aberration. The other meaning of "significant" has nothing to do with statistics, and everything to do with "a difference is only a difference if it makes a difference". 0.2° per decade doesn't sound like much, not when we consider daily or yearly temperatures that typically have a range of tens of degrees between night and day, or winter and summer. But that is misunderstanding the nature of long-term climate versus daily weather and glossing over the fact that we're only talking about an average and ignoring changes to the variability of the climate: a small increase in average can lead to a large increase in extreme events. > given that weather patterns have been known to follow cycles at least > that long. That is not a given. "Weather patterns" don't last for thirty years. Perhaps you are talking about climate patterns? In which case, well, yes, we can see a very strong climate pattern of warming on a time scale of decades, with no evidence that it is a cycle. There are, of course, many climate cycles that take place on a time frame of years or decades, such as the North Atlantic Oscillation and the El Nino Southern Oscillation. None of them are global, and as far as I know none of them are exactly periodic. They are noise in the system, and certainly not responsible for linear trends. -- Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-01-08 17:35 +1100 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <mailman.263.1357626941.2939.python-list@python.org> |
| In reply to | #36403 |
On Tue, Jan 8, 2013 at 1:06 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: >> given that weather patterns have been known to follow cycles at least >> that long. > > That is not a given. "Weather patterns" don't last for thirty years. > Perhaps you are talking about climate patterns? Yes, that's what I meant. In any case, debate about global warming is quite tangential to the point about statistical validity; it looks quite significant to show a line going from the bottom of the graph to the top, but sounds a lot less noteworthy when you see it as a half-degree increase on about (I think?) 30 degrees, and even less when you measure temperatures in absolute scale (Kelvin) and it's half a degree in three hundred. Those are principles worth considering, regardless of the subject matter. If your railway tracks have widened by a full eight millimeters due to increased pounding from heavier vehicles travelling over it, that's significant and dangerous on HO-scale model trains, but utterly insignificant on 5'3" gauge. ChrisA
| From | Robert Kern <robert.kern@gmail.com> |
|---|---|
| Date | 2013-01-08 15:55 +0000 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <mailman.281.1357660546.2939.python-list@python.org> |
| In reply to | #36403 |
On 08/01/2013 06:35, Chris Angelico wrote: > On Tue, Jan 8, 2013 at 1:06 PM, Steven D'Aprano > <steve+comp.lang.python@pearwood.info> wrote: >>> given that weather patterns have been known to follow cycles at least >>> that long. >> >> That is not a given. "Weather patterns" don't last for thirty years. >> Perhaps you are talking about climate patterns? > > Yes, that's what I meant. In any case, debate about global warming is > quite tangential to the point about statistical validity; it looks > quite significant to show a line going from the bottom of the graph to > the top, but sounds a lot less noteworthy when you see it as a > half-degree increase on about (I think?) 30 degrees, and even less > when you measure temperatures in absolute scale (Kelvin) and it's half > a degree in three hundred. Why on Earth do you think that the distance from nominal surface temperatures to freezing much less absolute 0 is the right scale to compare global warming changes against? You need to compare against the size of global mean temperature changes that would cause large amounts of human suffering, and that scale is on the order of a *few* degrees, not hundreds. A change of half a degree over a few decades with no signs of slowing down *should* be alarming. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-01-09 07:14 +1100 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <mailman.289.1357676101.2939.python-list@python.org> |
| In reply to | #36403 |
On Wed, Jan 9, 2013 at 2:55 AM, Robert Kern <robert.kern@gmail.com> wrote: > On 08/01/2013 06:35, Chris Angelico wrote: >> ... it looks >> quite significant to show a line going from the bottom of the graph to >> the top, but sounds a lot less noteworthy when you see it as a >> half-degree increase on about (I think?) 30 degrees, and even less >> when you measure temperatures in absolute scale (Kelvin) and it's half >> a degree in three hundred. > > Why on Earth do you think that the distance from nominal surface > temperatures to freezing much less absolute 0 is the right scale to compare > global warming changes against? You need to compare against the size of > global mean temperature changes that would cause large amounts of human > suffering, and that scale is on the order of a *few* degrees, not hundreds. > A change of half a degree over a few decades with no signs of slowing down > *should* be alarming. I didn't say what it should be; I gave three examples. And as I said, this is not the forum to debate climate change; I was just using it as an example of statistical reporting. Three types of lies. ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-01-09 07:50 +0000 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <50ed2154$0$29898$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #36449 |
On Wed, 09 Jan 2013 07:14:51 +1100, Chris Angelico wrote: > Three types of lies. Oh, surely more than that. White lies. Regular or garden variety lies. Malicious lies. Accidental or innocent lies. FUD -- "fear, uncertainty, doubt". Half-truths. Lying by omission. Exaggeration and understatement. Propaganda. Misinformation. Disinformation. Deceit by emphasis. And manufactured doubt. E.g. the decades long campaign by the tobacco companies to deny that tobacco products caused cancer, when their own scientists were telling them that they did. Having learnt how valuable falsehoods are, those same manufacturers of doubt went on to sell their services to those who wanted to deny that CFCs destroyed ozone, and that CO2 causes warming. The old saw about "lies, damned lies and statistics" reminds me very much of a quote from Homer Simpson: "Pfff, facts, you can prove anything that's even remotely true with facts!" -- Steven
| From | Robert Kern <robert.kern@gmail.com> |
|---|---|
| Date | 2013-01-08 22:59 +0000 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <mailman.296.1357685961.2939.python-list@python.org> |
| In reply to | #36403 |
On 08/01/2013 20:14, Chris Angelico wrote: > On Wed, Jan 9, 2013 at 2:55 AM, Robert Kern <robert.kern@gmail.com> wrote: >> On 08/01/2013 06:35, Chris Angelico wrote: >>> ... it looks >>> quite significant to show a line going from the bottom of the graph to >>> the top, but sounds a lot less noteworthy when you see it as a >>> half-degree increase on about (I think?) 30 degrees, and even less >>> when you measure temperatures in absolute scale (Kelvin) and it's half >>> a degree in three hundred. >> >> Why on Earth do you think that the distance from nominal surface >> temperatures to freezing much less absolute 0 is the right scale to compare >> global warming changes against? You need to compare against the size of >> global mean temperature changes that would cause large amounts of human >> suffering, and that scale is on the order of a *few* degrees, not hundreds. >> A change of half a degree over a few decades with no signs of slowing down >> *should* be alarming. > > I didn't say what it should be; Actually, you did. You stated that "a ~0.6 deg increase across ~30 years [is h]ardly statistically significant". Ignoring the confusion between statistical significance and practical significance (as external criteria like the difference between the nominal temp and absolute 0 or the right criteria that I mentioned has nothing to do with statistical significance), you made a positive claim that it wasn't significant. > I gave three examples. You gave negligently incorrect ones. Whether your comments were on topic or not, you deserve to be called on them when they are wrong. > And as I said, > this is not the forum to debate climate change; I was just using it as > an example of statistical reporting. > > Three types of lies. FUD is a fourth. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-01-07 22:32 +0000 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <mailman.239.1357597977.2939.python-list@python.org> |
| In reply to | #36368 |
On 7 January 2013 17:58, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote: > >> There are sometimes good reasons to get a line of best fit by eye. In >> particular if your data contains clusters that are hard to separate, >> sometimes it's useful to just pick out roughly where you think a line >> through a subset of the data is. > > Cherry picking subsets of your data as well as line fitting by eye? Two > wrongs do not make a right. It depends on what you're doing, though. I wouldn't use an eyeball fit to get numbers that were an important part of the conclusion of some or other study. I would very often use it while I'm just in the process of trying to understand something. > If you're going to just invent a line based on where you think it should > be, what do you need the data for? Just declare "this is the line I wish > to believe in" and save yourself the time and energy of collecting the > data in the first place. Your conclusion will be no less valid. An example: Earlier today I was looking at some experimental data. A simple model of the process underlying the experiment suggests that two variables x and y will vary in direct proportion to one another and the data broadly reflects this. However, at this stage there is some non-normal variability in the data, caused by experimental difficulties. A subset of the data appears to closely follow a well defined linear pattern but there are outliers and the pattern breaks down in an asymmetric way at larger x and y values. At some later time either the sources of experimental variation will be reduced, or they will be better understood but for now it is still useful to estimate the constant of proportionality in order to check whether it seems consistent with the observed values of z. 
With this particular dataset I would have wasted a lot of time if I had tried to find a computational method to match the line that to me was very visible so I chose the line visually. > > How do you distinguish between "data contains clusters that are hard to > separate" from "data doesn't fit a line at all"? > In the example I gave it isn't possible to make that distinction with the currently available data. That doesn't make it meaningless to try and estimate the parameters of the relationship between the variables using the preliminary data. > Even if the data actually is linear, on what basis could we distinguish > between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit > by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely > subjective judgement can be equally denied on the basis of subjective > judgement. It gets a bit easier if the line is constrained to go through the origin. You seem to be thinking that the important thing is proving that the line is "real", rather than identifying where it is. Both things are important but not necessarily in the same problem. In my example, the "real line" may not be straight and may not go through the origin, but it is definitely there and if there were no experimental problems then the data would all be very close to it. > Anyone can fool themselves into placing a line through a subset of non- > linear data. Or, sadly more often, *deliberately* cherry picking fake > clusters in order to fool others. 
Here is a real world example of what > happens when people pick out the data clusters that they like based on > visual inspection: > > http://www.skepticalscience.com/images/TempEscalator.gif > > And not linear by any means, but related to the cherry picking theme: > > http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif > > > To put it another way, when we fit patterns to data by eye, we can easily > fool ourselves into seeing patterns that aren't there, or missing the > patterns which are there. At best line fitting by eye is prone to honest > errors; at worst, it is open to the most deliberate abuse. We have eyes > and brains that evolved to spot the ripe fruit in trees, not to spot > linear trends in noisy data, and fitting by eye is not safe or > appropriate. This is all true. But the human brain is also in many ways much better than a typical computer program at recognising patterns in data when the data can be depicted visually. I would very rarely attempt to analyse data without representing it in some visual form. I also think it would be highly foolish to go so far with refusing to eyeball data that you would accept the output of some regression algorithm even when it clearly looks wrong. Oscar
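Editorial sketch: Oscar's point that the fit "gets a bit easier if the line is constrained to go through the origin" can be illustrated in a few lines. For the model y = k*x, the least-squares estimate of the constant of proportionality is k = sum(x*y) / sum(x*x). The function name and the data below are made up for illustration; this is not Oscar's dataset or his method (he chose his line visually).

```python
def slope_through_origin(xs, ys):
    """Least-squares slope k for the model y = k*x (no intercept term)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("all x values are zero; the slope is undefined")
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical measurements scattered around y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
k = slope_through_origin(xs, ys)
print("estimated constant of proportionality:", k)
```

With only one free parameter, two people fitting the same points have much less room to disagree than with a free slope and intercept, which is the sense in which the constraint makes the problem easier.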
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-01-08 01:23 +0000 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <50eb7513$0$30003$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #36377 |
On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote: > An example: Earlier today I was looking at some experimental data. A > simple model of the process underlying the experiment suggests that two > variables x and y will vary in direct proportion to one another and the > data broadly reflects this. However, at this stage there is some > non-normal variability in the data, caused by experimental difficulties. > A subset of the data appears to closely follow a well defined linear > pattern but there are outliers and the pattern breaks down in an > asymmetric way at larger x and y values. At some later time either the > sources of experimental variation will be reduced, or they will be > better understood but for now it is still useful to estimate the > constant of proportionality in order to check whether it seems > consistent with the observed values of z. With this particular dataset I > would have wasted a lot of time if I had tried to find a computational > method to match the line that to me was very visible so I chose the line > visually. If you mean: "I looked at the data, identified that the range a < x < b looks linear and the range x > b does not, then used least squares (or some other recognised, objective technique for fitting a line) to the data in that linear range" then I'm completely cool with that. That's fine, with the understanding that this is the first step in either fixing your measurement problems, fixing your model, or at least avoiding extrapolation into the non-linear range. But that is not fitting a line by eye, which is what I am talking about. 
If on the other hand you mean: "I looked at the data, identified that the range a < x < b looked linear, so I laid a ruler down over the graph and pushed it around until I was satisfied that the ruler looked more or less like it fitted the data points, according to my guess of what counts as a close fit" that *is* fitting a line by eye, and it is entirely subjective and extremely dodgy for anything beyond quick and dirty back of the envelope calculations[1]. That's okay if all you want is to get something within an order of magnitude or so, or a line roughly pointing in the right direction, but that's all. [...] > I also think it would > be highly foolish to go so far with refusing to eyeball data that you > would accept the output of some regression algorithm even when it > clearly looks wrong. I never said anything of the sort. I said, don't fit lines to data by eye. I didn't say not to sanity check that your straight line fit is reasonable by eyeballing it. [1] Or if your data is so accurate and noise-free that you hardly have to care about errors, since there clearly is one and only one straight line that passes through all the points. -- Steven
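Editorial sketch: the first procedure Steven describes (pick the range a < x < b that looks linear by eye, then fit that range with an objective method) might look like the following in plain Python. The function name, the bounds, and the data are hypothetical, and ordinary least squares is only one of the "recognised, objective" techniques he allows for.

```python
def fit_line_in_range(xs, ys, a, b):
    """Ordinary least-squares line through the points with a < x < b.

    Returns (slope, intercept). Points outside the open interval (a, b)
    are ignored, not modified.
    """
    pts = [(x, y) for x, y in zip(xs, ys) if a < x < b]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points inside the range")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all selected x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: linear (y = 2x + 1) up to x = 5, then it flattens out.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 11.5, 11.7, 11.8]
slope, intercept = fit_line_in_range(xs, ys, a=0.0, b=5.5)
print("slope:", slope, "intercept:", intercept)
```

The choice of a and b is still a judgement call made by looking at the plot; only the line itself is computed objectively, which is the distinction Steven draws. The fitted line should not be extrapolated beyond b.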
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-01-08 04:07 -0500 |
| Subject | Re: [Offtopic] Line fitting [was Re: Numpy outlier removal] |
| Message-ID | <mailman.268.1357636078.2939.python-list@python.org> |
| In reply to | #36397 |
On 1/7/2013 8:23 PM, Steven D'Aprano wrote: > On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote: > >> An example: Earlier today I was looking at some experimental data. A >> simple model of the process underlying the experiment suggests that two >> variables x and y will vary in direct proportion to one another and the >> data broadly reflects this. However, at this stage there is some >> non-normal variability in the data, caused by experimental difficulties. >> A subset of the data appears to closely follow a well defined linear >> pattern but there are outliers and the pattern breaks down in an >> asymmetric way at larger x and y values. At some later time either the >> sources of experimental variation will be reduced, or they will be >> better understood but for now it is still useful to estimate the >> constant of proportionality in order to check whether it seems >> consistent with the observed values of z. With this particular dataset I >> would have wasted a lot of time if I had tried to find a computational >> method to match the line that to me was very visible so I chose the line >> visually. > > > If you mean: > > "I looked at the data, identified that the range a < x < b looks linear > and the range x > b does not, then used least squares (or some other > recognised, objective technique for fitting a line) to the data in that > linear range" > > then I'm completely cool with that. If both x and y are measured values, then regressing x on y and y on x will give different answers and both will be wrong in that *neither* will be the best answer for the relationship between them. Oscar did not specify whether either was an experimentally set input variable. > But that is not fitting a line by eye, which is what I am talking about. 
With the line constrained to go through 0,0, a line eyeballed with a clear ruler could easily be better than either regression line, as a human will tend to minimize the deviations *perpendicular to the line*, which is the proper thing to do (assuming both variables are measured in the same units). -- Terry Jan Reedy
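Editorial sketch: Terry's two claims can be checked numerically for a line through the origin. Regressing y on x minimises vertical deviations, regressing x on y minimises horizontal ones, and a perpendicular (orthogonal) fit minimises distances measured at right angles to the line. The function names and data are made up for illustration. The perpendicular fit uses the closed form theta = 0.5 * atan2(2*Sxy, Sxx - Syy) for the line's angle, where Sxx, Syy and Sxy are the uncentred sums of squares and products.

```python
import math


def _sums(xs, ys):
    """Uncentred sums of squares and cross products."""
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxx, syy, sxy


def slope_y_on_x(xs, ys):
    """Slope of y = k*x minimising vertical deviations."""
    sxx, _, sxy = _sums(xs, ys)
    return sxy / sxx


def slope_x_on_y(xs, ys):
    """Fit x = c*y (horizontal deviations), reported as the slope dy/dx = 1/c."""
    _, syy, sxy = _sums(xs, ys)
    return syy / sxy


def slope_perpendicular(xs, ys):
    """Slope of the line through the origin minimising perpendicular distances.

    A vertical best-fit line cannot be expressed as a finite slope and is
    not handled specially here.
    """
    sxx, syy, sxy = _sums(xs, ys)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.tan(theta)


# Hypothetical data in which both x and y carry measurement error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.8, 3.3, 3.7]
k_yx = slope_y_on_x(xs, ys)
k_xy = slope_x_on_y(xs, ys)
k_perp = slope_perpendicular(xs, ys)
print("y on x:", k_yx, " x on y:", k_xy, " perpendicular:", k_perp)
```

For positively correlated data the perpendicular slope falls between the two regression slopes. As Terry notes, perpendicular distance is only meaningful when both variables are in the same units, so the result changes if one axis is rescaled.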