Groups > comp.graphics.apps.gnuplot > #223 > unrolled thread

Fitting: How does gnuplot calculate the covariance matrix?

Started by	Ingo Thies <ingo.thies@gmx.de>
First post	2011-04-08 13:05 +0200
Last post	2011-04-11 22:13 +0200
Articles	14 — 3 participants

Back to article view | Back to comp.graphics.apps.gnuplot

  Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-08 13:05 +0200
    Re: Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-08 17:18 +0200
      Re: Fitting: How does gnuplot calculate the covariance matrix? Hans-Bernhard Bröker <HBBroeker@t-online.de> - 2011-04-10 19:31 +0200
        Re: Fitting: How does gnuplot calculate the covariance matrix? Charles Allen <ca137tmp@earthlink.net> - 2011-04-10 20:14 -0500
          Re: Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-14 15:41 +0200
        Re: Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-11 21:42 +0200
          Re: Fitting: How does gnuplot calculate the covariance matrix? Hans-Bernhard Bröker <HBBroeker@t-online.de> - 2011-04-11 23:27 +0200
            Re: Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-12 11:28 +0200
              Re: Fitting: How does gnuplot calculate the covariance matrix? Hans-Bernhard Bröker <HBBroeker@t-online.de> - 2011-04-16 00:03 +0200
                Re: Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-16 18:06 +0200
        Re: Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-14 15:07 +0200
        Re: Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-15 13:47 +0200
    Re: Fitting: How does gnuplot calculate the covariance matrix? Hans-Bernhard Bröker <HBBroeker@t-online.de> - 2011-04-10 19:04 +0200
      Re: Fitting: How does gnuplot calculate the covariance matrix? Ingo Thies <ingo.thies@gmx.de> - 2011-04-11 22:13 +0200

#223 — Fitting: How does gnuplot calculate the covariance matrix?

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-08 13:05 +0200
Subject	Fitting: How does gnuplot calculate the covariance matrix?
Message-ID	<9088euFi3iU1@mid.individual.net>

Hi all,

I am trying to find out how exactly gnuplot calculates the covariance
matrix for a fitted function.

The background: I am still mistrusting the error ellipses one gets from
the eigenvalues. As already mentioned last year (aroung May or so), I
suspect that the resulting error contours are underestimated since the
best-fit chi^2 is assumed to be zero there, but it isn't.

I have compared the error ellipses with the true error contours from an
independent chi^2-2D-mapping routine, and found that the latter may be
considerably larger than the ones one would get from gnuplot's cov matrix.

However, basic information about covariance (e.g. from Wikipedia) are
misleading here, since they only deal with the covariance of measured
data themselves, not the errors. Also I didn't manage to get a clear
image from the sourcecode yet.

For the long term, it would be helpful to include the basic formulas and
the most important caveats and pitfalls (e.g. how may I trust the
results or not) into the gnuplot.pdf.

In our institute, there are people working with gnuplot and its fitting
algorihm, but lack of important information about how they are working
in detail.

Many thanks for any helpful reply!

-- 
Gruß,
      Ingo

[toc] | [next] | [standalone]

#224

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-08 17:18 +0200
Message-ID	<908n9pF2jaU1@mid.individual.net>
In reply to	#223

On 08.04.2011 13:05, I wrote:

> The background: I am still mistrusting the error ellipses one gets from
> the eigenvalues. As already mentioned last year (aroung May or so), I
> suspect that the resulting error contours are underestimated since the
> best-fit chi^2 is assumed to be zero there, but it isn't.

Another point seems to be even more problematic, that is the usage of
the "asymptotic standard error".

For example, one can use the 1-sigma error contour as a min-max
estimator for the error of a,b in case of a two-parameter fit. I have
done this for e.g. a linear fit, f(x)=a+b*x, and compared the results
from gnuplot with those from an independent approach. While the best-fit
values as well as the chi^2 and the a,b correlation values are the same
within the visibile digits (these aren't very much in gnuplot, though),
the error estimate from the 1-sig contour is typically by an order of
magnitude larger than the asymptotic standard error.

One could argue, the outline of the error contour overestimates the
error (since error ellipses are often highly elongated but slim along
their minor axis; especially for |cor_ab| \approx 1), but in the test
case, this would only reduce the a,b erros only by a factor of about 1/2.

The gnuplot.pdf states that the asymptotic error underestimates the true
error, but it wasn't clear to me up to now that it could do this by such
a vast amount.

Here are sample data I have been using for the test:

#x              y             dy
0.000000        1.078039      0.100000
0.200000        1.012137      0.100000
0.400000        0.994650      0.100000
0.600000        1.210933      0.100000
0.800000        1.228788      0.100000
1.000000        1.279018      0.100000
1.200000        1.510823      0.100000
1.400000        1.466704      0.100000
1.600000        1.523061      0.100000
1.800000        1.768180      0.100000
2.000000        1.894179      0.100000
2.200000        2.006980      0.100000
2.400000        1.994247      0.100000
2.600000        2.212952      0.100000
2.800000        2.123002      0.100000
3.000000        2.341036      0.100000
3.200000        2.321149      0.100000
3.400000        2.469557      0.100000
3.600000        2.310980      0.100000
3.800000        2.451646      0.100000
4.000000        2.652370      0.100000

Gnuplot finds:

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

a               = 0.948209         +/- 0.04083      (4.306%)
b               = 0.427096         +/- 0.01746      (4.089%)

correlation matrix of the fit parameters:

               a      b
a               1.000
b              -0.855  1.000

while an independent approach finds

a = 9.482095E-01 -/+-1.947044E-01, 1.947043E-01
b = 4.270960E-01 -/+-8.327496E-02, 8.327495E-02

chi^2_red = 0.939417
cor_ab    =-0.855398

i.e. the a error is +/- 0.195 instead of +/- 0.04, while the b error is
+/- 0.083 instead of +/- 0.017

Using the errors of each only for the other parameter at its best-fit
value, the errorbars are about +/- 0.1 and +/-0.05, respectively.

I did not do a large study of fitting functions, but the given results
suggest, in my opinion, that the asymptotic standard error should not be
used for an error discussion at all.

Instead, I would recommend to calculate the error ellipse from the
(eigenvalues of the) covariance matrix, corrected for non-zero minimum
chi^2, and use it as an (additional) error estimate.

Any opionions?

-- 
Gruß,
      Ingo

[toc] | [prev] | [next] | [standalone]

#227

From	Hans-Bernhard Bröker <HBBroeker@t-online.de>
Date	2011-04-10 19:31 +0200
Message-ID	<90e7s7F9faU1@mid.dfncis.de>
In reply to	#224

On 08.04.2011 17:18, Ingo Thies wrote:

> Another point seems to be even more problematic, that is the usage of
> the "asymptotic standard error".

That's actually not another point, it's exactly the same.

> Gnuplot finds:
>
> Final set of parameters            Asymptotic Standard Error
> =======================            ==========================
>
> a               = 0.948209         +/- 0.04083      (4.306%)
> b               = 0.427096         +/- 0.01746      (4.089%)
>
>
> correlation matrix of the fit parameters:
>
>                 a      b
> a               1.000
> b              -0.855  1.000
>
>
> while an independent approach finds
>
> a = 9.482095E-01 -/+-1.947044E-01, 1.947043E-01
> b = 4.270960E-01 -/+-8.327496E-02, 8.327495E-02

Those errors are _way_ too big.  Do yourself a favour and plot your data 
along with the model, using parameters modified by those errors:

[assuming 'set fit errorvar' active, and fit done:]

gnuplot> p 'fit.dat' u 1:2:3 w err, a*x+b w l
gnuplot> rep (a+a_err)*x+b+b_err, (a-a_err)*x+b-b_err
gnuplot> a_thies=0.195
gnuplot> b_thies=.083
gnuplot> rep (a+a_thies)*x+b+b_thies, (a-a_thies)*x+b-b_thies

You'll see that gnuplot's a_err and b_err yield a corridor somewhat 
tightly containing almost all data points and their errors, just like it 
should be.

Your independent approach defines a corridor that is way larger than it 
needs to be.  Particularly the error on 'a' leads to a massive 
overestimation of the data errors at large 'x'.  E.g. at x==4, the model 
with those parameter errors yields an interval of 2.6 +/- 0.8, while 
your data claimed 2.6 +/- 0.1

[toc] | [prev] | [next] | [standalone]

#228

From	Charles Allen <ca137tmp@earthlink.net>
Date	2011-04-10 20:14 -0500
Message-ID	<slrniq4lf3.4pm.ca137tmp@mobile.local>
In reply to	#227

> On 08.04.2011 17:18, Ingo Thies wrote:
>> Another point seems to be even more problematic, that is the usage of
>> the "asymptotic standard error".

Hans-Bernhard Bröker:
> Those errors are _way_ too big.  Do yourself a favour and plot your data 
> along with the model, using parameters modified by those errors:

Perhaps there is confusion about "standard error" (error on the mean)
versus standard deviation.  The factor of 1/sqrt(21) (or however many
data points there were) seems about right.  Gnuplot is reporting the
error on the mean, the other program may be reporting the standard
deviation.

-- 
Charles Allen

[toc] | [prev] | [next] | [standalone]

#257

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-14 15:41 +0200
Message-ID	<90obscFk42U1@mid.individual.net>
In reply to	#228

On 11.04.2011 03:14, Charles Allen wrote:

> Perhaps there is confusion about "standard error" (error on the mean)
> versus standard deviation.  The factor of 1/sqrt(21) (or however many
> data points there were) seems about right.  Gnuplot is reporting the
> error on the mean, the other program may be reporting the standard
> deviation.

Hmm, I have made some tests. If a,b are strongly correlated (happens
e.g. when the x values are far off the x0 in a formula like
f(x)=a+b*(x-x0), then the error contour of (a,b) is a strongly elongated
"banana" (an ellipse in approximation). In the given case, x0=0. Then,
if you vary a with fixed b=b_best and also vary b with fixed a=a_best,
the resulting errors are about the same as in gnuplot. If, however,
x=xmean (here =2), so that the a-b-correlation vanishes, these errors
increase to the full errors you get from the full extent of the error
ellipse. These are much larger then the errors from gnuplot.

But, tell me, if I have non-correlated a,b, thus an error circle, what
else then the upper and lower rim in both a and b direction of the
1-sigma-circle should I take as the standard deviation of a and b?

-- 
Gruß,
      Ingo

[toc] | [prev] | [next] | [standalone]

#234

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-11 21:42 +0200
Message-ID	<90h3ttF39uU1@mid.individual.net>
In reply to	#227

Am 2011-04-10 19:31, schrieb Hans-Bernhard Bröker:

>> a = 9.482095E-01 -/+-1.947044E-01, 1.947043E-01
>> b = 4.270960E-01 -/+-8.327496E-02, 8.327495E-02
>
> Those errors are _way_ too big. Do yourself a favour and plot your data
> along with the model, using parameters modified by those errors:

Well, I plotted the enveloping contours for all (a,b) sets on the 
1-sigma contour, and found that the resulting two curves nicely embrace 
the majority of data points, and the "error pipe" is about the same size 
as the errorbars. And along this 1-sigma contour a and b walk through 
the error intervals above.

In particular, the upper right figure in this poster has been created 
this way:

<http://www.iac.es/congreso/constellation10/media/posters/Thies_poster.pdf>

Please note that the error pipe is based on the 1 and 2 sigma level 
corresponding to chi^2-chi_min^2 rather than chi^2 itself.

>
> [assuming 'set fit errorvar' active, and fit done:]
>
> gnuplot> p 'fit.dat' u 1:2:3 w err, a*x+b w l

You mixed up a and b here; it is actually f(x)=a+b*x

> gnuplot> rep (a+a_err)*x+b+b_err, (a-a_err)*x+b-b_err
> gnuplot> a_thies=0.195
> gnuplot> b_thies=.083
> gnuplot> rep (a+a_thies)*x+b+b_thies, (a-a_thies)*x+b-b_thies
>
> You'll see that gnuplot's a_err and b_err yield a corridor somewhat
> tightly containing almost all data points and their errors, just like it
> should be.

The problem hier is that a and b are anti-correlated (the correlation 
coefficient in the lower-left of the matrix is negative), i.e. the 
larges b corresponds to the smallest a and vice-versa. The only viable 
way to plot the error pipe (or corridor) is to use the actual (a,b) 
tuples from the 1 (or 2, 3 etc.) sigma contour. Or, at least, combine 
the largest a with the smallest b:

[...]
gnuplot> rep a+a_thies+(b-b_thies)*x, a-a_thies+(b+b_thies)*x

-- 
Gruß,
       Ingo

[toc] | [prev] | [next] | [standalone]

#236

From	Hans-Bernhard Bröker <HBBroeker@t-online.de>
Date	2011-04-11 23:27 +0200
Message-ID	<4DA3725D.3040306@t-online.de>
In reply to	#234

On 11.04.2011 21:42, Ingo Thies wrote:
> Am 2011-04-10 19:31, schrieb Hans-Bernhard Bröker:
>
>>> a = 9.482095E-01 -/+-1.947044E-01, 1.947043E-01
>>> b = 4.270960E-01 -/+-8.327496E-02, 8.327495E-02

>> Those errors are _way_ too big. Do yourself a favour and plot your data
>> along with the model, using parameters modified by those errors:
>
> Well, I plotted the enveloping contours for all (a,b) sets on the
> 1-sigma contour,

So, was a comparison of the model function's values for parameter pairs 
(a,b), (a+sigma_a, b+sigma_b) and (a-sigma_a, b-sigma_b) at x=4.0 to the 
data value of y=2.65 +/- 0.1 part of that study?  Did you notice that 
you get a model corridor [after correcting my a<-->b mixup] from 2.13 to 
3.18, more than 5 times wider than the error in the data?

Or, to keep it short: did you even look at the plot I described?  What's 
actually so wrong with the error corridor you get from gnuplot alone?

> And along this 1-sigma contour a and b walk through
> the error intervals above.

So which parameters of the error ellipse did you give above: horizontal 
and vertical section at the center, or overall size of the enclosed, 
axis-aligned bounding box?

>> [assuming 'set fit errorvar' active, and fit done:]
>>
>> gnuplot> p 'fit.dat' u 1:2:3 w err, a*x+b w l
>
> You mixed up a and b here; it is actually f(x)=a+b*x

Yep.  Sorry about that.  But even with that mixup repaired, the message 
is still the same.  a_thies and b_thies are similar enough to each other 
for that...

As to using opposite-sign combinations of deviations on the parameters 
--- that's just as bad, but in a different way, since it completely 
underestimates the data errors in the middle of the region.

The errors you get from gnuplot achieve a much more sensible error band 
with a whole lot less hassle.

[toc] | [prev] | [next] | [standalone]

#240

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-12 11:28 +0200
Message-ID	<90ik9hF7p0U1@mid.individual.net>
In reply to	#236

On 11.04.2011 23:27, Hans-Bernhard Bröker wrote:

> So, was a comparison of the model function's values for parameter pairs
> (a,b), (a+sigma_a, b+sigma_b) and (a-sigma_a, b-sigma_b) at x=4.0 to the
> data value of y=2.65 +/- 0.1 part of that study?  Did you notice that
> you get a model corridor [after correcting my a<-->b mixup] from 2.13 to
> 3.18, more than 5 times wider than the error in the data?

Well, look at

http://www.astro.uni-bonn.de/~ithies/gnuplot/temp/l2envelope.eps

and the (a,b) contours (for 1,2 and 3 sigma)

http://www.astro.uni-bonn.de/~ithies/gnuplot/temp/l2contour.eps

A colleague actually used gnuplot's errorbars for a linear fit to a set
of about 100 data points. The resulting error contours were *tiny*, but
were OK with the independent approach.

> Or, to keep it short: did you even look at the plot I described?  What's
> actually so wrong with the error corridor you get from gnuplot alone?

My concern here ist that you set the y-offsets of the error lines rather
arbitrary. This approach assumes that the errors are smallest at the
lower end of the xrange, and largest at the upper end, since the error
corridor always opens for larger x. This may be true, if e.g. the errors
are Poisson errors, but it certainly does not have to.

The only honest way, I think, is to calculate a large sample of (a,b)
along the 1-sigma ellipse (or, more precisely, the 1-sigma contour
calculated in a full chi^2 test), and calculate the enveloping curves
for all these resulting functions.

BTW Did you even look at the poster figure I have linked to, where I
have done exactly this (for a small sample, however)?

>> And along this 1-sigma contour a and b walk through
>> the error intervals above.
> 
> So which parameters of the error ellipse did you give above: horizontal
> and vertical section at the center, or overall size of the enclosed,
> axis-aligned bounding box?

All (a,b) at small steps around the ellipse.

> As to using opposite-sign combinations of deviations on the parameters
> --- that's just as bad, but in a different way, since it completely
> underestimates the data errors in the middle of the region.

A first approach then is, just to connect the endpoints of each of these
opposite-sign lines, and use the resulting trapezium as error corridor.
However, this slightly overestimates the errors in the middle.

> The errors you get from gnuplot achieve a much more sensible error band
> with a whole lot less hassle.

This may be true for a moderate number of samples. However, for large
number of samples, as my colleague did, the error corridor is way too
small. By nearly an order of magnitude or so.

-- 
Gruß,
      Ingo

[toc] | [prev] | [next] | [standalone]

#269

From	Hans-Bernhard Bröker <HBBroeker@t-online.de>
Date	2011-04-16 00:03 +0200
Message-ID	<90rtluF8muU1@mid.dfncis.de>
In reply to	#240

On 12.04.2011 11:28, Ingo Thies wrote:

> My concern here ist that you set the y-offsets of the error lines rather
> arbitrary.

Where do you think I did anything arbitrary?  The y offset of the fittet 
line is a result of the fit, including an error.  So constructing the 
error corridor has to include varying the fitted parameter "y offset" 
across it's fitted error range.

> This approach assumes that the errors are smallest at the
> lower end of the xrange, and largest at the upper end, since the error
> corridor always opens for larger x.

Yes, that's what this particular model function does.  That was your 
choice, not mine.  A different model, e.g. a*(x-x0), might yield a 
different behaviour.

> BTW Did you even look at the poster figure I have linked to, where I
> have done exactly this (for a small sample, however)?

For the context of this discussion, that poster figure is really quite 
meaningless.  It has neither the same data, nor the same model function 
as what we've been discussing.

>>> And along this 1-sigma contour a and b walk through
>>> the error intervals above.

>> So which parameters of the error ellipse did you give above: horizontal
>> and vertical section at the center, or overall size of the enclosed,
>> axis-aligned bounding box?
>
> All (a,b) at small steps around the ellipse.

A misunderstanding.  I was asking: where in that ellipse did those 
numbers your quote as errors of a and b come from?

> This may be true for a moderate number of samples. However, for large
> number of samples, as my colleague did, the error corridor is way too
> small. By nearly an order of magnitude or so.

Let's see some actual test case to demonstrate that.

[toc] | [prev] | [next] | [standalone]

#270

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-16 18:06 +0200
Message-ID	<90tt4sFj9bU1@mid.individual.net>
In reply to	#269

Am 2011-04-16 00:03, schrieb Hans-Bernhard Bröker:

> Where do you think I did anything arbitrary?

The assumption that the largest a (i.e. a+a_err) corresponds to the 
largest b (b+b_err) is arbitrary and actually wrong if the a,b are 
anticorrelated. Rather, if b_max=b+b_err is given, the corresponding a' 
is minimizing chi^2 for this condition (see Numerical Recipes in 
C/Fortran, Chapter 15.6). This can be done by following the 
68.3%-probability ("1 sigma") contour until b becomes maximal. If 
anticorrelated, a' will be closer to a-a_err rather than a or even a+a_err.

My own fault just was to pick the wrong degree of freedom (dfree). While 
the goodness-of-fit estimation requires dfree=ndata-npar (=21-2=19 in 
our example), the confidence ellipse for combined (a,b) needs 
dfree=npar=2, and for separate a and b errors, dfree=1. Then the 
ellipses (or, more generally, contours, since they are not necessarily 
perfect ellipses) are smaller than the (a,b) confidence levels, but give 
you the position of an extreme parameter and its corresponding 
counterpart geometrically. If you have a copy of the Numerical Recipes 
in your library, Figure 15.6.4. gives a good illustration for what's 
going on.

After correcting the degrees of freedom and redoing the fit for 
chi^2=ndata-npar, the errors are now nearly identical to those of 
gnuplot, at least, if the error contours are near-elliptical. If they're 
not (i.e. errors are not Gaussian), the errors are larger, but probably 
less meaningful.

> Yes, that's what this particular model function does. That was your
> choice, not mine. A different model, e.g. a*(x-x0), might yield a
> different behaviour.

If the error corridors are calculated by following the confidence 
ellipses, they are indeed independent from the choide of x0, although 
the best-fit a and its errors change.

>> All (a,b) at small steps around the ellipse.
>
> A misunderstanding. I was asking: where in that ellipse did those
> numbers your quote as errors of a and b come from?

The extremal a-values minus a_bestfit are taken as a-errors, and the 
same with b.

-- 
Gruß,
       Ingo

[toc] | [prev] | [next] | [standalone]

#256

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-14 15:07 +0200
Message-ID	<90o9s9F5aiU1@mid.individual.net>
In reply to	#227

On 10.04.2011 19:31, Hans-Bernhard Bröker wrote:

> gnuplot> p 'fit.dat' u 1:2:3 w err, a*x+b w l
> gnuplot> rep (a+a_err)*x+b+b_err, (a-a_err)*x+b-b_err
> gnuplot> a_thies=0.195
> gnuplot> b_thies=.083
> gnuplot> rep (a+a_thies)*x+b+b_thies, (a-a_thies)*x+b-b_thies

I have modified the model function to minimize the correlation of a and
b. This has been done by substituting x by x-x0, where x0 is the mean of
all x (2 in this case). Here is the full script, to be applied to the
same data table ('fit.tab'). To enhance visibility, I have given
different colors for the different error types (blue for gnuplot's, red
for mine). Please look again at the errors. Are they still _way_ too
large? Or aren't gnuplot's errors now _way_ too small?

set fit errorvar
#x0=0.; a_thies=0.195; b_thies=0.083 # old fit
x0=2.; a_thies=0.101; b_thies=0.083 # new fit
f(x)=a+b*(x-x0)
set xrange [0:4]
set yrange [0.5:3]
fit [0:4] f(x) xfile u 1:2:3 via a,b
set key left
p a+b*(x-x0) w l lc rgb '#0000ff',\
  a-a_err+(b-b_err)*(x-x0) lc rgb '#007fff',\
  a+a_err+(b+b_err)*(x-x0) lc rgb '#007fff',\
  a-a_thies+(b-b_thies)*(x-x0) lc rgb '#ff0000',\
  a+a_thies+(b+b_thies)*(x-x0) lc rgb '#ff0000',\
  'fit.dat' u 1:2:3 w err lc rgb '#000000'

I would find it highly confusing if the error corridors depend on a
simple substitution. If one plotts the a,b error contours, and uses them
for the error estimate (i.e. walks around the 1-sigma contour in the a-b
space and creates the enveloping curve for all possible resulting
functions), then the "way-too-large" error for the slope b remains
constant. Only the error for the offset a changes, as expected.

-- 
Gruß,
      Ingo

[toc] | [prev] | [next] | [standalone]

#265

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-15 13:47 +0200
Message-ID	<90qpj4FrbmU1@mid.individual.net>
In reply to	#227

On 10.04.2011 19:31, Hans-Bernhard Bröker wrote:

>> a = 9.482095E-01 -/+-1.947044E-01, 1.947043E-01
>> b = 4.270960E-01 -/+-8.327496E-02, 8.327495E-02
> 
> Those errors are _way_ too big.  Do yourself a favour and plot your data
> along with the model, using parameters modified by those errors:

I have now found the reason for the large errorbars. I simply had used
the wron degree of freedom for calculating the 68.3% probability ("1
sigma") contours in the (a,b) space. It should be 2 (2 params) for the
contours, and even only 1 for a single contouf (cf Numerical Recipes
chapter 15.6), but I used ndata-2, which is only valid for the
goodness-of-fit estimate. Now the errors are about the same as gnuplot's
given a good fit (i.e. \chi^2 \approx ndata-2).

Gnuplot seems to assume a good fit always, so the a-b errors do not vary
if all sigma(y_i) are scaled by the same factor. Maybe this is a
reasonable assumption since otherwise the whole fit is in question.

I hope I could repair a bit of the confusion I caused with this thread ;-)

-- 
Gruß,
      Ingo

[toc] | [prev] | [next] | [standalone]

#226

From	Hans-Bernhard Bröker <HBBroeker@t-online.de>
Date	2011-04-10 19:04 +0200
Message-ID	<4DA1E330.20903@t-online.de>
In reply to	#223

On 08.04.2011 13:05, Ingo Thies wrote:

> I am trying to find out how exactly gnuplot calculates the covariance
> matrix for a fitted function.

It does it basically the same way any simple fitting program does it: 
directly from the covariance matrix.  What you're looking at are the 
roots of the diagonal elements of the covariance matrix.

The only major difference is that gnuplot doesn't treat the data errors 
as more believable than the actual distance between model and data. 
This is done by normalizing the covariances by reduced chisquare, 
effectively reducing the meaning of the data errors to that of _weights_ 
for individual data points.

So you get the same parameter errors (and correlations) from multiple 
fits to the same data, but with re-scaled errors:

	fit f(x) 'foo.dat' u 1:2:($3) via ...	
	fit f(x) 'foo.dat' u 1:2:($3*100) via ...
	fit f(x) 'foo.dat' u 1:2:($3/100) via ...

all yield the same fit.

> The background: I am still mistrusting the error ellipses one gets from
> the eigenvalues. As already mentioned last year (aroung May or so), I
> suspect that the resulting error contours are underestimated since the
> best-fit chi^2 is assumed to be zero there,but it isn't.

Nobody expects chi^2 to be zero.  The expectation is for the reduced 
chisquare (chisq/ndf) to be about 1.0.

> I have compared the error ellipses with the true error contours from an
> independent chi^2-2D-mapping routine, and found that the latter may be
> considerably larger than the ones one would get from gnuplot's cov matrix.

Just in case: please note gnuplot's output is a correlation matrix, not 
a covariance matrix.

[toc] | [prev] | [next] | [standalone]

#235

From	Ingo Thies <ingo.thies@gmx.de>
Date	2011-04-11 22:13 +0200
Message-ID	<90h5nvFh1hU1@mid.individual.net>
In reply to	#226

Am 2011-04-10 19:04, schrieb Hans-Bernhard Bröker:

> Nobody expects chi^2 to be zero. The expectation is for the reduced
> chisquare (chisq/ndf) to be about 1.0.

You might have misunderstood me. If you want to calculate the 1-sigma 
ellipse from the actual chi^2 in the (a,b) map (this cannot be done with 
gnuplot yet, but with a simple independent program), one has to 
transform the chi^2 into the exclusion probability (via the incomplete 
gamma function and the degree of freedom), and then transform this into 
sigma via the (inverse) error function.

But before you throw the chi^2 into this algorithm, you have to 
substract the chi_min^2, the chi^2 of the best-fit (a,b). I.e. "0 sigma" 
corresponds to chi^2=chi_min^2, not 0. Otherwise you get too small error 
contours, and for bad fits, the 1-sigma contour might not even exist.

I have to admit that I had indeed neglected to do this correction in 
Thies & Kroupa (2007), Figure 9. Plotted there are the 95% and 99% (i.e. 
2 and 2.6 sigma) contours for the fitted parameters of four initial mass 
functions (IMF). The fit to the Taurus IMF was relatively poor, with the 
probability being >95%, making the 2-sigma contour disappear without the 
correction mentioned above (however, the main result is not affected by 
this lapsus). This shows the consequence of missing this correction.

Here the paper on arxiv: <http://arxiv.org/abs/0708.1764>

What has all this to do with gnuplot?

Well, if the errors of the data are near-Gaussian, you can approximate 
the error contour in the (a,b) space by an ellipse. The axis vectors of 
this ellipse are, as far as I remember, the eigenvectors of the 
covariance matrix which is closely related to the correlation matrix by 
multiplying/dividing with/by the product of the standard errors of a and b.

However, the error corridor you get from walking around the error 
ellipse and picking (a,b) and use it in f(x), is too narrow. As far as I 
remember, it is similar to the 1-sigma (a,b) error contour if chi_min^2 
is not subtracted. This, the other way round, corresponds to too small 
a,b standard errors.


-- 
Gruß,
       Ingo

[toc] | [prev] | [standalone]

csiph-web

Fitting: How does gnuplot calculate the covariance matrix?

Contents

#223 — Fitting: How does gnuplot calculate the covariance matrix?

#224

#227

#228

#257

#234

#236

#240

#269

#270

#256

#265

#226

#235