This sort of problem can often be approached from a Bayesian point of view
with a result that is a bit more intuitive.
The basic idea for this is that the data are measurements that come from
some process that is parameterized. These parameters are sampled from some
very nonspecific distribution. The question then is: given what we have
observed, what can we conclude about how probable the different parameter
values are? This leads to a very natural definition of confidence bounds
and estimation. The entire approach is anathema to some, but it makes lots
of intuitive sense vis-à-vis how normal humans use probability as a
concept, and it has very deep mathematical roots.
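To make the idea concrete, here is a minimal sketch in Python. The setup
(a coin of unknown bias, a flat prior, 7 heads in 10 flips), the grid size
and the function names are all assumptions chosen for illustration; nothing
here comes from the thread itself.

```python
def grid_posterior(heads, flips, n_grid=1000):
    """Posterior over a coin's heads probability on a grid, flat prior."""
    # Midpoints of n_grid equal-width cells covering (0, 1).
    thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
    # Flat ("very nonspecific") prior: every value gets the same weight.
    prior = [1.0] * n_grid
    tails = flips - heads
    # Prior times likelihood of the observed flips, then normalize.
    unnorm = [p * t ** heads * (1.0 - t) ** tails
              for p, t in zip(prior, thetas)]
    total = sum(unnorm)
    return thetas, [u / total for u in unnorm]


def central_interval(thetas, post, mass=0.95):
    """Grid points bounding roughly `mass` of the posterior, equal tails."""
    tail = (1.0 - mass) / 2.0
    cum = 0.0
    lo = hi = None
    for t, p in zip(thetas, post):
        cum += p
        if lo is None and cum >= tail:
            lo = t
        if hi is None and cum >= 1.0 - tail:
            hi = t
    return lo, hi


thetas, post = grid_posterior(heads=7, flips=10)
post_mean = sum(t * p for t, p in zip(thetas, post))
lo, hi = central_interval(thetas, post)
print("posterior mean:", post_mean)
print("95% interval:", lo, hi)
```

The interval printed at the end is the "natural definition of confidence
bounds" in this setting: it is read directly off the posterior.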
The philosophical problem can be illustrated easily. If I flip a coin and
hold it in my closed hand, you and I would both declare the probability of
heads to be 0.5 even though the physics of the situation makes it clear that
the coin has a single state that simply happens to be unknown to us. If I
peek in my hand and we estimate the probabilities again, you will still say
0.5 and I will say 0 or 1, but definitely will not say 0.5. The only
change has been that I have gained information and thus probability as we
have been using it is clearly a subjective concept. Further, it is
admissible to use a probability distribution to describe a physical process
that actually only has a single value.
This can be extended to some more complex measurement of a physical state
where I cannot as easily open my hand. In such a system, each measurement
that we make decreases our uncertainty about the unknown state, but does
not necessarily eliminate that uncertainty. Treating that unknown state as
having a distribution makes no statement about whether the state has a
single value. Instead, it merely allows us to quantify our own state of
ignorance or, more hopefully, our knowledge.
One additional important point: regardless of whether we want to admit a
definition of probability as a measure of subjective knowledge, it can be
shown that under the very weak constraint of exchangeability we can behave
*as if* this were true and get optimal estimates that can be framed in
terms acceptable to frequentists who do not accept probability as
subjective.
Operationally, this leaves us with the question of how to implement this.
Whether the implementation involves sampling or not has no bearing on
whether it is correct. If sampling is convenient computationally to
provide numerical estimates, then so be it.
Likewise, if sampling is convenient for the purposes of testing an approach
to see how well the resulting estimates conform to something that we know
to be true, then sampling is a great thing.
These two kinds of sampling are separate questions from each other and
separate questions from how various kinds of estimates are computed and
what they mean.
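The second kind of sampling, checking an estimator against a known truth,
might look like the following sketch. The choice of estimator (a plain
sample mean), the parameter values and the function name are assumptions
made for illustration only.

```python
import random
import statistics


def spread_of_sample_mean(true_mean, noise_sd, n_obs, n_trials, seed=0):
    """Repeat an experiment many times and summarize the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        # One synthetic experiment drawn from a process we fully control.
        data = [rng.gauss(true_mean, noise_sd) for _ in range(n_obs)]
        estimates.append(statistics.mean(data))
    # Average estimate and how much the estimate varies between trials.
    return statistics.mean(estimates), statistics.stdev(estimates)


avg, spread = spread_of_sample_mean(
    true_mean=3.0, noise_sd=2.0, n_obs=25, n_trials=5000)
expected_spread = 2.0 / 25 ** 0.5  # sigma / sqrt(n)
print("average estimate:", avg)
print("observed spread:", spread, "expected:", expected_spread)
```

Here the sampling says nothing about how the estimate is computed; it only
tests whether the estimate behaves the way theory says it should.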
So that is where I come from. Now to the problem at hand.
The problem of least squares fitting can be described as estimating the
parameters of a data generation process given observations.
The data generation process in question has a linear relationship between
the predictor (independent) variables and the target (dependent) variable.
In addition to this linear relationship, there is additive Gaussian noise
of unknown magnitude that perturbs the ideal value of the target variable
to give the observed value. Generally, we have few preconceived notions
about either the linear process or the noise process, but in some cases it
is useful to introduce domain knowledge here as a form of regularization.
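The data generation process just described, and the least squares estimate
of its parameters, can be sketched for a single predictor as follows. The
true intercept, slope and noise level are made-up values, and the function
names are hypothetical; no regularization is applied.

```python
import random


def simulate(xs, intercept, slope, noise_sd, rng):
    """Ideal linear value plus additive Gaussian noise for each x."""
    return [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    return y_bar - slope * x_bar, slope


rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = simulate(xs, intercept=1.0, slope=2.0, noise_sd=0.5, rng=rng)
a_hat, b_hat = fit_line(xs, ys)
print("estimated intercept and slope:", a_hat, b_hat)
```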
To relate this formulation to commonly used terminology, the accuracy of
our estimate of the linear process is referred to as "standard error" and
the magnitude of the noise process is referred to as "standard deviation".
The accuracy of our estimate is given by the width of the posterior
distribution of the parameters of the linear process, and the magnitude of
the noise process is well described by the posterior mean of the noise
parameter. For Gaussian noise processes, the values in formulae (34) and
(35) of the MathWorld page cited below are useful estimates of the former
in the absence of regularization.
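As a sketch of what those quantities look like for a straight-line fit,
the following Python computes the residual scale s and the usual textbook
standard errors of the intercept and slope. I am assuming these are the
expressions that formulae 34 and 35 refer to; the function and variable
names are illustrative only.

```python
import math


def line_fit_with_errors(xs, ys):
    """Least squares line y = a + b*x with textbook standard errors."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points to estimate the noise")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    # Noise magnitude estimated from residuals, n - 2 degrees of freedom.
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(ss_res / (n - 2))
    # Standard errors of the intercept and of the slope.
    se_a = s * math.sqrt(1.0 / n + x_bar ** 2 / ss_xx)
    se_b = s / math.sqrt(ss_xx)
    return a, b, s, se_a, se_b


a, b, s, se_a, se_b = line_fit_with_errors(
    [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 1.0, 2.0])
print("a =", a, "+/-", se_a)
print("b =", b, "+/-", se_b)
print("noise estimate s =", s)
```

Note that s estimates the magnitude of the noise, while se_a and se_b
describe how well the line itself is pinned down.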
For multivariate problems, it can be very dangerous to estimate the
covariance matrix, or its inverse, by maximum likelihood, since there is an
excessive chance of catastrophically bad estimates. There is an extensive
literature on Bayesian approaches to this problem.
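A tiny illustration of how badly this can go: with only two observations
in two dimensions, the maximum likelihood covariance is singular and has no
inverse at all. The shrinkage step below is a deliberately crude stand-in
for the prior information a Bayesian treatment would bring, not a Bayesian
method in itself; the data, the shrinkage weight and the function names are
assumptions for illustration.

```python
def mle_covariance_2d(points):
    """Maximum likelihood (divide by n) covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return [[cxx, cxy], [cxy, cyy]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def shrink_to_diagonal(cov, weight):
    """Pull the off-diagonal term toward zero by `weight` in [0, 1]."""
    off = (1.0 - weight) * cov[0][1]
    return [[cov[0][0], off], [off, cov[1][1]]]


points = [(0.0, 0.0), (2.0, 4.0)]
cov = mle_covariance_2d(points)          # [[1, 2], [2, 4]]
shrunk = shrink_to_diagonal(cov, 0.5)    # [[1, 1], [1, 4]]
print("determinant of MLE estimate:", det2(cov))
print("determinant after shrinkage:", det2(shrunk))
```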
I hope that this description doesn't rub folks the wrong way for being too
elementary. I thought it might help to get basic terms in the open since
it sounds like there are unstated assumptions in the current discussion.
On Sun, May 6, 2012 at 6:15 AM, Dimitri Pourbaix
<pourbaix@astro.ulb.ac.be> wrote:
> Sebastien,
>
>> Hi Dimitri,
>> I'm obviously missing something in my literature review. I did a new
>> MC simulation, with a much smaller number of observation points
>> (namely 3, to fit a straight line!!!). It turns out that the formula
>> you are advocating for is the best estimate of the standard deviation
>> of the parameters. Could you please explain why this formula differs
>> from formulas (34) and (35) in
>> http://mathworld.wolfram.com/LeastSquaresFitting.html
>> ?
>>
>
> First thing worth noting is Wolfram is wise enough to call 34 and 35
> standard error ... and not standard deviation!
>
> As Gilles and you have shown with your MC simulations, the standard
> deviation (sigma_i=sqrt(cov[i][i])) approximates by how much the fitted
> parameter can vary when several sets of 'observations' are sampled with
> the same error distribution. I wrote 'approximate' because the true
> standard deviation is not accessible, instead it is approximated as the
> inverse of Fisher information matrix which is directly related to the
> Hessian matrix. The relation between Fisher and the variance of the
> parameter is known as the Rao-Cramer bound.
>
> In the case of the standard error, the sample of observations is fixed
> and one wonders by how much one can change the parameters without
> changing the resulting normalized chi square too much. That is the
> role of s (eq. 32 on Wolfram). It should be noted that nowhere on
> that page there is the notion of error on the observations: the data
> are what they are and no alternative sampling should be considered.
>
> Please, have a look at
>
> http://en.wikipedia.org/wiki/Standard_deviation
> http://en.wikipedia.org/wiki/Standard_error
>
> for further details, especially the last section of the Standard_error
> page as it compares std. error and deviation.
>
> Regards,
> Dim.
> ****
> 
> Dimitri Pourbaix * Don't worry, be happy
> Institut d'Astronomie et d'Astrophysique * and CARPE DIEM.
> CP 226, office 2.N4.211, building NO *
> Universite Libre de Bruxelles * Tel : +322650.35.71
> Boulevard du Triomphe * Fax : +322650.42.26
> B1050 Bruxelles * NAC: HBZSC RG2Z6
> http://sb9.astro.ulb.ac.be/~pourbaix * mailto:pourbaix@astro.ulb.ac.be
>
