 mdiggory@latte.harvard.edu wrote:
> Phil Steitz wrote:
> > Phil Steitz wrote:
> >
> >> mdiggory@latte.harvard.edu wrote:
> >>
> >>> Phil Steitz wrote:
> >>>
> >>>> Since xbar = sum/n, the change has no impact on which sums are
> >>>> computed or squared. Instead of (sum/n)*(sum/n)*n your change just
> >>>> computes sum**2/n. The difference is that you are a) eliminating
> >>>> one division by n and one multiplication by n (no doubt a good
> >>>> thing) and b) replacing direct multiplication with pow(,2). The
> >>>> second of these used to be discouraged, but I doubt it makes any
> >>>> difference with modern compilers. I would suggest collapsing the
> >>>> denominators and doing just one cast  i.e., use
> >>>>
> >>>> (1) variance = sumsq - sum * (sum/(double) (n * (n - 1)))
> >>>>
> >>>> If
> >>>>
> >>>> (2) variance = sumsq - (sum * sum)/(double) (n * (n - 1)), or
> >>>>
> >>>> (3) variance = sumsq - Math.pow(sum,2)/(double) (n * (n - 1)) give
> >>>>
> >>>> better accuracy, use one of them; but I would favor (1) since it
> >>>> will be able to handle larger positive sums.
> >>>>
> >>>> I would also recommend forcing getVariance() to return 0 if the
> >>>> result is negative (which can happen in the right circumstances for
> >>>> any of these formulas).
> >>>>
> >>>> Phil
> >>>
> >>>
> >>>
> >>>
> >>> collapsing is definitely good, but I'm not sure about these
> >>> equations, from my experience, approaching (2) would look something
> >>> more like
> >>>
> >>> variance = (((double)n)*sumsq - (sum * sum)) / (double) (n * (n - 1));
> >>>
> >>> see (5) in http://mathworld.wolfram.com/kStatistic.html
> >>
> >>
> >>
> >> That formula is the formula for the 2nd k-statistic, which is *not*
> >> the same as the sample variance. The standard formula for the sample
> >> variance is presented in equation (3) here:
> >> http://mathworld.wolfram.com/SampleVariance.html or in any elementary
> >> statistics text. Formulas (1)-(3) above (and the current
> >> implementation) are all equivalent to the standard definition.
> >>
> >> What you have above is not. The relation between the variance and the
> >> second k-statistic is presented in (9) on
> >> http://mathworld.wolfram.com/kStatistic.html
> >
> >
> > I just realized that I misread Wolfram's definitions. What he is
> > defining as the 2nd k-statistic is the correct formula for the sample
> > variance. I am also missing some parentheses above. Your formula is
> > correct. Sorry.
> >
> > Phil
> >
>
> No problem, I just wanted to make sure we're all on the same page; I am
> often wrong myself.
>
> Thanks for double checking,
> Mark
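For reference, the formula the thread converges on (Mark's one-pass expression, plus the clamp to zero Phil suggested earlier, since floating-point cancellation can make the result slightly negative) can be sketched in Java. This is a hypothetical illustration, not the actual Commons Math code; the names n, sum, and sumsq follow the thread:

```java
// Hypothetical sketch of the one-pass sample variance from accumulated
// sum and sum of squares:  s^2 = (n*sumsq - sum^2) / (n*(n - 1)).
// Not the actual Commons Math implementation.
final class OnePassVariance {
    private long n = 0;
    private double sum = 0.0;
    private double sumsq = 0.0;

    public void addValue(double x) {
        n++;
        sum += x;
        sumsq += x * x;
    }

    public double getVariance() {
        if (n < 2) {
            return Double.NaN; // sample variance is undefined for n < 2
        }
        double variance = (n * sumsq - sum * sum) / ((double) n * (n - 1));
        // Cancellation can push the result slightly below zero; clamp it,
        // as suggested earlier in the thread.
        return Math.max(0.0, variance);
    }
}
```

For large samples with a mean far from zero, sumsq and sum*sum/n can be nearly equal, which is exactly the cancellation problem the rest of the thread addresses.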
Uh, did we drop the idea of using the "corrected two-pass" algorithm for the
variance in the non-rolling case? I excerpted that thread below.
Al
Date: Mon, 26 May 2003 18:28:45 -0700 (PDT)
From: Al Chou <hotfusionman@yahoo.com>
Subject: [math] rolling formulas for statistics (was RE: Greetings from a
newcomer (b
 Phil Steitz <phil@steitz.com> wrote:
> Al Chou wrote:
> >  Phil Steitz <phil@steitz.com> wrote:
> >
> >>Brent Worden wrote:
> >>>>Mark R. Diggory wrote:
> >>>There are easy formulas for skewness and kurtosis based on the central
> >>>moments which could be used for the stored, univariate implementations:
> >>>http://mathworld.wolfram.com/Skewness.html
> >>>http://mathworld.wolfram.com/Kurtosis.html
> >>>
> >>>As for the rolling implementations, there might be some more research
> >>>involved before using this method because of their memoryless property.
> >>
> >>>But for starters, the sum and sumsq can easily be replaced with their central
> >>>moment counterparts, mean and variance. There are formulas that update
> >>those
> >>>metrics when a new value is added. Weisberg's "Applied Linear Regression"
> >>>outlines two such updating formulas for mean and sum of squares which are
> >>>numerically superior to direct computation and the raw moment methods.
> >>
> >>Why, exactly, are these numerically superior? For what class of
> >>examples? Looks like lots more operations to me, especially in the
> >>UnivariateImpl case where the mean, variance are computed only when
> >>demanded  i.e., there is no requirement to generate mean[0],
> >>mean[1],...etc. I understand that adding (or better, swapping)
> >>operations can sometimes add precision, but I am having a hard time
> >>seeing exactly where the benefit is in this case, especially given the
> >>amount of additional computation required.
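The memoryless updating formulas Brent describes (Weisberg's mean and sum-of-squares updates) are commonly written in Welford's recurrence form. A hypothetical sketch, again not the actual Commons Math code:

```java
// Welford-style updating recurrences: each new value updates the mean
// and the sum of squared deviations (m2) directly, without storing the
// data or accumulating raw sums. Hypothetical illustration only.
final class RollingMoments {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the mean

    public void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;        // updated mean
        m2 += delta * (x - mean); // uses both the old and the new mean
    }

    public double getMean() {
        return (n == 0) ? Double.NaN : mean;
    }

    public double getVariance() {
        return (n < 2) ? Double.NaN : m2 / (n - 1);
    }
}
```

The extra operations per value buy numerical stability: m2 accumulates small deviations rather than large raw sums, so the subtraction of nearly equal quantities never occurs.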
> >
> >
> > _Numerical Recipes in C_, 2nd ed. p. 613
> > (http://libwww.lanl.gov/numerical/bookcpdf/c141.pdf) explains:
> > "Many textbooks use the binomial theorem to expand out the definitions [of
> > statistical quantities] into sums of various powers of the data, ...[,] but
> > this can magnify the roundoff error by a large factor... . A clever way to
> > minimize roundoff error, especially for large samples, is to use the
> > corrected two-pass algorithm [1]: First calculate x[bar, the mean of x],
> > then calculate [formula for variance in terms of x - xbar.] ..."
> >
> > [1] "Algorithms for Computing the Sample Variance: Analysis and
> > Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. 1983, American
> > Statistician, vol. 37, pp. 242-247.
>
> Thank you, Al!
>
> I am convinced by this that, at least for the variance and higher-order
> moments, whenever we actually store the values (that would include
> UnivariateImpl with a finite window), we should use the "corrected one [sic]
> pass" formula (14.1.8) from
> http://libwww.lanl.gov/numerical/bookcpdf/c141.pdf. It is a clever
> idea to explicitly correct for error in the mean.
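For the stored-values case, the corrected two-pass algorithm (Numerical Recipes eq. 14.1.8) can be sketched as follows; hypothetical code, not the Commons Math implementation:

```java
// Corrected two-pass variance (Numerical Recipes eq. 14.1.8):
// pass one computes the mean; pass two accumulates both
// sum((x - mean)^2) and sum(x - mean). The latter would be exactly
// zero in exact arithmetic, so it measures the round-off error in
// the computed mean and supplies the correction term.
// Hypothetical sketch only.
final class TwoPassVariance {
    public static double evaluate(double[] values) {
        int n = values.length;
        if (n < 2) {
            return Double.NaN; // sample variance needs at least two values
        }
        double sum = 0.0;
        for (double x : values) {
            sum += x;
        }
        double mean = sum / n;

        double sumDevSq = 0.0;
        double sumDev = 0.0; // round-off residual of the computed mean
        for (double x : values) {
            double dev = x - mean;
            sumDev += dev;
            sumDevSq += dev * dev;
        }
        // Subtract the correction term (sum of deviations)^2 / n.
        return (sumDevSq - sumDev * sumDev / n) / (n - 1);
    }
}
```

This requires two passes over the data, which is why it only applies where the values are actually stored, as Phil notes above.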
=====
Albert Davidson Chou
Get answers to Mac questions at http://www.MacMgrs.org/ .