commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Al Chou <hotfusion...@yahoo.com>
Subject Re: [math] UnivariateImpl - when sumsq ~ xbar*xbar*((double) n)
Date Tue, 03 Jun 2003 13:39:59 GMT
--- mdiggory@latte.harvard.edu wrote:
> Phil Steitz wrote:
> > Phil Steitz wrote:
> > 
> >> mdiggory@latte.harvard.edu wrote:
> >>
> >>> Phil Steitz wrote:
> >>>
> >>>> Since xbar = sum/n, the change has no impact on the which sums are 
> >>>> computed or squared. Instead of (sum/n)*(sum/n)*n your change just 
> >>>> computes sum**2/n.  The difference is that you are a) eliminating 
> >>>> one division by n and one multiplication by n (no doubt a good 
> >>>> thing) and b) replacing direct multiplication with pow(-,2). The 
> >>>> second of these used to be discouraged, but I doubt it makes any 
> >>>> difference with modern compilers.  I would suggest collapsing the 
> >>>> denominators and doing just one cast -- i.e., use
> >>>>
> >>>> (1) variance = sumsq - sum * (sum/(double) (n * (n - 1)))
> >>>>
> >>>> If
> >>>>
> >>>> (2) variance = sumsq - (sum * sum)/(double) (n * (n - 1))) or
> >>>>
> >>>> (3) variance = sumsq - Math.pow(sum,2)/(double) (n * (n - 1))) give
> >>>>
> >>>> better accuracy, use one of them; but I would favor (1) since it 
> >>>> will be able to handle larger positive sums.
> >>>>
> >>>> I would also recommend forcing getVariance() to return 0 if the 
> >>>> result is negative (which can happen in the right circumstances for

> >>>> any of these formulas).
> >>>>
> >>>> Phil
> >>>
> >>>
> >>>
> >>>
> >>> collapsing is definitely good, but I'm not sure about these 
> >>> equations, from my experience, approaching (2) would look something 
> >>> more like
> >>>
> >>> variance = (((double)n)*sumsq - (sum * sum)) / (double) (n * (n - 1));
> >>>
> >>> see (5) in http://mathworld.wolfram.com/k-Statistic.html
> >>
> >>
> >>
> >> That formula is the formula for the 2nd k-statistic, which is *not* 
> >> the same as the sample variance.  The standard formula for the sample 
> >> variance is presented in equation (3) here: 
> >> http://mathworld.wolfram.com/SampleVariance.html or in any elementary 
> >> statistics text. Formulas (1)-(3) above (and the current 
> >> implementation) are all equivalent to the standard defintion.
> >>
> >> What you have above is not.  The relation between the variance and the 
> >> second k-statistic is presented in (9) on 
> >> http://mathworld.wolfram.com/k-Statistic.html
> > 
> > 
> > I just realized that I misread Wolfram's definitions. What he is 
> > defining as the 2nd k-statistic is the correct formula for the sample 
> > variance.  I am also missing some parenthesis above.  Your formula is 
> > correct.  Sorry.
> > 
> > Phil
> > 
> 
> No problem, I just wanted to make sure we're all on the same page, there are 
> always many times I am wrong too. 
> 
> Thanks for double checking,
> -Mark



Uh, did we drop the idea of using the "corrected two-pass" algorithm for the
variance in the non-rolling case?  I excerpted that thread below.

Al


Date: Mon, 26 May 2003 18:28:45 -0700 (PDT)
From: Al Chou <hotfusionman@yahoo.com>
Subject: [math] rolling formulas for statistics (was RE: Greetings from a
newcomer (b

--- Phil Steitz <phil@steitz.com> wrote:
> Al Chou wrote:
> > --- Phil Steitz <phil@steitz.com> wrote:
> > 
> >>Brent Worden wrote:
> >>>>Mark R. Diggory wrote:
> >>>There are easy formulas for skewness and kurtosis based on the central
> >>>moments which could be used for the stored, univariate implementations:
> >>>http://mathworld.wolfram.com/Skewness.html
> >>>http://mathworld.wolfram.com/Kurtosis.html
> >>>
> >>>As for the rolling implementations, there might be some more research
> >>>involved before using this method because of their memoryless property. 
> >>
> >>But
> >>>for starters, the sum and sumsq can easily be replaced with there central
> >>>moment counterparts, mean and variance. There are formulas that update
> >>those
> >>>metrics when a new value is added.  Weisberg's "Applied Linear Regression"
> >>>outlines two such updating formulas for mean and sum of squares which are
> >>>numerically superior to direct computation and the raw moment methods.
> >>
> >>Why exactly, are these numerically superior?  For what class of 
> >>examples? Looks like lots more operations to me, especially in the 
> >>UnivariateImpl case where the mean, variance are computed only when 
> >>demanded -- i.e., there is no requirement to generate mean[0], 
> >>mean[1],...etc.  I understand that adding (or better, swapping) 
> >>operations can sometimes add precision, but I am having a hard time 
> >>seeing exactly where the benefit is in this case, especially given the 
> >>amount of additional computation required.
> > 
> > 
> > _Numerical Recipes in C_, 2nd ed. p. 613
> > (http://lib-www.lanl.gov/numerical/bookcpdf/c14-1.pdf) explains:
> > "Many textbooks use the binomial theorem to expand out the definitions [of
> > statistical quantities] into sums of various powers of the data, ...[,] but
> > this can magnify the roundoff error by a large factor... .  A clever way to
> > minimize roundoff error, especially for large samples, is to use the
> corrected
> > two-pass algorithm [1]: First calculate x[bar, the mean of x], then
> calculate
> > [formula for variance in terms of x - xbar.] ..."
> > 
> > [1] "Algorithms for Computing the Sample Variance: Analysis and
> > Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. 1983, American
> > Statistician, vol. 37, pp. 242?247.
> 
> Thany you, Al!
> 
> I am convinced by this for that at least for the variance and higher 
> order moments, whenever we actually store the values (that would include 
> UnivariateImpl with finite window), we should use the "corrected one [sic]
> pass" formula (14.1.8) from 
> http://lib-www.lanl.gov/numerical/bookcpdf/c14-1.pdf.  It is a clever 
> idea to explicitly correct for error in the mean.

=====
Albert Davidson Chou

    Get answers to Mac questions at http://www.Mac-Mgrs.org/ .

__________________________________
Do you Yahoo!?
Yahoo! Calendar - Free online calendar with sync to Outlook(TM).
http://calendar.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Mime
View raw message