# commons-dev mailing list archives

##### Site index · List index
Message view
Top
From "Phil Steitz" <p...@steitz.com>
Subject Re: [math] rolling formulas for statistics (was RE: Greetings from a newcomer (but not to numerics))
Date Mon, 26 May 2003 15:03:42 GMT
```Al Chou wrote:
> --- Phil Steitz <phil@steitz.com> wrote:
>
>>Brent Worden wrote:
>>
>>>>In fact the higher the moment you calculate (variance, skew, kurtosis)
>>>>on the below set of numbers, the greater the loss of precision, this is
>>>>because for values less than one
>>>>
>>>>sumquad << sumcube << sumsq << sum
>>>>
>>>>-Mark
>>>>
>>>>Mark R. Diggory wrote:
>>>
>>>
>>>And when the values are greater than one, we run the additional risk of
>>>overflow.  Also, because sumsq and n*xbar^2 are relatively large and
>>>relatively equal, subtracting the two, as done in computing variance,
>>>results in loss of precision as well.
>>>
>>>One possible way to limit these problems it by using central moments in
>>
>>lieu
>>
>>>of raw moments.  Since central moments are expected values, they tend to
>>>converge to a finite value as the sample size increases.  They only time
>>>they wouldn't converge is when the data is drawn from a distribution where
>>>those higher moments don't exist.
>>>
>>>There are easy formulas for skewness and kurtosis based on the central
>>>moments which could be used for the stored, univariate implementations:
>>>http://mathworld.wolfram.com/Skewness.html
>>>http://mathworld.wolfram.com/Kurtosis.html
>>>
>>>As for the rolling implementations, there might be some more research
>>>involved before using this method because of their memoryless property.
>>
>>But
>>
>>>for starters, the sum and sumsq can easily be replaced with there central
>>>moment counterparts, mean and variance. There are formulas that update
>>
>>those
>>
>>>metrics when a new value is added.  Weisberg's "Applied Linear Regression"
>>>outlines two such updating formulas for mean and sum of squares which are
>>>numerically superior to direct computation and the raw moment methods.
>>>
>>
>>Why exactly, are these numerically superior?  For what class of
>>examples? Looks like lots more operations to me, especially in the
>>UnivariateImpl case where the mean, variance are computed only when
>>demanded -- i.e., there is no requirement to generate mean[0],
>>mean[1],...etc.  I understand that adding (or better, swapping)
>>operations can sometimes add precision, but I am having a hard time
>>seeing exactly where the benefit is in this case, especially given the
>
>
> _Numerical Recipes in C_, 2nd ed. p. 613
> (http://lib-www.lanl.gov/numerical/bookcpdf/c14-1.pdf) explains:
> "Many textbooks use the binomial theorem to expand out the definitions [of
> statistical quantities] into sums of various powers of the data, ...[,] but
> this can magnify the roundoff error by a large factor... .  A clever way to
> minimize roundoff error, especially for large samples, is to use the corrected
> two-pass algorithm [1]: First calculate x[bar, the mean of x], then calculate
> [formula for variance in terms of x - xbar.] ..."
>
> [1] "Algorithms for Computing the Sample Variance: Analysis and
> Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. 1983, American
> Statistician, vol. 37, pp. 242?247.
>
> Unfortunately, Google could not find the above article online, though it is
> cited by articles in CiteSeer, which led to some more generally interesting
> resources listed below.  The corrected two-pass algorithm is only given for a
> non-rolling dataset in _NR_, though Brent may have derived a rolling version in
> his last post in this thread (I'm not sure at a glance, I'd have to get out
> pencil and paper to check).
>
> * "Numerical Analysis", Gordon K. Smyth, May 1997
> http://citeseer.nj.nec.com/91928.html
> Abstract:  This article discusses the basic ideas of accuracy and describes
> briefly key topics in numerical analysis which are treated at more depth in
> separate articles.  Pointers to available software are given at the end of the
> article.
>
> * Bibliography of Accuracy and Stability of Numerical Algorithms, ...
> http://www.ma.man.ac.uk/~higham/asna/asna2.pdf
>

Thany you, Al!

I am convinced by this for that at least for the variance and higher
order moments, whenever we actually store the values (that would include
UnivariateImpl with finite window), we should use the "corrected one
pass" formula (14.1.8) from
http://lib-www.lanl.gov/numerical/bookcpdf/c14-1.pdf.  It is a clever
idea to explicitly correct for error in the mean.

This could easily be patched into AbstractStoreUnivariate.  We will have
to think a little bit about how to incorporate it into UnivariateImpl
for just the finite window case.

One point on using _NR_.  I am OK with grabbing formula (14.1.8), since
there is a reference attributing the source of the formula to an AmStat
article, which means that the _NR_ cannot claim ownership of the
formula.  As long as we include attribution to the AmStat authors, I see
no problem including it.  We need to be careful about lifting anything
directly from _NR_ (or anywhere else for that matter) where we do not
have a "generally available, no explicit license" expression of the
algorithm/formula in question.  My philosophy on this sort of thing is
that 99% of the algorithms of real practical value are in the public
domain and as long as we concentrate on implementing *algorithms*
instead of imitating/translating/stealing code, we will be OK.  For
example, Brent's Cholesky decomp is his own implementation, based on his
understanding of the algorithm so this is OK. Cutting and pasting from
Jama would not be OK (even though Jama is in fact in the public domain).
Similarly for my use of inversion, etc to turn uniform in to
exponential, poission deviates, etc.or anything else that I have
submitted.  Fortunately, almost all of the really useful things in
mathematics were developed before all of this "IP license-mania" stuff
took root. I don't worry too much about the descendents of Gauss,
Lagrange, et al coming after us, despite the fact that their claims
would have *much* more merit than most of what people are extracting
licensing fees for today.

I am still unsure about how the "inductive" formula below would improve
things for the no-memory, infinite window version of univariate. It
still looks like a lot of extra computation to me for questionable gain.
Note that to emulate the formula above, you might want to (after some
predetermined initial number of iterates) subtract the (changing)
correction factor from (14.1.8).  This would certainly have an
interesting impact on the mean estimate :-).  Might be a good idea to
dig up the AmStat article and see what impact this would have on the
arguments there.  In any case, I would appreciate any ideas that you
have on how exactly the inductive sums below would improve accuracy in
the no-memory case.

Thanks for looking into this.

Phil

>
>
>>>mean[0] = 0
>>>mean[m + 1] = mean[m] + ((1 / (m + 1)) * (x[m + 1] - mean[m]))
>>>
>>>ss[0] = 0
>>>ss[m + 1] = ss[m] + ((m / (m + 1)) * (x[m + 1] - mean[m])^2)
>>>
>>>where mean[m] is the mean for the first m values
>>>      x[m] is the m-th value
>>>  and ss[m] is the sum of squares for the first m values
>>>
>>>The sum of squares formula could then be used to derive a similar formula
>>>for variance:
>>>
>>>var[0] = 0.0
>>>var[m + 1] = ((m - 1) / m) * var[m] + ((1 / (m + 1)) * (x[m + 1] -
>>>mean[m])^2)
>>>
>>>where var[m] is the sample variance for the first m values
>>
>
>
> =====
> Albert Davidson Chou
>
>     Get answers to Mac questions at http://www.Mac-Mgrs.org/ .
>
> __________________________________
> Do you Yahoo!?
> The New Yahoo! Search - Faster. Easier. Bingo.
> http://search.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org