# commons-dev mailing list archives

From "Phil Steitz" <p...@steitz.com>
Subject Re: [math] proposed ordering for task list, scope of initial release
Date Sun, 08 Jun 2003 03:21:51 GMT
>>* Improve numerical accuracy of Univariate and BivariateRegression
>>statistical
>>computations. Encapsulate basic double[] |-> double mean, variance, min, max
>>computations using improved formulas and add these to MathUtils. (probably
>>should add float[], int[], long[] versions as well.) Then refactor all
>>univariate implementations that use stored values (including UnivariateImpl
>>with finite window) to use the improved versions. -- Mark?  I am chasing down
>>the TAS reference to document the source of the _NR_ formula, which I will
>>add to the docs if someone else does the implementation.
>
>
> I was starting to code the updating (storage-less) variance formula, based on
> the Stanford article you cited, as a patch.  I believe the storage-using
> corrected two-pass algorithm is pretty trivial to code once we feel we're on
> solid ground with the reference to cite.
>
>
OK. I finally got hold of the American Statistician article (had to
resort to the old trundle down to local university library method) and
found lots of good stuff in it -- including a reference to Hanson's
recursive formula (from Stanford paper) and some empirical and
theoretical results confirming that NR 14.1.8 is about the best that you
can do for the stored case.  There is a refinement mentioned in which
"pairwise summation" is used (essentially splitting the sample in two
and computing the recursive sums in parallel); but the value of this
only kicks in for large n.  I propose that we use NR 14.1.8 as is for
all stored computations.  Here is good text for the reference:

Based on the <i>corrected two-pass algorithm</i> for computing the
sample variance, as described in "Algorithms for Computing the Sample
Variance: Analysis and Recommendations", Tony F. Chan, Gene H. Golub and
Randall J. LeVeque, <i>The American Statistician</i>, 1983, Vol. 37,
No. 3. (Eq. (1.7) on page 243.)
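
For concreteness, here is a rough sketch of how the corrected two-pass
computation cited above might look for a stored double[] sample. The class
and method names are hypothetical (nothing like this exists in MathUtils
yet), and returning NaN for fewer than two values is an assumption, not a
decision:

```java
// Hypothetical helper; the name and the NaN convention are assumptions.
final class CorrectedTwoPassVariance {

    private CorrectedTwoPassVariance() {}

    /**
     * Sample variance (n - 1 denominator) using the corrected two-pass
     * formula. Returns NaN when fewer than two values are supplied
     * (an assumed convention).
     */
    static double variance(double[] values) {
        int n = values.length;
        if (n < 2) {
            return Double.NaN;
        }

        // First pass: the mean.
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;

        // Second pass: squared deviations, plus the plain sum of
        // deviations. The latter would be exactly zero in exact arithmetic;
        // in floating point it captures the rounding error in the mean.
        double sumSq = 0.0;
        double sumDev = 0.0;
        for (double v : values) {
            double d = v - mean;
            sumSq += d * d;
            sumDev += d;
        }

        // Subtract the correction term before dividing by n - 1.
        return (sumSq - sumDev * sumDev / n) / (n - 1);
    }
}
```

The only difference from the plain two-pass formula is the sumDev
correction term, which costs one extra addition per element.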

The authors' empirical investigation uses a trick that I have thought
about using to investigate the precision of our own code: implement an
algorithm using both floats and doubles, and use the double computations
to assess the stability of the algorithm implemented using floats. Might
want to play with this a little.
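
A quick sketch of what such an experiment could look like, using the
corrected two-pass formula as the algorithm under test. Everything here
(class name, method names, the choice of relative error as the measure) is
an assumption for illustration, and it assumes the reference variance is
non-zero:

```java
// Hypothetical experiment harness; all names are illustrative only.
final class FloatVsDoubleCheck {

    private FloatVsDoubleCheck() {}

    /** Corrected two-pass sample variance in double arithmetic (reference). */
    static double varianceDouble(double[] x) {
        int n = x.length;
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        double mean = sum / n;
        double sumSq = 0.0;
        double sumDev = 0.0;
        for (double v : x) {
            double d = v - mean;
            sumSq += d * d;
            sumDev += d;
        }
        return (sumSq - sumDev * sumDev / n) / (n - 1);
    }

    /** The same algorithm, with every intermediate held in a float. */
    static float varianceFloat(float[] x) {
        int n = x.length;
        float sum = 0.0f;
        for (float v : x) {
            sum += v;
        }
        float mean = sum / n;
        float sumSq = 0.0f;
        float sumDev = 0.0f;
        for (float v : x) {
            float d = v - mean;
            sumSq += d * d;
            sumDev += d;
        }
        return (sumSq - sumDev * sumDev / n) / (n - 1);
    }

    /**
     * Relative error of the float result against the double result.
     * Both versions see the same float-rounded inputs, so the difference
     * reflects arithmetic precision only, not input rounding.
     * Assumes at least two values and a non-zero reference variance.
     */
    static double relativeError(double[] data) {
        float[] asFloat = new float[data.length];
        double[] asDouble = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            asFloat[i] = (float) data[i];
            asDouble[i] = asFloat[i];
        }
        double reference = varianceDouble(asDouble);
        double observed = varianceFloat(asFloat);
        return Math.abs(observed - reference) / Math.abs(reference);
    }
}
```

Running relativeError over samples with a large mean and a small spread
should show how quickly a given formula loses digits in float arithmetic.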

Phil
