commons-dev mailing list archives

From "Phil Steitz" <p...@steitz.com>
Subject Re: [math] proposed ordering for task list, scope of initial release
Date Sun, 08 Jun 2003 03:21:51 GMT

>>* Improve numerical accuracy of Univariate and BivariateRegression
>>statistical computations. Encapsulate basic double[] |-> double mean,
>>variance, min, max computations using improved formulas and add these to
>>MathUtils. (probably should add float[], int[], long[] versions as well.)
>>Then refactor all univariate implementations that use stored values
>>(including UnivariateImpl with finite window) to use the improved versions.
>>-- Mark?  I am chasing down the TAS reference to document the source of the
>>_NR_ formula, which I will add to the docs if someone else does the
>>implementation.
> 
> 
> I was starting to code the updating (storage-less) variance formula, based on
> the Stanford article you cited, as a patch.  I believe the storage-using
> corrected two-pass algorithm is pretty trivial to code once we feel we're on
> solid ground with the reference to cite.
> 
> 
OK. I finally got hold of the American Statistician article (had to 
resort to the old trundle down to local university library method) and 
found lots of good stuff in it -- including a reference to Hanson's 
recursive formula (from the Stanford paper) and some empirical and 
theoretical results confirming that NR 14.1.8 is about the best that you 
can do for the stored case.  There is a refinement mentioned in which 
"pairwise summation" is used (essentially splitting the sample in two 
and computing the recursive sums in parallel); but the value of this 
only kicks in for large n.  I propose that we use NR 14.1.8 as is for 
all stored computations.  Here is good text for the reference:

Based on the <i>corrected two-pass algorithm</i> for computing the 
sample variance, as described in "Algorithms for Computing the Sample 
Variance: Analysis and Recommendations", Tony F. Chan, Gene H. Golub and 
Randall J. LeVeque, <i>The American Statistician</i>, 1983, Vol 37, 
No. 3. (Eq. (1.7) on page 243.)
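
A minimal sketch of the corrected two-pass computation, as I read Eq. (1.7): take the mean in a first pass, then accumulate both the squared deviations and the plain deviations in a second pass, and subtract (sum of deviations)^2 / n to correct for rounding error in the mean. The class and method names are hypothetical, not the MathUtils API, and returning NaN for fewer than two values is an assumption.

```java
/** Hypothetical helper, for illustration only; not the MathUtils API. */
final class TwoPassVariance {

    private TwoPassVariance() {
    }

    /**
     * Sample variance (n - 1 denominator) using the corrected two-pass form.
     * Returns NaN for fewer than two values (an assumption of this sketch).
     */
    static double variance(double[] values) {
        int n = values.length;
        if (n < 2) {
            return Double.NaN;
        }

        // First pass: the mean.
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;

        // Second pass: squared deviations, plus the plain deviations.
        // In exact arithmetic sumDev is zero; in floating point it holds
        // the rounding error of the mean, which the correction term removes.
        double sumSq = 0.0;
        double sumDev = 0.0;
        for (double v : values) {
            double dev = v - mean;
            sumSq += dev * dev;
            sumDev += dev;
        }
        return (sumSq - (sumDev * sumDev) / n) / (n - 1);
    }
}
```

The stored-value univariate implementations could delegate to something of this shape once the reference text is settled.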

The empirical investigation that the authors do uses the following trick 
that I have thought about using to investigate the precision in our 
stuff:  implement an algorithm using both floats and doubles and use the 
double computations to assess stability of the algorithm implemented 
using floats. Might want to play with this a little.
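
A rough sketch of how that float-versus-double check could be set up: run a variance algorithm entirely in float arithmetic, run the two-pass form in double on the same data as the reference, and look at the relative error. The textbook one-pass formula is included only as a contrast case. All names here are hypothetical, and treating the double result as "true" is an assumption that holds only while the data are well within double precision.

```java
/** Hypothetical harness, for illustration only. */
final class FloatVersusDouble {

    private FloatVersusDouble() {
    }

    /** Corrected two-pass variance carried out entirely in float arithmetic. */
    static float twoPassFloat(float[] x) {
        int n = x.length;
        float sum = 0.0f;
        for (float v : x) {
            sum += v;
        }
        float mean = sum / n;
        float sumSq = 0.0f;
        float sumDev = 0.0f;
        for (float v : x) {
            float dev = v - mean;
            sumSq += dev * dev;
            sumDev += dev;
        }
        return (sumSq - sumDev * sumDev / n) / (n - 1);
    }

    /** Textbook one-pass formula in float arithmetic (the contrast case). */
    static float textbookFloat(float[] x) {
        int n = x.length;
        float sum = 0.0f;
        float sumSq = 0.0f;
        for (float v : x) {
            sum += v;
            sumSq += v * v;
        }
        return (sumSq - sum * sum / n) / (n - 1);
    }

    /** Reference value: the same data, corrected two-pass in double. */
    static double referenceDouble(float[] x) {
        int n = x.length;
        double sum = 0.0;
        for (float v : x) {
            sum += v; // float widens to double exactly
        }
        double mean = sum / n;
        double sumSq = 0.0;
        double sumDev = 0.0;
        for (float v : x) {
            double dev = v - mean;
            sumSq += dev * dev;
            sumDev += dev;
        }
        return (sumSq - sumDev * sumDev / n) / (n - 1);
    }

    /** Relative error of a float result against the double reference. */
    static double relativeError(float approx, double reference) {
        return Math.abs(approx - reference) / Math.abs(reference);
    }
}
```

On data whose mean is large relative to its spread, I would expect the textbook formula in float to lose most of its digits while the two-pass form stays close to the double reference, but that is exactly what the experiment is meant to confirm.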

Phil

