commons-dev mailing list archives

From "Phil Steitz" <p...@steitz.com>
Subject Re: [math] UnivariateImpl - when sumsq ~ xbar*xbar*((double) n)
Date Tue, 03 Jun 2003 03:06:17 GMT
mdiggory@latte.harvard.edu wrote:
> My current work on Certified values exposes a limitation in calculating 
> Variance in UnivariateImpl.
> 
> Roundoff error shows up in the current implementation when it is fed 
> the example data below:
> 
> public double getVariance() {
>    double variance = Double.NaN;
>    if (n == 1) {
>       variance = 0.0;
>    } else if (n > 1) {
>       double xbar = getMean();
>       variance = (sumsq - xbar * xbar * ((double) n)) /
>                  ((double) (n - 1));
>    }
>    return variance;
> }
> 
> Basically, when "sumsq" approaches "xbar*xbar*((double) n)", the 
> numerator in the variance calculation can become negative, so 
> getVariance() returns a negative value, and Math.sqrt then goes on to 
> return NaN. Of course, negative variances are not in the domain of 
> possible return values for variance. The test I discovered this on uses 
> the NumAcc4.txt dataset that is now under 
> /src/test/org/apache/commons/math/stat/data.
> 
> This data set basically looks like the following:
> 
>   10000000.2
>   10000000.1
>   10000000.3
>   10000000.1
>   10000000.3
>   10000000.1
>   10000000.3
>   10000000.1
>   10000000.3
>   ...
> 
> I think the different approach below handles extreme cases like this 
> more stably:
> 
> public double getVariance() {
>    double variance = Double.NaN;
> 
>    if (n == 1) {
>       variance = 0.0;
>    } else if (n > 1) {
>       variance = (sumsq - Math.pow(sum, 2) / (double) n) / (double) (n - 1);
>    }
>    return variance;
> }
> 
> The roundoff error in the "sum of squares" is always greater than that 
> of the "square of the sum", so this form avoids the negative numerators 
> and instability that arise in extreme situations like this one. I think 
> that while one should try to avoid situations where roundoff errors 
> arise, dealing with such cases in a stable fashion should help to 
> produce a robust algorithm.

Assuming that the unit tests all succeed and the accuracy against the 
reference data sets is at least as good as the current implementation 
for all examples, I am OK with this change, but not for the reasons that 
you give.
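
For what it's worth, the cancellation itself is easy to reproduce in 
isolation. Here is a minimal sketch (hypothetical class, not the 
commons-math code, with data following the NumAcc4 pattern) contrasting 
the one-pass formula against a stable two-pass computation:

```java
// Sketch only: one-pass "textbook" variance vs. a two-pass reference on
// data shaped like the NumAcc4 set (values near 1e7, true variance 0.01).
class VarianceRoundoffDemo {

    // One-pass formula: sumsq and xbar*xbar*n agree in all their leading
    // digits here, so the subtraction is pure cancellation and the result
    // is dominated by accumulated roundoff (it can even come out negative,
    // which sends Math.sqrt to NaN).
    static double onePassVariance(double[] data) {
        double sum = 0.0, sumsq = 0.0;
        for (double x : data) {
            sum += x;
            sumsq += x * x;
        }
        int n = data.length;
        double xbar = sum / n;
        return (sumsq - xbar * xbar * (double) n) / (double) (n - 1);
    }

    // Two-pass reference: deviations from the mean are ~0.1, so squaring
    // and summing them loses almost nothing.
    static double twoPassVariance(double[] data) {
        double sum = 0.0;
        for (double x : data) sum += x;
        double xbar = sum / data.length;
        double ss = 0.0;
        for (double x : data) ss += (x - xbar) * (x - xbar);
        return ss / (data.length - 1);
    }

    // 10000000.2 followed by alternating 10000000.1 / 10000000.3.
    static double[] numAcc4LikeData(int n) {
        double[] d = new double[n];
        d[0] = 10000000.2;
        for (int i = 1; i < n; i++) {
            d[i] = (i % 2 == 1) ? 10000000.1 : 10000000.3;
        }
        return d;
    }

    public static void main(String[] args) {
        double[] d = numAcc4LikeData(1001);
        System.out.println("one-pass: " + onePassVariance(d)); // roundoff noise
        System.out.println("two-pass: " + twoPassVariance(d)); // close to 0.01
    }
}
```

With sumsq around 1e17, its spacing between adjacent doubles is 16, so 
the one-pass numerator is quantized far more coarsely than the true 
value of about 10.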

Since xbar = sum/n, the change has no impact on which sums are 
computed or squared. Instead of (sum/n)*(sum/n)*n, your change just 
computes sum**2/n.  The difference is that you are a) eliminating one 
division by n and one multiplication by n (no doubt a good thing) and b) 
replacing direct multiplication with pow(-,2). The second of these used 
to be discouraged, but I doubt it makes any difference with modern 
compilers.  I would also suggest balancing the parentheses so that the 
whole numerator is divided by n - 1 -- i.e., use

(1) variance = (sumsq - sum * (sum / (double) n)) / (double) (n - 1)

If

(2) variance = (sumsq - (sum * sum) / (double) n) / (double) (n - 1) or

(3) variance = (sumsq - Math.pow(sum, 2) / (double) n) / (double) (n - 1)

give better accuracy, use one of them; but I would favor (1) since 
dividing sum by n before multiplying keeps the intermediate product in 
range for larger positive sums.
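
The advantage of (1) for large sums is easy to see in isolation. A tiny 
sketch with hypothetical magnitudes (sum chosen so that sum * sum 
exceeds Double.MAX_VALUE, about 1.8e308, while dividing first keeps the 
product finite):

```java
// Sketch only: order of operations decides whether the intermediate
// product overflows, even though the mathematical result is identical.
class OverflowOrderDemo {

    static double squareThenDivide(double sum, double n) {
        return (sum * sum) / n; // sum * sum overflows to Infinity here
    }

    static double divideThenMultiply(double sum, double n) {
        return sum * (sum / n); // dividing first shrinks one operand
    }

    public static void main(String[] args) {
        double sum = 2e154, n = 1000.0;
        System.out.println(squareThenDivide(sum, n));   // Infinity
        System.out.println(divideThenMultiply(sum, n)); // finite, about 4e305
    }
}
```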

I would also recommend forcing getVariance() to return 0 if the result 
is negative (which can happen in the right circumstances for any of 
these formulas).
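
Concretely, a sketch of what that might look like (hypothetical class 
name; the accumulator fields n, sum, and sumsq are assumptions standing 
in for whatever UnivariateImpl actually maintains):

```java
// Sketch only, not the committed commons-math code.
class UnivariateSketch {
    private long n = 0;       // count of values added
    private double sum = 0.0;   // running sum of values
    private double sumsq = 0.0; // running sum of squared values

    public void addValue(double v) {
        n++;
        sum += v;
        sumsq += v * v;
    }

    public double getVariance() {
        double variance = Double.NaN;
        if (n == 1) {
            variance = 0.0;
        } else if (n > 1) {
            // Divide sum by n before multiplying, so the intermediate
            // product stays in range for larger sums.
            variance = (sumsq - sum * (sum / (double) n)) / (double) (n - 1);
            // Roundoff can still push the result slightly negative; clamp
            // to zero so Math.sqrt(getVariance()) never returns NaN.
            if (variance < 0.0) {
                variance = 0.0;
            }
        }
        return variance;
    }
}
```

With the clamp in place, a standard deviation computed as 
Math.sqrt(getVariance()) is never NaN for n >= 1.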

Phil

> 
> I wanted to open this to discussion prior to committing this change.
> Mark
> 
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org

