# commons-dev mailing list archives

From "Phil Steitz" <p...@steitz.com>
Subject Re: [math] UnivariateImpl - when sumsq ~ xbar*xbar*((double) n)
Date Tue, 03 Jun 2003 03:06:17 GMT
mdiggory@latte.harvard.edu wrote:
> My current work on Certified values exposes a limitation in calculating
> Variance in UnivariateImpl.
>
> In the following case, this example data is used to induce roundoff
> error:
>
> public double getVariance() {
>    double variance = Double.NaN;
>    if( n == 1 ) {
>       variance = 0.0;
>    } else if( n > 1 ) {
>       double xbar = getMean();
>       variance = (sumsq - xbar*xbar*((double) n)) /
>                  ((double)(n-1));
>    }
>    return variance;
> }
>
> Basically, when "sumsq" approaches "xbar*xbar*((double) n)", the
> numerator in the variance calculation can become negative, so the return
> from getVariance() is negative and Math.sqrt then goes on to return NaN.
> Of course, negative variances are not in the domain of possible return
> values for variance. The test I discovered this on uses the NumAcc4.txt
> dataset that is now under /src/test/org/apache/commons/math/stat/data.
>
> This data set basically looks like the following:
>
>   10000000.2
>   10000000.1
>   10000000.3
>   10000000.1
>   10000000.3
>   10000000.1
>   10000000.3
>   10000000.1
>   10000000.3
>   ...
>
> I think the different approach below handles extreme cases like this
> more stably:
>
> public double getVariance() {
>    double variance = Double.NaN;
>
>    if( n == 1 ) {
>     variance = 0.0;
>    } else if( n > 1 ) {
>     variance = (sumsq - Math.pow(sum, 2)/(double) n) / (double)(n - 1);
>    }
>    return variance;
> }
>
> The roundoff error in the "sum of squares" is always at least as large
> as that of the "square of the sum", so this form does not induce
> negative numbers and instability when dealing with extreme situations
> like this. While one should try to avoid situations where roundoff
> errors arise in the first place, handling such cases in a stable fashion
> helps to produce a robust algorithm.
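
The cancellation described above can be sketched like this (a minimal demo; the data and class name are illustrative, in the spirit of NumAcc4, and this is not the actual UnivariateImpl code):

```java
// Demonstrates catastrophic cancellation in a one-pass variance formula.
public class CancellationDemo {
    public static void main(String[] args) {
        double[] x = {10000000.2, 10000000.1, 10000000.3, 10000000.1, 10000000.3};
        int n = x.length;

        double sum = 0.0, sumsq = 0.0;
        for (double v : x) {
            sum += v;
            sumsq += v * v;
        }
        double xbar = sum / n;

        // One-pass numerator: two huge, nearly equal quantities are
        // subtracted, so almost all significant digits cancel and the
        // sign of the result can flip.
        double onePass = (sumsq - xbar * xbar * n) / (n - 1);

        // Two-pass reference: deviations from the mean are small numbers,
        // so squaring and summing them stays accurate.
        double devsq = 0.0;
        for (double v : x) {
            double d = v - xbar;
            devsq += d * d;
        }
        double twoPass = devsq / (n - 1); // true variance is 0.01 for this data

        System.out.println("one-pass: " + onePass);
        System.out.println("two-pass: " + twoPass);
        // If the one-pass numerator dips negative, Math.sqrt yields NaN:
        System.out.println(Math.sqrt(-1.0e-12)); // NaN
    }
}
```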

Assuming that the unit tests all succeed and the accuracy against the
reference data sets is at least as good as the current implementation
for all examples, I am OK with this change, but not for the reasons that
you give.

Since xbar = sum/n, the change has no mathematical impact: both versions
compute sumsq - sum**2/n in the numerator.  The difference is that you
are a) eliminating one division by n and one multiplication by n (no
doubt a good thing) and b) replacing direct multiplication with
pow(-,2). The second of these used to be discouraged, but I doubt it
makes any difference with modern compilers.  I would suggest collapsing
the denominators of the sum**2 term -- i.e., use

(1) variance = sumsq/(double) (n - 1) - sum * (sum/(double) (n * (n - 1)))

If

(2) variance = sumsq/(double) (n - 1) - (sum * sum)/(double) (n * (n - 1)) or

(3) variance = sumsq/(double) (n - 1) - Math.pow(sum, 2)/(double) (n * (n - 1))

give better accuracy, use one of them; but I would favor (1), since it will
be able to handle larger positive sums.
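
The preference for (1) can be illustrated with a contrived example: grouping the division first keeps the intermediate result small enough to avoid overflow where squaring the sum directly would not (the specific magnitudes below are made up for illustration):

```java
// Illustrates why dividing before multiplying can avoid overflow.
// For n = 10, the collapsed denominator n * (n - 1) is 90.
public class OverflowDemo {
    public static void main(String[] args) {
        double sum = 2.0e154;   // contrived: sum*sum exceeds Double.MAX_VALUE
        double denom = 10 * 9;  // n * (n - 1) for n = 10

        double squareFirst = (sum * sum) / denom;  // overflows to Infinity
        double divideFirst = sum * (sum / denom);  // stays finite (~4.4e306)

        System.out.println(squareFirst); // Infinity
        System.out.println(divideFirst);
    }
}
```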

I would also recommend forcing getVariance() to return 0 if the result
is negative (which can happen in the right circumstances for any of
these formulas).
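
That guard might look like the following sketch (the class and method names are illustrative, not the actual UnivariateImpl signature):

```java
public class ClampDemo {
    // Roundoff can push the one-pass numerator slightly below zero; since
    // a variance can never legitimately be negative, clamp the result to 0
    // so callers (and Math.sqrt) never see a negative value.
    static double clampVariance(double raw) {
        return Math.max(0.0, raw);
    }

    public static void main(String[] args) {
        System.out.println(clampVariance(-1.0e-12)); // 0.0
        System.out.println(clampVariance(0.01));     // 0.01
    }
}
```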

Phil

>
> I wanted to open this to discussion prior to committing this change.
> Mark
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>
