mdiggory@latte.harvard.edu wrote:
> My current work on Certified values exposes a limitation in calculating
> Variance in UnivariateImpl.
>
> In the following case where this example data is used to induce roundoff
> error:
>
> public double getVariance() {
>     double variance = Double.NaN;
>     if (n == 1) {
>         variance = 0.0;
>     } else if (n > 1) {
>         double xbar = getMean();
>         variance = (sumsq - xbar * xbar * ((double) n)) /
>             ((double) (n - 1));
>     }
>     return variance;
> }
>
> Basically, when "sumsq" approaches "xbar*xbar*((double) n)", the
> numerator in the variance calculation can become negative, getVariance()
> returns a negative value, and Math.sqrt then goes on to return NaN. Of
> course, negative variances are not in the domain of possible return
> values for variance. The test I discovered this on uses the NumAcc4.txt
> dataset that is now under /src/test/org/apache/commons/math/stat/data.
>
> This data set basically looks like the following:
>
> 10000000.2
> 10000000.1
> 10000000.3
> 10000000.1
> 10000000.3
> 10000000.1
> 10000000.3
> 10000000.1
> 10000000.3
> ...
>
> I think that this different approach below handles extreme cases like this with
> more stability:
>
> public double getVariance() {
>     double variance = Double.NaN;
>
>     if (n == 1) {
>         variance = 0.0;
>     } else if (n > 1) {
>         variance = (sumsq - Math.pow(sum, 2) / (double) n) /
>             (double) (n - 1);
>     }
>     return variance;
> }
>
> The roundoff error in the "sum of squares" is always greater than that
> of the "square of the sum", so this form does not produce negative
> numbers or instability in extreme situations like this one. While one
> should try to avoid situations where roundoff errors arise in the
> first place, handling such cases in a stable fashion helps to produce
> a robust algorithm.
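
Mark's cancellation scenario can be reproduced with a small standalone
driver (a sketch, not part of the original mail; the class and method
names are illustrative). It compares the "sum of squares" formulation
above against a two-pass, mean-centered computation on NumAcc4-style
data:

```java
// Illustrative driver: compares the textbook "sum of squares" variance
// against a two-pass, mean-centered computation on NumAcc4-style data
// (values near 1e7 that differ only by 0.1).
public class VarianceDemo {

    // The formulation under discussion: (sumsq - n*xbar^2) / (n - 1).
    // Subject to catastrophic cancellation when sumsq and n*xbar^2 are
    // nearly equal; the result can even come out negative.
    public static double naiveVariance(double[] x) {
        int n = x.length;
        double sum = 0.0, sumsq = 0.0;
        for (double v : x) {
            sum += v;
            sumsq += v * v;
        }
        double xbar = sum / n;
        return (sumsq - xbar * xbar * (double) n) / (double) (n - 1);
    }

    // Two-pass computation: center on the mean first, so only small
    // quantities are squared and no large terms are subtracted.
    public static double twoPassVariance(double[] x) {
        int n = x.length;
        double sum = 0.0;
        for (double v : x) sum += v;
        double xbar = sum / n;
        double ss = 0.0;
        for (double v : x) {
            double d = v - xbar;
            ss += d * d;
        }
        return ss / (double) (n - 1);
    }

    public static void main(String[] args) {
        // NumAcc4-like pattern: 10000000.2 followed by alternating
        // 10000000.1 and 10000000.3 (certified variance 0.01).
        double[] x = new double[1001];
        x[0] = 10000000.2;
        for (int i = 1; i < x.length; i += 2) {
            x[i] = 10000000.1;
            x[i + 1] = 10000000.3;
        }
        System.out.println("naive:    " + naiveVariance(x));
        System.out.println("two-pass: " + twoPassVariance(x));
    }
}
```

The two-pass form stays close to the certified value 0.01, while the
single-pass form loses most of its significant digits to cancellation.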
Assuming that the unit tests all succeed and the accuracy against the
reference data sets is at least as good as the current implementation
for all examples, I am OK with this change, but not for the reasons that
you give.
Since xbar = sum/n, the change has no impact on which sums are
computed or squared. Instead of (sum/n)*(sum/n)*n your change just
computes sum**2/n. The difference is that you are a) eliminating one
division by n and one multiplication by n (no doubt a good thing) and b)
replacing direct multiplication with Math.pow(sum, 2). The second of
these used to be discouraged, but I doubt it makes any difference with
modern compilers. I would suggest collapsing the denominators and doing
just one cast, i.e., use
(1) variance = sumsq - sum * (sum / (double) (n * (n - 1)))
If
(2) variance = sumsq - (sum * sum) / (double) (n * (n - 1)) or
(3) variance = sumsq - Math.pow(sum, 2) / (double) (n * (n - 1)) give
better accuracy, use one of them; but I would favor (1) since it will be
able to handle larger positive sums.
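
Phil's preference for (1) can be checked directly: squaring the sum
first overflows to Infinity once |sum| exceeds about 1.34e154 (the
square root of Double.MAX_VALUE), while dividing first keeps the
intermediate in range. A minimal sketch, with values chosen purely for
illustration:

```java
// Sketch of why sum * (sum / d) tolerates larger sums than
// (sum * sum) / d: squaring first overflows to Infinity for
// |sum| above ~1.34e154, while dividing first stays finite.
public class OverflowDemo {
    public static void main(String[] args) {
        double sum = 1e155;   // hypothetical very large sum
        double d = 1e6;       // stands in for n * (n - 1)

        double squareFirst = (sum * sum) / d; // sum*sum overflows -> Infinity
        double divideFirst = sum * (sum / d); // about 1e304, still finite

        System.out.println("square first: " + squareFirst);
        System.out.println("divide first: " + divideFirst);
    }
}
```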
I would also recommend forcing getVariance() to return 0 if the result
is negative (which can happen in the right circumstances for any of
these formulas).
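
The negative-result guard can be folded directly into the accessor. A
sketch based on Mark's proposed formula (the class name and the way n,
sum, and sumsq are maintained are assumptions standing in for
UnivariateImpl's internal state):

```java
// Sketch of a variance accessor with the clamp-to-zero guard suggested
// above; the surrounding class is hypothetical scaffolding.
public class VarianceSketch {
    private long n = 0;
    private double sum = 0.0;
    private double sumsq = 0.0;

    public void addValue(double v) {
        n++;
        sum += v;
        sumsq += v * v;
    }

    public double getVariance() {
        double variance = Double.NaN;
        if (n == 1) {
            variance = 0.0;
        } else if (n > 1) {
            variance = (sumsq - Math.pow(sum, 2) / (double) n)
                    / (double) (n - 1);
            // Roundoff can still drive the result slightly negative;
            // clamp to zero so Math.sqrt never returns NaN downstream.
            variance = Math.max(0.0, variance);
        }
        return variance;
    }
}
```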
Phil
>
> I wanted to open this to discussion prior to committing this change.
> Mark
>
> --
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>
