commons-dev mailing list archives

From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] Inconsistent handling of insufficient data when computing correlations
Date Fri, 08 Nov 2013 21:27:50 GMT
On 11/8/13 4:35 AM, Matt Adereth wrote:
> While writing the test cases for KendallsCorrelation, I discovered an
> interesting behavior with SpearmansCorrelation that might be considered an
> inconsistency.  SpearmansCorrelation.correlate() throws
> MathIllegalArgumentException if the array length is less than 2, but
> returns Double.NaN if the array contains multiple copies of a single value.

The latter sounds like a bug, assuming you are using the default
NaturalRanking rank transform.  Ties should be averaged and handled
correctly in this case.  Please open a JIRA, ideally with a test
case that reproduces it.
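
To make "ties should be averaged" concrete, here is a minimal,
self-contained sketch of average-rank tie handling.  It is not the
NaturalRanking source; the class and method names are hypothetical and
chosen only for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is NOT Commons Math's NaturalRanking.
public class AverageRankSketch {

    /** Assigns 1-based ranks; tied values all receive the mean of the ranks they span. */
    public static double[] averageRanks(double[] data) {
        int n = data.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort the indices by the values they point at.
        Arrays.sort(order, (a, b) -> Double.compare(data[a], data[b]));

        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            // Extend j to the end of the run of values equal to data[order[i]].
            int j = i;
            while (j + 1 < n && data[order[j + 1]] == data[order[i]]) {
                j++;
            }
            double avg = (i + j) / 2.0 + 1.0; // mean of the 1-based ranks i+1 .. j+1
            for (int k = i; k <= j; k++) {
                ranks[order[k]] = avg;
            }
            i = j + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // prints [1.0, 2.5, 2.5, 4.0]
        System.out.println(Arrays.toString(averageRanks(new double[] {10, 20, 20, 30})));
        // prints [2.0, 2.0, 2.0]
        System.out.println(Arrays.toString(averageRanks(new double[] {5, 5, 5})));
    }
}
```

Note that when every value is tied, every rank comes out equal, so a
test case for the JIRA should state what result is expected for such
an input.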

>
> This seems inconsistent with how insufficient data is handled elsewhere in
> Apache Commons Math.

Good point.  I think the different behavior is justified here,
though.  SimpleRegression and the univariate stats are mutable: they
maintain a dataset that can be added to, with stats queried at any
point.  So while in theory getSlope() in SimpleRegression could throw
IllegalStateException (IAE, i.e. IllegalArgumentException, is not
really appropriate here) when there is not enough data in the model,
its documented behavior in this case is to return NaN.  The key is to
document the behavior clearly.  SimpleRegression does this well; the
correlation classes not so much.  Patches are welcome to improve the
documentation of preconditions and behavior of these classes.  I
would be OK with changing the correlation classes to return NaN
instead of throwing IAE on insufficient data, but this change should
happen in a major release (i.e. wait for 4.0).
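
The two conventions under discussion can be sketched side by side.
This is a hedged, self-contained illustration, not the
PearsonsCorrelation implementation; the class and method names are
hypothetical.

```java
// Hypothetical illustration only; these are NOT Commons Math classes or methods.
public class InsufficientDataSketch {

    /** Pearson's r; returns NaN when it is undefined (fewer than 2 points or zero variance). */
    public static double pearsonOrNaN(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int n = x.length;
        if (n < 2) {
            return Double.NaN; // "insufficient data" reported as a value
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // If either input is constant, this is 0.0 / 0.0, which is NaN.
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Same computation, but too few observations is treated as a caller error. */
    public static double pearsonStrict(double[] x, double[] y) {
        if (x.length < 2) {
            throw new IllegalArgumentException("at least 2 observations are required");
        }
        return pearsonOrNaN(x, y);
    }

    public static void main(String[] args) {
        System.out.println(pearsonOrNaN(new double[] {1, 2, 3}, new double[] {2, 4, 6})); // 1.0
        System.out.println(pearsonOrNaN(new double[] {1}, new double[] {2}));             // NaN
        System.out.println(pearsonOrNaN(new double[] {4, 4, 4}, new double[] {1, 2, 3})); // NaN
    }
}
```

Whichever convention a class follows, callers can only rely on it if
the Javadoc states it explicitly.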

Phil

 
>
> In the User Guide for SimpleRegression it says:
>
>> When there are fewer than two observations in the model, or when there is
> no variation in the x values (i.e. all x values are the same) all
> statistics return NaN. At least two observations with different x
> coordinates are required to estimate a bivariate regression model.
>
> Similarly, all the UnivariateStatistics return Double.NaN when there isn't
> enough data.
>
> When I'm computing various statistics on multiple datasets, it seems
> unnecessarily cumbersome to specially handle an exception for one
> statistic and NaNs for the others.  I propose that PearsonsCorrelation and
> SpearmansCorrelation should return NaN if there is insufficient data,
> whether it be from not enough observations (< 2) or not enough unique
> values.
>

