commons-dev mailing list archives

From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] Inconsistent handling of insufficient data when computing correlations
Date Sat, 09 Nov 2013 03:14:09 GMT
On 11/8/13 1:27 PM, Phil Steitz wrote:
> On 11/8/13 4:35 AM, Matt Adereth wrote:
>> While writing the test cases for KendallsCorrelation, I discovered an
>> interesting behavior with SpearmansCorrelation that might be considered an
>> inconsistency.  SpearmansCorrelation.correlate() throws
>> MathIllegalArgumentException if the array length is less than 2, but
>> returns Double.NaN if the array contains multiple copies of a single value.
> The latter sounds like a bug, assuming you are using the default
> NaturalRanking rank transform.  Ties should be averaged and handled
> correctly in this case.  Please open a JIRA, ideally with test case
> for this.

On closer inspection this does not look like a bug; at least I have not
been able to reproduce it.  You do get NaN when there are not at least
two distinct values in the x array (the first array to be correlated).
That does need to be documented (as it is in SimpleRegression).
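
To make the two behaviors concrete, here is a minimal, self-contained
sketch of a Pearson-style correlation. It is not the Commons Math source;
the class name CorrelationSketch and the use of plain
IllegalArgumentException are assumptions made for illustration. It throws
when there are fewer than two observations and falls through to 0/0 = NaN
when one array has no variation.

```java
// Illustrative only: a hypothetical stand-in, not the Commons Math implementation.
final class CorrelationSketch {
    private CorrelationSketch() {}

    /** Pearson's r for two equal-length arrays. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        if (x.length < 2) {
            // Too few observations: signalled with an exception.
            throw new IllegalArgumentException("at least 2 observations required");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxx = 0.0;
        double syy = 0.0;
        double sxy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }
        // If sxx or syy is exactly zero (no variation), this is 0.0 / 0.0,
        // which evaluates to NaN rather than throwing.
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

With this sketch, pearson(new double[] {1, 1, 1}, new double[] {1, 2, 3})
yields NaN, while pearson(new double[] {1}, new double[] {1}) throws.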

Phil
>
>> This seems inconsistent with how insufficient data is handled elsewhere in
>> Apache Commons Math.
> Good point.  I think there is justification for the different
> behavior here though.  SimpleRegression and the univariate stats are
> mutable, maintaining a dataset that can be added to, with stats
> queried at any point.  So while in theory, getSlope() in
> SimpleRegression could throw IllegalStateException (IAE not really
> appropriate here) when there is not enough data in the model, its
> documented behavior in this case is to return NaN.  The key is to
> clearly document the behavior.  SimpleRegression does this well, the
> correlation classes not so much.  Patches welcome to improve the
> documentation of preconditions and behavior of these classes.  I
> would be OK with changing the correlation classes to return NaNs in
> place of throwing IAE on insufficient data; but this change should
> happen in a major release (i.e. wait for 4.0).
>
> Phil
>
>  
>> In the User Guide for SimpleRegression it says:
>>
>>> When there are fewer than two observations in the model, or when there is
>>> no variation in the x values (i.e. all x values are the same) all
>>> statistics return NaN. At least two observations with different x
>>> coordinates are required to estimate a bivariate regression model.
>>
>> Similarly, all the UnivariateStatistics return Double.NaN when there isn't
>> enough data.
>>
>> When I'm computing various statistics on multiple datasets, it seems
>> unnecessarily cumbersome to specially handle an exception for one statistic and
>> NaNs for the others.  I propose that PearsonsCorrelation and
>> SpearmansCorrelation should return NaN if there is insufficient data,
>> whether it be from not enough observations (< 2) or not enough unique
>> values.
>>
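
Until the behavior is unified, a caller who wants NaN everywhere could
wrap the throwing statistics. The helper below is a hypothetical sketch
(the names NanOnInsufficientData and orNaN are not part of Commons Math).
It relies on the exception being an IllegalArgumentException or a subclass
of it, which should be checked against the version in use. Note that it
also hides other argument errors, such as mismatched array lengths.

```java
import java.util.function.DoubleSupplier;

// Hypothetical caller-side helper; not part of any library.
final class NanOnInsufficientData {
    private NanOnInsufficientData() {}

    /** Evaluates the statistic, mapping IllegalArgumentException to NaN. */
    static double orNaN(DoubleSupplier statistic) {
        try {
            return statistic.getAsDouble();
        } catch (IllegalArgumentException e) {
            // Treat "insufficient data" the same way the NaN-returning classes do.
            return Double.NaN;
        }
    }
}
```

Usage would look like
NanOnInsufficientData.orNaN(() -> correlation.correlation(x, y)),
where correlation is whichever correlation object the caller already has.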



