commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Phil Steitz <>
Subject Re: [math] correlation analysis with NaNs
Date Thu, 08 Nov 2012 17:44:02 GMT
On 11/8/12 8:23 AM, Gilles Sadowski wrote:
> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
>> On 11/08/2012 02:01 PM, S├ębastien Brisard wrote:
>>> Hi,
>>> 2012/11/8 Gilles Sadowski <>:
>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>>>>> Hi Patrick,
>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>>>>>> I agree that it would be nice to have a constructor that allows you
>>>>>> specific the ranking algorithm only.
+1 - patches welcome.
>>>>>> As far as NaN and the Spearman correlation, maybe we should add a
>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if
any NaN is
>>>>>> encountered. R uses this treatment of missing data and forces users
>>>>>> choose how to handle it. If we implemented something like listwise
>>>>>> pairwise deletion it could be used in other classes too. As such,
>>>>>> of missing data should be part of a larger discussion and handled
in a more
>>>>>> comprehensive and systematic way.
+1 to develop a strategy for representing how to represent and
handle missing data (see below)
>>>>> I think this additional option makes sense, but I forward this
>>>>> discussion to the dev mailing list where it is better suited.
>>>> I'm wary of having CM handle "missing" data.
>>>> For one thing we'd have to define a "convention" to represent missing data.
>>>> There is no good way to do that in Java. Using NaN for this purpose in a
>>>> low-level library is not a good idea IMHO.
>>> I agree with Gilles, here. If I remember correctly, R has a special
>>> value NA, or something similar, which differs from NaN.
>>>> Then, any convention might not be
>>>> suitable for some user applications, which would lead such an application's
>>>> developer to filter the data anyway in order to change his representation
>>>> CM's representation. Rather that calling two redundant filtering codes, I'd
>>>> rather assume that CM gets a clean input on which its algorithm can operate.
>>>> As usual, the input is subjected to precondition checks, and exceptions are
>>>> thrown if the data is not clean enough.
>>>> In summary: data validation (in the sense of discarding input) should not
>>>> done _before_ calling CM routines.
>>> +1.
>> ok, I am now confused. First you say that CM should not be involved in
>> data cleaning, but then you state that data validation should not be
>> done before calling CM? May be there is a *not* too much?
> Yes, you are right: I wrote the opposite of what I meant.
> ---
>   In summary: data validation (in the sense of discarding input) should
>   be done _before_ calling CM routines.
> ---
>> I think the proposition from Patrick was to exactly do that: throw an
>> exception if such invalid data is encountered (NaNStrategy.FAIL).
>> The other thing is, that the NaNStrategy.REMOVED is broken, so either we
>> fix is or deprecate it.

That we should fix.  Please open a JIRA for this.  I assume you are
talking about the implementation in NaturalRanking.
> +1
> [I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
> However, if nobody could actually rely on this feature because it is broken,
> I'd prefer to remove it.]

There are two issues here.  One is specific to ranking algorithms. 
To be well-defined, a RankingAlgorithm needs a NaNStrategy, since
the result has to be a total ordering.  The NaNStrategy.REMOVED
strategy is intended to represent removal of NaNs from the data to
be ordered.  If it is not implemented correctly in NaturalRanking or
other rankings that is a bug and needs to be fixed.

The second issue is the more general one of how to represent and
handle missing data.  I have always seen that as a limitation that
we would eventually address on an algorithm by algorithm basis. 
Different algorithms can be configured to do different things when
missing data are encountered.  It is not always possible or
desirable to preprocess the data to "eliminate" or impute missing
data.  Saying that we are just not going to deal with it is a
limitation that I don't think we should impose.  I am would like to
hear others' ideas about good ways to model missing data in Java.


> Sorry for the confusion,
> Gilles
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message