commons-dev mailing list archives

From: Phil Steitz
Subject: Re: [math] correlation analysis with NaNs
Date: Mon, 19 Nov 2012 17:45:22 GMT
On 11/19/12 3:31 AM, Gilles Sadowski wrote:
> On Sun, Nov 18, 2012 at 09:27:41PM -0800, Phil Steitz wrote:
>> On 11/18/12 2:01 PM, Thomas Neidhart wrote:
>>> On 11/09/2012 11:14 PM, Phil Steitz wrote:
>>>> On 11/9/12 12:18 AM, Thomas Neidhart wrote:
>>>>> On Thu, Nov 8, 2012 at 7:21 PM, Phil Steitz wrote:
>>>>>> On 11/8/12 9:44 AM, Phil Steitz wrote:
>>>>>>> On 11/8/12 8:23 AM, Gilles Sadowski wrote:
>>>>>>>> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
>>>>>>>>> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 2012/11/8 Gilles Sadowski:
>>>>>>>>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>>>>>>>>>>>> Hi Patrick,
>>>>>>>>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>>>>>>>>>>>>> I agree that it would be nice to have a constructor that allows you to
>>>>>>>>>>>>> specify the ranking algorithm only.
>>>>>>> +1 - patches welcome.
>>>>>>>>>>>>> As far as NaN and the Spearman correlation, maybe we should add a default
>>>>>>>>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
>>>>>>>>>>>>> encountered. R uses this treatment of missing data and forces users to
>>>>>>>>>>>>> choose how to handle it. If we implemented something like listwise or
>>>>>>>>>>>>> pairwise deletion it could be used in other classes too. As such, treatment
>>>>>>>>>>>>> of missing data should be part of a larger discussion and handled in a more
>>>>>>>>>>>>> comprehensive and systematic way.
>>>>>>> +1 to develop a strategy for how to represent and
>>>>>>> handle missing data (see below)
>>>>>>>>>>>> I think this additional option makes sense, but I forward this
>>>>>>>>>>>> discussion to the dev mailing list where it is better suited.
>>>>>>>>>>> I'm wary of having CM handle "missing" data.
>>>>>>>>>>> For one thing we'd have to define a "convention" to represent missing data.
>>>>>>>>>>> There is no good way to do that in Java. Using NaN for this purpose in a
>>>>>>>>>>> low-level library is not a good idea IMHO.
>>>>>>>>>> I agree with Gilles, here. If I remember correctly, R has a special
>>>>>>>>>> value NA, or something similar, which differs from
>>>>>>>>>>> Then, any convention might not be
>>>>>>>>>>> suitable for some user applications, which would lead such an application's
>>>>>>>>>>> developer to filter the data anyway in order to change his representation to
>>>>>>>>>>> CM's representation. Rather than calling two redundant filtering codes, I'd
>>>>>>>>>>> rather assume that CM gets a clean input on which its algorithm can operate.
>>>>>>>>>>> As usual, the input is subjected to precondition checks, and exceptions are
>>>>>>>>>>> thrown if the data is not clean enough.
>>>>>>>>>>> In summary: data validation (in the sense of discarding input) should not be
>>>>>>>>>>> done _before_ calling CM routines.
>>>>>>>>>> +1.
>>>>>>>>> ok, I am now confused. First you say that CM should not be involved in
>>>>>>>>> data cleaning, but then you state that data validation should not be
>>>>>>>>> done before calling CM? Maybe there is one *not* too many?
>>>>>>>> Yes, you are right: I wrote the opposite of what I meant.
>>>>>>>> ---
>>>>>>>>   In summary: data validation (in the sense of discarding input) should
>>>>>>>>   be done _before_ calling CM routines.
>>>>>>>> ---
>>>>>>>>> I think the proposition from Patrick was to do exactly that: throw an
>>>>>>>>> exception if such invalid data is encountered (NaNStrategy.FAIL).
>>>>>>>>> The other thing is that NaNStrategy.REMOVED is broken, so either we
>>>>>>>>> fix it or deprecate it.
>>>>>>> That we should fix.  Please open a JIRA for this.  I assume you are
>>>>>>> talking about the implementation in NaturalRanking.
>>>>>>>> +1
>>>>>>>> [I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
>>>>>>>> However, if nobody could actually rely on this feature because it is broken,
>>>>>>>> I'd prefer to remove it.]
>>>>>>> There are two issues here.  One is specific to ranking algorithms.
>>>>>>> To be well-defined, a RankingAlgorithm needs a NaNStrategy, since
>>>>>>> the result has to be a total ordering.  The NaNStrategy.REMOVED
>>>>>>> strategy is intended to represent removal of NaNs from the data to
>>>>>>> be ordered.  If it is not implemented correctly in NaturalRanking or
>>>>>>> other rankings that is a bug and needs to be fixed.
>>>>>> Sorry, I just reread Patrick's original mail.  IIUC, there is
>>>>>> nothing wrong with the implementation of NaNStrategy.REMOVED in
>>>>>> NaturalRanking or other implemented rankings.  The problem is how
>>>>>> the Spearman's impl handles it.  That is indeed a bug in Spearman's
>>>>>> impl that should be fixed.  The correct fix is to throw out the
>>>>>> corresponding entry in the second array when REMOVED is the
>>>>>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
>>>>>> setting that as the default is a good idea.  Patches welcome.
>>>>>>> The second issue is the more general one of how to represent and
>>>>>>> handle missing data.  I have always seen that as a limitation that
>>>>>>> we would eventually address on an algorithm by algorithm basis.
>>>>>>> Different algorithms can be configured to do different things when
>>>>>>> missing data are encountered.  It is not always possible or
>>>>>>> desirable to preprocess the data to "eliminate" or impute missing
>>>>>>> data.  Saying that we are just not going to deal with it is a
>>>>>>> limitation that I don't think we should impose.  I would like to
>>>>>>> hear others' ideas about good ways to model missing data in Java.
>>>>> Hi Phil,
>>>>> ok I have created three new issues:
>>>>>  * MATH-891
>>>>>  * MATH-892
>>>>>  * MATH-893
>>>> Thanks!
>>>>> Regarding the NaNStrategy.REMOVED, I think it will be necessary to adjust
>>>>> the RankingAlgorithm interface a bit. Right now, it only takes as input a
>>>>> one-dimensional array. But in case of correlations, you have two input
>>>>> arrays. If you remove from one array the NaN values, you have no means to
>>>>> know at which index they have been removed to do the same with the other
>>>>> array.
>>>> Or you push that responsibility to the client - in this case
>>>> SpearmansCorrelation.   My first thought on how to fix the
>>>> Spearman's impl was to have it compare lengths of ranked / unranked
>>>> when invoked with the REMOVED NaN strategy and then scan the
>>>> original arrays when removals happen, adjusting the ranked arrays
>>>> accordingly.  
>>> I thought about this a bit more, and I do not think it can be done
>>> safely on the client side (i.e. SpearmansCorrelation).
>>> Consider the following case:
>>>  x: [NaN, 1, 2]
>>>  y: [1, NaN, 2]
>>> the ranking algorithm with a NaNStrategy of REMOVED would rank as follows:
>>>  x: [1, 2]
>>>  y: [1, 2]
>>> on the client side, everything looks fine, but in fact we would
>>> correlate wrong data.
>>> Additionally, on the client side, we have no means to know the actual
>>> NaNStrategy that is used, as it is hidden in the ranking algorithm.
>>> Moreover, comparing with the original array may also not work, as the
>>> ranking algorithm may change the data, so alignment is not always possible.
>>>>>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
>>>>>> setting that as the default is a good idea.  Patches welcome.
>>> The NaNStrategy.FAILED has been added already, shall we make it the
>>> default then, what do you think?
>> I think that is probably best, since what I was trying to do was a
>> poor man's strategy for missing data.  In the case above, I would
>> have the client eliminate both of the first two observations, so
>> there would not be enough data left, but this is hard to document
>> and implement and is really just a hack to support one missing data
>> scenario.
>> Now is as good a time as any to think about how to correctly
>> represent and handle missing data.  The unfortunate thing is that in
>> Java working with primitive doubles we are back to the old Fortran
>> days of having no natural representation of a missing value. 
>> Sticking with primitives, the only thing we can do is either use NaN
>> or allow the "missing" designator to be configured by the user.  I
>> am curious what others have done in this area.
> As you say, as I said, with primitive double, there is no value that can
> readily serve as "missing". It's a user's choice (e.g. "Double.NaN",
> "Double.MAX_VALUE", "-Double.MAX_VALUE", "any negative value", ...), that
> depends on the context.
>> The second question is what strategies do we support for handling
>> missing data and how do we represent those strategies.   The
>> simplest and easiest strategy to implement is to delete observations
>> that include missing data.  This is a data-only strategy and would
>> work the same way across algorithms.  I am afraid, however, that
>> this is the only strategy that is not algorithm-dependent (unless
>> you consider, e.g. EM as a missing data strategy or very simple
>> imputation strategies).  So that means individual algorithms need to
>> include missing data strategies in their specifications.  It might
>> be good to define and implement these for the correlation and
>> regression classes and see if we can generalize.  Any ideas on how
>> best to do this?
> I'm sorry if I'm dense, but I don't remember if or why the option that users
> should provide clean input data to CM has been ruled out.
> I.e. filtering (by user) is done before computation (by CM's algo).
> If the data is missing, how can you use it (to correlate, to fit, ...)?

There are multiple techniques that can be used to adjust for missing
data, depending on the algorithm.  See [1], for example, for a
summary of the kinds of techniques that can be used in regression. 
Basically, saying users need to adjust the data before providing it
to the algorithm allows only the "data only" approaches and may be
inconvenient or make it impossible to perform other analyses on
the same data.
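
For concreteness, here is a minimal sketch of the simplest "data only"
approach mentioned above: deleting every observation in which either
value is missing, so that the two arrays stay aligned.  The class name
PairedDeletion, the method deletePairs and the isMissing predicate are
hypothetical and are not part of Commons Math; the predicate stands in
for a user-configured "missing" designator (NaN here, but it could be
any value the caller chooses).

```java
import java.util.Arrays;
import java.util.function.DoublePredicate;

/** Hypothetical helper, not part of Commons Math. */
final class PairedDeletion {
    private PairedDeletion() {}

    /**
     * Keeps only the indices i where neither x[i] nor y[i] is missing.
     * Returns a two-row array: row 0 holds the kept x values, row 1 the kept y values.
     */
    static double[][] deletePairs(double[] x, double[] y, DoublePredicate isMissing) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] xs = new double[x.length];
        double[] ys = new double[y.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            // Drop the whole observation if either coordinate is missing.
            if (!isMissing.test(x[i]) && !isMissing.test(y[i])) {
                xs[n] = x[i];
                ys[n] = y[i];
                n++;
            }
        }
        return new double[][] { Arrays.copyOf(xs, n), Arrays.copyOf(ys, n) };
    }

    public static void main(String[] args) {
        // Thomas's example from earlier in the thread.
        double[] x = { Double.NaN, 1, 2 };
        double[] y = { 1, Double.NaN, 2 };
        double[][] kept = deletePairs(x, y, v -> Double.isNaN(v));
        System.out.println(Arrays.toString(kept[0]) + " " + Arrays.toString(kept[1]));
        // prints "[2.0] [2.0]"
    }
}
```

On Thomas's example only the pair (2, 2) survives, which is the "not
enough data left" situation described earlier in the thread; removing
NaNs from each array independently would instead leave [1, 2] and
[1, 2], which look aligned but pair up the wrong observations.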


> Regards,
> Gilles
