commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles Sadowski <gil...@harfang.homelinux.org>
Subject Re: [math] correlation analysis with NaNs
Date Sun, 18 Nov 2012 22:53:23 GMT
On Sun, Nov 18, 2012 at 11:01:18PM +0100, Thomas Neidhart wrote:
> On 11/09/2012 11:14 PM, Phil Steitz wrote:
> > On 11/9/12 12:18 AM, Thomas Neidhart wrote:
> >> On Thu, Nov 8, 2012 at 7:21 PM, Phil Steitz <phil.steitz@gmail.com> wrote:
> >>
> >>> On 11/8/12 9:44 AM, Phil Steitz wrote:
> >>>> On 11/8/12 8:23 AM, Gilles Sadowski wrote:
> >>>>> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
> >>>>>> On 11/08/2012 02:01 PM, S├ębastien Brisard wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> 2012/11/8 Gilles Sadowski <gilles@harfang.homelinux.org>:
> >>>>>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart
wrote:
> >>>>>>>>> Hi Patrick,
> >>>>>>>>>
> >>>>>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> >>>>>>>>>> I agree that it would be nice to have a constructor
that allows
> >>> you to
> >>>>>>>>>> specific the ranking algorithm only.
> >>>> +1 - patches welcome.
> >>>>>>>>>> As far as NaN and the Spearman correlation,
maybe we should add a
> >>> default
> >>>>>>>>>> strategy of NaNStrategy.FAIL so that an exception
would occur if
> >>> any NaN is
> >>>>>>>>>> encountered. R uses this treatment of missing
data and forces
> >>> users to
> >>>>>>>>>> choose how to handle it. If we implemented something
like listwise
> >>> or
> >>>>>>>>>> pairwise deletion it could be used in other
classes too. As such,
> >>> treatment
> >>>>>>>>>> of missing data should be part of a larger discussion
and handled
> >>> in a more
> >>>>>>>>>> comprehensive and systematic way.
> >>>> +1 to develop a strategy for representing how to represent and
> >>>> handle missing data (see below)
> >>>>>>>>> I think this additional option makes sense, but
I forward this
> >>>>>>>>> discussion to the dev mailing list where it is better
suited.
> >>>>>>>> I'm wary of having CM handle "missing" data.
> >>>>>>>> For one thing we'd have to define a "convention" to
represent
> >>> missing data.
> >>>>>>>> There is no good way to do that in Java. Using NaN for
this purpose
> >>> in a
> >>>>>>>> low-level library is not a good idea IMHO.
> >>>>>>>>
> >>>>>>> I agree with Gilles, here. If I remember correctly, R has
a special
> >>>>>>> value NA, or something similar, which differs from NaN.
> >>>>>>>> Then, any convention might not be
> >>>>>>>> suitable for some user applications, which would lead
such an
> >>> application's
> >>>>>>>> developer to filter the data anyway in order to change
his
> >>> representation to
> >>>>>>>> CM's representation. Rather that calling two redundant
filtering
> >>> codes, I'd
> >>>>>>>> rather assume that CM gets a clean input on which its
algorithm can
> >>> operate.
> >>>>>>>> As usual, the input is subjected to precondition checks,
and
> >>> exceptions are
> >>>>>>>> thrown if the data is not clean enough.
> >>>>>>>>
> >>>>>>>> In summary: data validation (in the sense of discarding
input)
> >>> should not be
> >>>>>>>> done _before_ calling CM routines.
> >>>>>>>>
> >>>>>>> +1.
> >>>>>> ok, I am now confused. First you say that CM should not be involved
in
> >>>>>> data cleaning, but then you state that data validation should
not be
> >>>>>> done before calling CM? May be there is a *not* too much?
> >>>>> Yes, you are right: I wrote the opposite of what I meant.
> >>>>> ---
> >>>>>   In summary: data validation (in the sense of discarding input)
should
> >>>>>   be done _before_ calling CM routines.
> >>>>> ---
> >>>>>
> >>>>>> I think the proposition from Patrick was to exactly do that:
throw an
> >>>>>> exception if such invalid data is encountered (NaNStrategy.FAIL).
> >>>>>>
> >>>>>> The other thing is, that the NaNStrategy.REMOVED is broken,
so either
> >>> we
> >>>>>> fix is or deprecate it.
> >>>> That we should fix.  Please open a JIRA for this.  I assume you are
> >>>> talking about the implementation in NaturalRanking.
> >>>>> +1
> >>>>> [I mean (I think): If people rely on CM's removal of NaNs, we could
fix
> >>> it.
> >>>>> However, if nobody could actually rely on this feature because it
is
> >>> broken,
> >>>>> I'd prefer to remove it.]
> >>>> There are two issues here.  One is specific to ranking algorithms.
> >>>> To be well-defined, a RankingAlgorithm needs a NaNStrategy, since
> >>>> the result has to be a total ordering.  The NaNStrategy.REMOVED
> >>>> strategy is intended to represent removal of NaNs from the data to
> >>>> be ordered.  If it is not implemented correctly in NaturalRanking or
> >>>> other rankings that is a bug and needs to be fixed.
> >>> Sorry, I just reread Patrick's original mail.  IIUC, there is
> >>> nothing wrong with the implementation of NaNStrategy.REMOVED in
> >>> NaturalRanking or other implemented rankings.  The problem is how
> >>> the Spearman's impl handles it.  That is indeed a bug in Spearman's
> >>> impl that should be fixed.  The correct fix is to throw out the
> >>> corresponding entry in the second array when REMOVED is the
> >>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
> >>> setting that as the default is a good idea.  Patches welcome.
> >>>> The second issue is the more general one of how to represent and
> >>>> handle missing data.  I have always seen that as a limitation that
> >>>> we would eventually address on an algorithm by algorithm basis.
> >>>> Different algorithms can be configured to do different things when
> >>>> missing data are encountered.  It is not always possible or
> >>>> desirable to preprocess the data to "eliminate" or impute missing
> >>>> data.  Saying that we are just not going to deal with it is a
> >>>> limitation that I don't think we should impose.  I am would like to
> >>>> hear others' ideas about good ways to model missing data in Java.
> >> Hi Phil,
> >>
> >> ok I have created three new issues:
> >>
> >>  * MATH-891
> >>  * MATH-892
> >>  * MATH-893
> > 
> > Thanks!
> >>
> >> Regarding the NaNStrategy.REMOVED, I think it will be necessary to adjust
> >> the RankingAlgorithm interface a bit. Right now, it only takes as input a
> >> one-dimensional array. But in case of correlations, you have two input
> >> arrays. If you remove from one array the NaN values, you have no means to
> >> know at which index they have been removed to do the same with the other
> >> array.
> > 
> > Or you push that responsibility to the client - in this case
> > SpearmansCorrelation.   My first thought on how to fix the
> > Spearman's impl was to have it compare lengths of ranked / unranked
> > when invoked with the REMOVED NaN strategy and then scan the
> > original arrays when removals happen, adjusting the ranked arrays
> > accordingly.  
> 
> I thought about this a bit more, and I do not think it can be done
> safely on the client side (i.e. SpearmansCorrelation).
> 
> Consider the following case:
> 
>  x: [NaN, 1, 2]
>  y: [1, NaN, 2]
> 
> the ranking algorithm with a NaNStrategy of REMOVED would rank as follows:
> 
>  x: [1, 2]
>  y: [1, 2]
> 
> on the client side, everything looks fine, but in fact we would
> correlate wrong data.
> 
> Additionally, on the client side, we have no means to know the actual
> NaNStrategy that is used, as it is hidden in the ranking algorithm.
> 
> Moreover, comparing with the original array may also not work, as the
> ranking algorithm may change the data, so alignment is not always possible.
> 
> >>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
> >>> setting that as the default is a good idea.  Patches welcome.
> 
> The NaNStrategy.FAILED has been added already, shall we make it the
> default then, what do you think?

+1

Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message