commons-dev mailing list archives

From Gilles Sadowski <gil...@harfang.homelinux.org>
Subject Re: [math] correlation analysis with NaNs
Date Thu, 08 Nov 2012 16:23:46 GMT
On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
> > Hi,
> > 
> > 2012/11/8 Gilles Sadowski <gilles@harfang.homelinux.org>:
> >> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
> >>> Hi Patrick,
> >>>
> >>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> >>>> I agree that it would be nice to have a constructor that allows you to
> >>>> specify the ranking algorithm only.
> >>>>
> >>>> As far as NaN and the Spearman correlation, maybe we should add a default
> >>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> >>>> encountered. R uses this treatment of missing data and forces users to
> >>>> choose how to handle it. If we implemented something like listwise or
> >>>> pairwise deletion it could be used in other classes too. As such, treatment
> >>>> of missing data should be part of a larger discussion and handled in a more
> >>>> comprehensive and systematic way.
> >>>
> >>> I think this additional option makes sense, but I forward this
> >>> discussion to the dev mailing list where it is better suited.
> >>
> >> I'm wary of having CM handle "missing" data.
> >> For one thing we'd have to define a "convention" to represent missing data.
> >> There is no good way to do that in Java. Using NaN for this purpose in a
> >> low-level library is not a good idea IMHO.
> >>
> > I agree with Gilles, here. If I remember correctly, R has a special
> > value NA, or something similar, which differs from NaN.
> >>
> >> Then, any convention might not be
> >> suitable for some user applications, which would lead such an application's
> >> developer to filter the data anyway in order to change his representation to
> >> CM's representation. Rather than calling two redundant filtering codes, I'd
> >> rather assume that CM gets a clean input on which its algorithm can operate.
> >> As usual, the input is subjected to precondition checks, and exceptions are
> >> thrown if the data is not clean enough.
> >>
> >> In summary: data validation (in the sense of discarding input) should not be
> >> done _before_ calling CM routines.
> >>
> > +1.
> 
> ok, I am now confused. First you say that CM should not be involved in
> data cleaning, but then you state that data validation should not be
> done before calling CM? Maybe there is one *not* too many?

Yes, you are right: I wrote the opposite of what I meant.
---
  In summary: data validation (in the sense of discarding input) should
  be done _before_ calling CM routines.
---
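
To illustrate what "validation before calling CM" could look like on the
caller's side, here is a minimal sketch of pairwise deletion done by the
application itself. The class and method names (PairwiseNaNFilter,
dropNaNPairs) are hypothetical and not part of CM, and treating NaN as the
"missing" marker is the application's own convention here, not something
the library would define.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Commons Math.
class PairwiseNaNFilter {

    /**
     * Returns copies of x and y in which every index where either
     * value is NaN has been dropped (pairwise deletion).
     */
    static double[][] dropNaNPairs(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] xs = new double[x.length];
        double[] ys = new double[y.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            // Keep the pair only if both observations are present.
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                xs[n] = x[i];
                ys[n] = y[i];
                n++;
            }
        }
        // Trim the buffers to the number of pairs actually kept.
        return new double[][] { Arrays.copyOf(xs, n), Arrays.copyOf(ys, n) };
    }

    public static void main(String[] args) {
        double[] x = { 1.0, Double.NaN, 3.0, 4.0 };
        double[] y = { 2.0, 5.0, Double.NaN, 8.0 };
        double[][] clean = dropNaNPairs(x, y);
        System.out.println(Arrays.toString(clean[0])); // [1.0, 4.0]
        System.out.println(Arrays.toString(clean[1])); // [2.0, 8.0]
    }
}
```

The cleaned arrays would then be handed to the correlation routine, which
can assume its input contains no NaN.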

> 
> I think the proposition from Patrick was to exactly do that: throw an
> exception if such invalid data is encountered (NaNStrategy.FAIL).
> 
> The other thing is that NaNStrategy.REMOVED is broken, so either we
> fix it or deprecate it.

+1
[I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
However, if nobody could actually rely on this feature because it is broken,
I'd prefer to remove it.]
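
For reference, the fail-fast behaviour discussed above (what a
NaNStrategy.FAIL option would amount to) can be sketched as a plain
precondition check. This is only an illustration: the class and method
names (NaNPrecondition, requireNoNaN) are hypothetical, and the exception
type is an assumption, not necessarily what CM would throw.

```java
// Hypothetical precondition check, not part of Commons Math.
class NaNPrecondition {

    /** Throws if any element of data is NaN; otherwise does nothing. */
    static void requireNoNaN(double[] data) {
        for (int i = 0; i < data.length; i++) {
            if (Double.isNaN(data[i])) {
                // Report the offending index so the caller can locate it.
                throw new IllegalArgumentException("NaN encountered at index " + i);
            }
        }
    }

    public static void main(String[] args) {
        requireNoNaN(new double[] { 1.0, 2.0, 3.0 }); // passes silently
        try {
            requireNoNaN(new double[] { 1.0, Double.NaN });
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // NaN encountered at index 1
        }
    }
}
```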


Sorry for the confusion,
Gilles
