From: Sébastien Brisard
To: Commons Developers List
Date: Thu, 8 Nov 2012 14:01:29 +0100
Subject: Re: [math] correlation analysis with NaNs
In-Reply-To: <20121108110841.GT20488@dusk.harfang.homelinux.org>
References: <509A4198.7000409@fu-berlin.de>
 <005101cdbce4$bd30a6a0$3791f3e0$@gmail.com>
 <509A5D72.4000100@gmail.com>
 <00ca01cdbcfd$ccff8ce0$66fea6a0$@gmail.com>
 <509B6FA4.8060708@gmail.com>
 <20121108110841.GT20488@dusk.harfang.homelinux.org>

Hi,

2012/11/8 Gilles Sadowski:
> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>> Hi Patrick,
>>
>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>> > I agree that it would be nice to have a constructor that allows you to
>> > specify the ranking algorithm only.
>> >
>> > As far as NaN and the Spearman correlation, maybe we should add a default
>> > strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
>> > encountered. R uses this treatment of missing data and forces users to
>> > choose how to handle it. If we implemented something like listwise or
>> > pairwise deletion it could be used in other classes too. As such, treatment
>> > of missing data should be part of a larger discussion and handled in a more
>> > comprehensive and systematic way.
>>
>> I think this additional option makes sense, but I forward this
>> discussion to the dev mailing list where it is better suited.
>
> I'm wary of having CM handle "missing" data.
> For one thing we'd have to define a "convention" to represent missing data.
> There is no good way to do that in Java.
> Using NaN for this purpose in a
> low-level library is not a good idea IMHO.
>
I agree with Gilles, here. If I remember correctly, R has a special
value NA, or something similar, which differs from NaN.

>
> Then, any convention might not be
> suitable for some user applications, which would lead such an application's
> developer to filter the data anyway in order to change his representation to
> CM's representation. Rather than calling two redundant filtering codes, I'd
> rather assume that CM gets a clean input on which its algorithm can operate.
> As usual, the input is subjected to precondition checks, and exceptions are
> thrown if the data is not clean enough.
>
> In summary: data validation (in the sense of discarding input) should be
> done _before_ calling CM routines.
>
+1.

Sébastien

>
> Regards,
> Gilles
>
>> Thomas
>>
>> > -----Original Message-----
>> > From: Thomas Neidhart [mailto:thomas.neidhart@gmail.com]
>> > Sent: Wednesday, November 07, 2012 8:09 AM
>> > To: user@commons.apache.org
>> > Subject: Re: [math] correlation analysis with NaNs
>> >
>> > On 11/07/2012 01:38 PM, Patrick Meyer wrote:
>> >> You are getting values like 2.5 because of the default ties strategy.
>> >> If you do not want to use that method, create an instance of
>> >> RankingAlgorithm with a different ties strategy and pass it to the
>> >> constructor for the SpearmansCorrelation. This approach also gives you
>> >> control over the method for dealing with NaNs.
>> >> Something like,
>> >>
>> >> import org.apache.commons.math3.linear.Array2DRowRealMatrix;
>> >> import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;
>> >> import org.apache.commons.math3.stat.ranking.NaNStrategy;
>> >> import org.apache.commons.math3.stat.ranking.NaturalRanking;
>> >> import org.apache.commons.math3.stat.ranking.TiesStrategy;
>> >>
>> >> //create data matrix (one row per observation, one column per variable)
>> >> double[] column1 = new double[]{Double.NaN, 1, 2};
>> >> double[] column2 = new double[]{10, 2, 10};
>> >> Array2DRowRealMatrix mydata =
>> >>     new Array2DRowRealMatrix(column1.length, 2);
>> >> for (int i = 0; i < column1.length; i++) {
>> >>     mydata.setEntry(i, 0, column1[i]);
>> >>     mydata.setEntry(i, 1, column2[i]);
>> >> }
>> >>
>> >> //compute correlation with an explicit NaN and ties strategy
>> >> NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED,
>> >>     TiesStrategy.RANDOM);
>> >> SpearmansCorrelation spearman =
>> >>     new SpearmansCorrelation(mydata, ranking);
>> >>
>> >> Try that.
>> >
>> > Hi,
>> >
>> > this will not really help imho.
>> >
>> > As far as I can see, there are at least two problems with the current use of
>> > the RankingAlgorithm in the SpearmansCorrelation class:
>> >
>> > * there is no way to select the ranking algorithm in the constructor
>> >   without passing the values at the same time
>> > * the NaNStrategy.REMOVED does not work symmetrically, i.e. it removes
>> >   the NaN only from the input array where it occurs but not in the
>> >   corresponding array, thus rendering it useless as it will result in
>> >   exceptions (array lengths differ)
>> >
>> > Would you be able to create an issue for this on the issue tracker and
>> > provide the test case?
>> >
>> > Thanks,
>> >
>> > Thomas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
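
For illustration only, here is a minimal sketch of the caller-side filtering discussed in this thread: listwise deletion, where an observation is dropped from both arrays as soon as either value is NaN, so the two arrays keep the same length. It uses plain Java and no Commons Math classes. The names ListwiseDeletion and removeNaNPairs are hypothetical and do not exist in CM, and treating NaN as the "missing" marker is an assumption of this sketch, not a convention the library defines.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Commons Math. */
final class ListwiseDeletion {

    private ListwiseDeletion() { }

    /**
     * Returns copies of x and y that keep only the indices at which
     * neither value is NaN, so both results have the same length.
     * Assumes NaN marks a missing observation.
     */
    static double[][] removeNaNPairs(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException(
                "array lengths differ: " + x.length + " != " + y.length);
        }
        double[] keptX = new double[x.length];
        double[] keptY = new double[y.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            // Drop the whole observation if either coordinate is missing.
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                keptX[n] = x[i];
                keptY[n] = y[i];
                n++;
            }
        }
        return new double[][] {Arrays.copyOf(keptX, n), Arrays.copyOf(keptY, n)};
    }

    public static void main(String[] args) {
        double[] column1 = {Double.NaN, 1, 2};
        double[] column2 = {10, 2, 10};
        double[][] clean = removeNaNPairs(column1, column2);
        System.out.println(Arrays.toString(clean[0])); // [1.0, 2.0]
        System.out.println(Arrays.toString(clean[1])); // [2.0, 10.0]
    }
}
```

The two cleaned arrays could then be handed to a correlation routine that expects NaN-free input of equal length, which matches the "clean input" assumption argued for above.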