You are getting values like 2.5 because of the default ties strategy
(TiesStrategy.AVERAGE), which gives tied values the mean of the ranks they
span. If you do not want to use that method, create an instance of
RankingAlgorithm with a different ties strategy and pass it to the
constructor for the SpearmansCorrelation. This approach also gives you
control over the method for dealing with NaNs. Something like:
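// imports, assuming the Commons Math 3.x package names
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;
import org.apache.commons.math3.stat.ranking.NaNStrategy;
import org.apache.commons.math3.stat.ranking.NaturalRanking;
import org.apache.commons.math3.stat.ranking.TiesStrategy;

// create data matrix: one row per observation, one column per variable
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};
Array2DRowRealMatrix mydata = new Array2DRowRealMatrix(column1.length, 2);
for (int i = 0; i < column1.length; i++) {
    mydata.setEntry(i, 0, column1[i]);
    mydata.setEntry(i, 1, column2[i]);
}
// compute correlation with an explicit NaN strategy and ties strategy
NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED,
        TiesStrategy.RANDOM);
SpearmansCorrelation spearman = new SpearmansCorrelation(mydata, ranking);
// entry (0, 1) is the correlation between column 0 and column 1
double rho = spearman.getCorrelationMatrix().getEntry(0, 1);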
Try that.
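In case it helps to see where the 2.5 comes from, here is a minimal standalone sketch of average-tie ranking in plain Java. It is not the Commons Math implementation; the class name AverageTiesDemo and the method averageRanks are made up for illustration, and the sketch assumes the input contains no NaNs.

```java
import java.util.Arrays;

final class AverageTiesDemo {

    // Assigns 1-based ranks; tied values share the mean of the positions they occupy.
    // Assumption: the input contains no NaN values.
    static double[] averageRanks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort the indices by the value they point to.
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            // Extend j to the end of the run of values equal to values[order[i]].
            int j = i;
            while (j + 1 < n && values[order[j + 1]] == values[order[i]]) {
                j++;
            }
            // Positions i+1 .. j+1 are shared; every member gets their mean.
            double shared = ((i + 1) + (j + 1)) / 2.0;
            for (int k = i; k <= j; k++) {
                ranks[order[k]] = shared;
            }
            i = j + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // The two 10s occupy positions 2 and 3, so both get (2 + 3) / 2 = 2.5.
        System.out.println(Arrays.toString(averageRanks(new double[]{10, 2, 10})));
        // prints [2.5, 1.0, 2.5]
    }
}
```

With a different ties strategy the two 10s would receive 2 and 3 in some order (or both the minimum or maximum), which is why the fractional rank disappears.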
-----Original Message-----
From: Martin Rosellen [mailto:Martin.Rosellen@fu-berlin.de]
Sent: Wednesday, November 07, 2012 6:10 AM
To: Commons Users List
Subject: [math] correlation analysis with NaNs
Dear all,
I have difficulties using the Spearman correlation analysis with double
arrays that may contain NaN entries. As you see in my example I want to
analyse the columns with entries {Double.NaN, 1, 2} and {10, 2, 10}. The
output of the execution of the code below is:
Ranking [1.0, 2.0]
Ranking [2.5, 1.0, 2.5]
correlations 0.8660254037844386
{code}
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};
NaturalRanking rank = new NaturalRanking(NaNStrategy.REMOVED);
double[] ranking1 = rank.rank(column1);
double[] ranking2 = rank.rank(column2);
System.out.println("Ranking " + Arrays.toString(ranking1));
System.out.println("Ranking " + Arrays.toString(ranking2));
SpearmansCorrelation s_corrs = new SpearmansCorrelation();
double correlations = s_corrs.correlation(column1, column2);
System.out.println("correlations " + correlations);
{code}
As I understand Spearman, the result of the correlation should be 1, because
tuples that contain NaNs should be ignored in the ranking and in the
correlation analysis. What I don't understand is why there are ranks like
2.5.
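The expectation described above (drop every tuple that contains a NaN, then rank and correlate what is left) can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Commons Math code: the class PairwiseSpearmanSketch and its methods are hypothetical names, both arrays are assumed to have the same length, and ties are ranked by averaging.

```java
import java.util.Arrays;

final class PairwiseSpearmanSketch {

    // Keeps only the positions where neither array holds NaN.
    // Assumption: x and y have the same length.
    static double[][] dropNaNPairs(double[] x, double[] y) {
        double[] fx = new double[x.length];
        double[] fy = new double[y.length];
        int kept = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                fx[kept] = x[i];
                fy[kept] = y[i];
                kept++;
            }
        }
        return new double[][]{Arrays.copyOf(fx, kept), Arrays.copyOf(fy, kept)};
    }

    // 1-based ranks; tied values share the mean of the positions they occupy.
    static double[] averageRanks(double[] v) {
        int n = v.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(v[a], v[b]));
        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && v[order[j + 1]] == v[order[i]]) {
                j++;
            }
            double shared = ((i + 1) + (j + 1)) / 2.0;
            for (int k = i; k <= j; k++) {
                ranks[order[k]] = shared;
            }
            i = j + 1;
        }
        return ranks;
    }

    // Plain Pearson correlation of two equally long arrays.
    static double pearson(double[] a, double[] b) {
        double meanA = Arrays.stream(a).average().orElse(Double.NaN);
        double meanB = Arrays.stream(b).average().orElse(Double.NaN);
        double sab = 0, saa = 0, sbb = 0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            sab += da * db;
            saa += da * da;
            sbb += db * db;
        }
        return sab / Math.sqrt(saa * sbb);
    }

    // Spearman with pairwise deletion: filter first, then rank, then correlate.
    static double spearman(double[] x, double[] y) {
        double[][] clean = dropNaNPairs(x, y);
        return pearson(averageRanks(clean[0]), averageRanks(clean[1]));
    }

    public static void main(String[] args) {
        double[] column1 = {Double.NaN, 1, 2};
        double[] column2 = {10, 2, 10};
        System.out.println(spearman(column1, column2)); // prints 1.0
    }
}
```

Note that only two complete tuples remain here, (1, 2) and (2, 10), so any strictly monotone pair gives exactly 1 or -1; the value says little about the data. Filtering before ranking also removes the tie in the second column, so no 2.5 appears.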
My workaround works as follows:
- use NaNStrategy.FIXED, so that the NaNs stay in place
- execute the ranking
- round down the ranks like 2.5 if they are not NaN (NaNs are cast to 0.0)
- execute custom Pearson correlation that ignores tuples with NaNs on the
  ranked arrays
Here is the code:
{code}
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};
NaturalRanking rank = new NaturalRanking(NaNStrategy.FIXED);
double[] ranking1 = rank.rank(column1);
double[] ranking2 = rank.rank(column2);
for (int i = 0; i < ranking1.length; i++) {
if (!Double.isNaN(ranking1[i])) {
ranking1[i] = (int) ranking1[i];
}
if (!Double.isNaN(ranking2[i])) {
ranking2[i] = (int) ranking2[i];
}
}
System.out.println("Ranking " + Arrays.toString(ranking1));
System.out.println("Ranking " + Arrays.toString(ranking2));
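// correlationNaNs is a custom method (not part of Commons Math) that
// ignores tuples with NaNs; it is applied to the ranked arrays
PearsonsCorrelation p_corrs = new PearsonsCorrelation();
double correlations = p_corrs.correlationNaNs(ranking1, ranking2);
System.out.println("correlations " + correlations);
{code}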
I hope that my solution for dealing with NaNs isn't missing anything.
Perhaps you can comment on this.
Kind regards
Martin

To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org
