Date: Wed, 07 Nov 2012 12:10:16 +0100
From: Martin Rosellen <Martin.Rosellen@fu-berlin.de>
To: Commons Users List <user@commons.apache.org>
Subject: [math] correlation analysis with NaNs

Dear all,

I am having difficulty using the Spearman correlation analysis with double
arrays that may contain NaN entries. In the example below I want to
analyse the columns {Double.NaN, 1, 2} and {10, 2, 10}. Running the code
prints:

Ranking [1.0, 2.0]
Ranking [2.5, 1.0, 2.5]
correlations 0.8660254037844386


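{code}
// Package names assume Commons Math 3.x.
import java.util.Arrays;
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;
import org.apache.commons.math3.stat.ranking.NaNStrategy;
import org.apache.commons.math3.stat.ranking.NaturalRanking;

double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};

// Rank each column on its own, removing NaN entries.
NaturalRanking rank = new NaturalRanking(NaNStrategy.REMOVED);
double[] ranking1 = rank.rank(column1);
double[] ranking2 = rank.rank(column2);

System.out.println("Ranking " + Arrays.toString(ranking1));
System.out.println("Ranking " + Arrays.toString(ranking2));

// Spearman correlation on the raw columns, using the default constructor.
SpearmansCorrelation s_corrs = new SpearmansCorrelation();
double correlations = s_corrs.correlation(column1, column2);

System.out.println("correlations " + correlations);
{code}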

As I understand Spearman's correlation, the result should be 1, because
tuples that contain a NaN should be ignored both in the ranking and in
the correlation analysis. What I also don't understand is why there are
ranks like 2.5.
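
For reference, a rank of 2.5 is what a tie-averaging rule would produce: in {10, 2, 10} the two 10s share positions 2 and 3, so each gets (2 + 3) / 2 = 2.5. The following is a minimal plain-Java sketch of that rule. The class and method names are hypothetical, it is not the Commons Math implementation, and it assumes the input contains no NaN entries.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Commons Math: ranks values 1..n and
// gives tied values the average of the positions they occupy.
class TieAveragedRanks {

    static double[] rank(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort indices by the value they point to (assumes no NaN entries).
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            // Extend the run while the next sorted value equals the current one.
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            // Positions start+1 .. end+1 are shared; assign their mean.
            double average = ((start + 1) + (end + 1)) / 2.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = average;
            }
            start = end + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Prints [2.5, 1.0, 2.5]
        System.out.println(Arrays.toString(rank(new double[]{10, 2, 10})));
    }
}
```
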

My workaround is as follows:
- use NaNStrategy.FIXED, so that the NaNs stay in place
- execute the ranking
- round down fractional ranks such as 2.5, skipping NaN entries (casting
NaN to int would give 0)
- run a custom Pearson correlation, which ignores tuples containing NaNs,
on the ranked arrays

Here is the code:
{code}
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};

// FIXED leaves NaN entries in place, so both rankings keep their length.
NaturalRanking rank = new NaturalRanking(NaNStrategy.FIXED);

double[] ranking1 = rank.rank(column1);
double[] ranking2 = rank.rank(column2);

// Round fractional ranks down; NaN is skipped because (int) NaN is 0.
for (int i = 0; i < ranking1.length; i++) {
    if (!Double.isNaN(ranking1[i])) {
        ranking1[i] = (int) ranking1[i];
    }

    if (!Double.isNaN(ranking2[i])) {
        ranking2[i] = (int) ranking2[i];
    }
}

System.out.println("Ranking " + Arrays.toString(ranking1));
System.out.println("Ranking " + Arrays.toString(ranking2));

// correlationNaNs is my custom NaN-ignoring Pearson method (not a
// Commons Math method); it is applied to the ranked arrays.
PearsonsCorrelation p_corrs = new PearsonsCorrelation();
double correlations = p_corrs.correlationNaNs(ranking1, ranking2);

System.out.println("correlations " + correlations);
{code}
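
For comparison, the idea of ignoring every tuple that contains a NaN can also be sketched without Commons Math: drop the incomplete pairs first, rank what is left with averaged ties, and take the Pearson correlation of the ranks. Everything in the block below (class name, method names, the choice to average ties) is an assumption made for illustration, not the library's behaviour.

```java
import java.util.Arrays;

// Hypothetical sketch, not Commons Math API: Spearman correlation that
// drops every (x, y) pair in which either value is NaN before ranking.
class PairwiseSpearmanSketch {

    static double correlation(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        // Keep only complete pairs so both columns stay aligned.
        double[] keptX = new double[x.length];
        double[] keptY = new double[y.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                keptX[n] = x[i];
                keptY[n] = y[i];
                n++;
            }
        }
        return pearson(rank(Arrays.copyOf(keptX, n)), rank(Arrays.copyOf(keptY, n)));
    }

    // Ranks 1..n; tied values get the mean of the positions they span.
    static double[] rank(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));
        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double average = ((start + 1) + (end + 1)) / 2.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = average;
            }
            start = end + 1;
        }
        return ranks;
    }

    // Plain Pearson correlation; returns NaN if either input is constant or empty.
    static double pearson(double[] a, double[] b) {
        double meanA = Arrays.stream(a).average().orElse(Double.NaN);
        double meanB = Arrays.stream(b).average().orElse(Double.NaN);
        double sumAB = 0, sumAA = 0, sumBB = 0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            sumAB += da * db;
            sumAA += da * da;
            sumBB += db * db;
        }
        return sumAB / Math.sqrt(sumAA * sumBB);
    }

    public static void main(String[] args) {
        double[] column1 = {Double.NaN, 1, 2};
        double[] column2 = {10, 2, 10};
        // Only (1, 2) and (2, 10) survive, so this prints 1.0
        System.out.println(correlation(column1, column2));
    }
}
```

Because the ranking in this sketch happens after the incomplete pairs are removed, the surviving values in the example have no ties and the ranks need no rounding.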

I hope my approach to handling NaNs is not missing anything. Any
comments on it would be appreciated.

Kind regards
Martin

