I really just want to get the sample covariance, which is:
sum(X_i - meanX)(Y_i - meanY)/(N-1)
This is just
pearson_x,y * sdX * sdY
so I think sumXY/(N-1), with the centered sumXY, should be the right one.
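
To check the identity above, here is a minimal sketch in Java, assuming plain double[] inputs of equal length. CovarianceSketch and its method names are hypothetical and not part of Mahout; only the centering shortcut (sumXY - meanY * sumX, and so on) is taken from the thread below.

```java
// Hypothetical helper for illustration only; not Mahout's API.
final class CovarianceSketch {

    // Centered cross sum: sum((a_i - meanA) * (b_i - meanB)),
    // computed with the shortcut sumAB - meanB * sumA.
    static double centeredSum(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double sumA = 0.0;
        double sumB = 0.0;
        double sumAB = 0.0;
        for (int i = 0; i < a.length; i++) {
            sumA += a[i];
            sumB += b[i];
            sumAB += a[i] * b[i];
        }
        double meanB = sumB / a.length;
        return sumAB - meanB * sumA;
    }

    // Sample covariance: centered sumXY / (N - 1).
    static double sampleCovariance(double[] x, double[] y) {
        return centeredSum(x, y) / (x.length - 1);
    }

    // Sample standard deviation: sqrt of the sample variance.
    static double sampleStdDev(double[] x) {
        return Math.sqrt(sampleCovariance(x, x));
    }

    // Pearson correlation on centered sums: sumXY / sqrt(sumX2 * sumY2).
    static double pearson(double[] x, double[] y) {
        return centeredSum(x, y) / Math.sqrt(centeredSum(x, x) * centeredSum(y, y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 4, 5, 4, 5};
        double cov = sampleCovariance(x, y);
        double viaPearson = pearson(x, y) * sampleStdDev(x) * sampleStdDev(y);
        // Both values should agree up to floating-point rounding.
        System.out.println("covariance          = " + cov);
        System.out.println("pearson * sdX * sdY = " + viaPearson);
    }
}
```

The (N-1) factors cancel inside the Pearson ratio, which is why the correlation formula never shows them while the covariance does. The one-pass shortcut can lose precision when the means are large relative to the spread of the data, so a two-pass version may be safer there.
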
srowen wrote:
>
> I'm not so familiar with this formula but you seem to be missing
> something in the denominator... it's got to normalize somehow. I think
> I said divide by standard deviation but that's not quite it. What you
> are really summing are the products of z-scores. See
> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
>
> But I think you should just use the formulation given in the code; it
> should be the same result. At least I hope these aren't different
> definitions of Pearson!
>
> On Fri, Nov 27, 2009 at 10:20 AM, jamborta <jamborta@gmail.com> wrote:
>>
>> thank you. much clearer now.
>>
>> so for my purpose this will do:
>>
>> sumXY/(N-1)
>>
>> given that the data is 'centered'?
>>
>>
>> On Fri, Nov 27, 2009 at 1:41 AM, jamborta <jamborta@gmail.com> wrote:
>>>
>>> hi. I tried to figure out how you calculate Pearson correlation, but it
>>> looks like you use this formula:
>>>
>>> sumXY / sqrt(sumX2 * sumY2)
>>
>> Yes, that's right - this is what Pearson reduces to when the means of X
>> and Y are 0. And they are here - the implementation 'centers' the
>> data.
>>
>>> where sumXY = sumXY - meanY * sumX;
>>> sumX2 = sumX2 - meanX * sumX;
>>> sumY2 = sumY2 - meanY * sumY;
>>
>> You see the lines commented out there? Those are the full forms of the
>> expressions, which may make more sense. This is centering the data,
>> making the mean 0.
>>
>> This is a simplification based on the observation that, for example,
>> sumX * meanY = sumY * meanX = n * meanY * meanX.
>>
>>>
>>> I don't really understand how you got these equations. Could you explain
>>> it to me? I thought Pearson correlation would be like this:
>>>
>>> E[(x_i - meanX)(y_i - meanY)] / (sdX * sdY)
>>
>> That's right, that's the expression for a population correlation, but
>> we can really only compute a sample Pearson correlation coefficient,
>> yes:
>>
>>
>>> for my project I would need to get sample correlation coefficient which
>>> would be something like this:
>>>
>>> sum(x_i - meanX)(y_i - meanY)/(N-1)
>>
>> Yeah that's fine too, this is another way of expressing the formula,
>> though you're missing the two standard deviations in the denominator.
>> It'll be clearer if I note that the mean of X and Y are 0.
>>
>>
>>
>> 
>> View this message in context:
>> http://old.nabble.com/MahoutTastecovariancebetweentwoitemstp26530825p26540395.html
>> Sent from the Mahout User List mailing list archive at Nabble.com.
>>
>>
>
>
