# mahout-user mailing list archives

From jamborta <jambo...@gmail.com>
Subject Re: Mahout/Taste covariance between two items
Date Fri, 27 Nov 2009 12:07:40 GMT
```
I really just want to get the sample covariance, which is:

sum((X_i - meanX) * (Y_i - meanY)) / (N - 1)

This is just

pearson_x,y * sdX * sdY

so I think sumXY / (N - 1) should be the right one, with sumXY taken over the centered data.

srowen wrote:
>
> I'm not so familiar with this formula but you seem to be missing
> something in the denominator... it's got to normalize somehow. I think
> I said divide by standard deviation but that's not quite it. What you
> are really summing are the products of z-scores.  See
> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
>
> But I think you should just use the formulation given in the code?
> should be the same result. At least I hope these aren't different
> definitions of Pearson!
>
> On Fri, Nov 27, 2009 at 10:20 AM, jamborta <jamborta@gmail.com> wrote:
>>
>> Thank you, much clearer now.
>>
>> So for my purpose this will do:
>>
>> sumXY / (N - 1)
>>
>> given that the data is 'centered'?
>>
>>
>> On Fri, Nov 27, 2009 at 1:41 AM, jamborta <jamborta@gmail.com> wrote:
>>>
>>> Hi. I tried to figure out how you calculate the Pearson correlation, but it
>>> looks like you use this formula:
>>>
>>> sumXY / sqrt(sumX2 * sumY2)
>>
>> Yes, that's right -- this is what Pearson reduces to when the means of X
>> and Y are 0. And they are here -- the implementation 'centers' the
>> data.
>>
>>> where sumXY = sumXY - meanY * sumX;
>>> sumX2 = sumX2 - meanX * sumX;
>>> sumY2 = sumY2 - meanY * sumY;
>>
>> You see the lines commented out there? Those are the full forms of the
>> expressions, which may make more sense. This is centering the data,
>> making the mean 0.
>>
>> This is a simplification based on the observation that, for example,
>> sumX * meanY = sumY * meanX = n * meanY * meanX.
>>
>>>
>>> I don't really understand how you got these equations. Could you explain
>>> it to me? I thought the Pearson correlation would be like this:
>>>
>>> E[(x_i - meanX) * (y_i - meanY)] / (sdX * sdY)
>>
>> That's right that's the expression for a population correlation, but
>> we can really only compute a sample Pearson correlation coefficient,
>> yes:
>>
>>
>>> for my project I would need to get sample correlation coefficient which
>>> would be something like this:
>>>
>>> sum(x_i-meanX)(y_i-meanY)/(N-1)
>>
>> Yeah that's fine too, this is another way of expressing the formula,
>> though you're missing the two standard deviations in the denominator.
>> It'll be clearer if I note that the means of X and Y are 0.
>>
>>
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Mahout-Taste-covariance-between-two-items-tp26530825p26540395.html
>> Sent from the Mahout User List mailing list archive at Nabble.com.
>>
>>
>
>

--
View this message in context: http://old.nabble.com/Mahout-Taste-covariance-between-two-items-tp26530825p26541591.html
Sent from the Mahout User List mailing list archive at Nabble.com.

```
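The relationship discussed in the thread can be checked numerically. The following is a minimal sketch, not Mahout's actual code: the class `CovarianceSketch` and all of its method names are hypothetical, chosen only for illustration. It uses the centering shortcut quoted above (`sumXY - meanY * sumX`) to compute the sample covariance, and compares it with `pearson * sdX * sdY`, assuming both standard deviations use the N-1 denominator.

```java
/** Hypothetical illustration class; not part of Mahout. */
final class CovarianceSketch {

    /** Centered sum of products: sum(a_i * b_i) - mean(b) * sum(a). */
    static double centeredSum(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double sumA = 0.0;
        double sumB = 0.0;
        double sumAB = 0.0;
        for (int i = 0; i < a.length; i++) {
            sumA += a[i];
            sumB += b[i];
            sumAB += a[i] * b[i];
        }
        double meanB = sumB / a.length;
        // Same shortcut as in the thread: sumXY = sumXY - meanY * sumX
        return sumAB - meanB * sumA;
    }

    /** Sample covariance: centered sumXY divided by (N - 1). */
    static double sampleCovariance(double[] x, double[] y) {
        return centeredSum(x, y) / (x.length - 1);
    }

    /** Sample standard deviation, using the same N - 1 denominator. */
    static double sampleStdDev(double[] v) {
        return Math.sqrt(centeredSum(v, v) / (v.length - 1));
    }

    /** Pearson correlation on centered sums: sumXY / sqrt(sumX2 * sumY2). */
    static double pearson(double[] x, double[] y) {
        // Returns NaN if either input has zero variance.
        return centeredSum(x, y) / Math.sqrt(centeredSum(x, x) * centeredSum(y, y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 9};
        double cov = sampleCovariance(x, y);
        double viaPearson = pearson(x, y) * sampleStdDev(x) * sampleStdDev(y);
        // The two values should agree up to floating-point rounding.
        System.out.println(cov + " vs " + viaPearson);
    }
}
```

The (N-1) factors cancel inside the Pearson ratio, which is why the correlation formula needs no explicit denominator while the covariance does. Note that the one-pass shortcut can lose precision when the means are large relative to the spread of the data; subtracting the means first in a separate pass is a common alternative.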