# mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Mahout/Taste covariance between two items
Date Fri, 27 Nov 2009 12:01:04 GMT
```I'm not so familiar with this formula but you seem to be missing
something in the denominator... it's got to normalize somehow. I think
I said divide by standard deviation but that's not quite it. What you
are really summing are the products of z-scores.  See
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

But I think you should just use the formulation given in the code. It
should give the same result. At least I hope these aren't different
definitions of Pearson!

On Fri, Nov 27, 2009 at 10:20 AM, jamborta <jamborta@gmail.com> wrote:
>
> thank you. much clearer now.
>
> so for my purpose this will do:
>
> sumXY / (N-1)
>
> given that the data is 'centered'?
>
>
> On Fri, Nov 27, 2009 at 1:41 AM, jamborta <jamborta@gmail.com> wrote:
>>
>> hi. I tried to figure out how you calculate pearson correlation, but it
>> looks like you use this formula:
>>
>> sumXY / sqrt(sumX2 * sumY2)
>
> Yes that's right -- this is what Pearson reduces to when the means of X
> and Y are 0. And they are here -- the implementation 'centers' the
> data.
>
>> where sumXY = sumXY - meanY * sumX;
>> sumX2 = sumX2 - meanX * sumX;
>> sumY2 = sumY2 - meanY * sumY;
>
> You see the lines commented out there? Those are the full forms of the
> expressions, which may make more sense. This is centering the data,
> making the mean 0.
>
> This is a simplification based on the observation that, for example,
> sumX * meanY = sumY * meanX = n * meanY * meanX.
>
>>
>> i don't really understand how you got these equations. could you explain
>> it to me? I thought pearson correlation would be like this
>>
>> E[(x_i-meanX)(y_i-meanY)] / (sdX*sdY)
>
> That's right that's the expression for a population correlation, but
> we can really only compute a sample Pearson correlation coefficient,
> yes:
>
>
>> for my project I would need to get sample correlation coefficient which
>> would be something like this:
>>
>> sum(x_i-meanX)(y_i-meanY)/(N-1)
>
> Yeah that's fine too, this is another way of expressing the formula,
> though you're missing the two standard deviations in the denominator.
> It'll be clearer if I note that the mean of X and Y are 0.
>
>
>
> --
> View this message in context: http://old.nabble.com/Mahout-Taste-covariance-between-two-items-tp26530825p26540395.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>
>

```
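To make the centering shortcut discussed in the thread concrete, here is a minimal Python sketch. It is not Mahout's Java code: the function names (`centered_sums`, `pearson`, `sample_covariance`) are hypothetical and chosen only for illustration. It assumes two equal-length lists of numbers and applies the adjustments quoted above (`sumXY - meanY * sumX`, and so on) to get the centered sums, then forms the Pearson correlation and the sample covariance from them.

```python
import math


def centered_sums(xs, ys):
    """Return (sumXY, sumX2, sumY2) as if both series had been mean-centered.

    Uses the shortcut from the thread: subtracting mean * sum from each raw
    sum gives the same result as subtracting the mean from every value first.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    mean_x = sum_x / n
    mean_y = sum_y / n
    sum_xy = sum(x * y for x, y in zip(xs, ys)) - mean_y * sum_x
    sum_x2 = sum(x * x for x in xs) - mean_x * sum_x
    sum_y2 = sum(y * y for y in ys) - mean_y * sum_y
    return sum_xy, sum_x2, sum_y2


def pearson(xs, ys):
    """Pearson correlation: centered sumXY / sqrt(sumX2 * sumY2)."""
    sum_xy, sum_x2, sum_y2 = centered_sums(xs, ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


def sample_covariance(xs, ys):
    """Sample covariance: centered sumXY / (N - 1), with no normalization."""
    sum_xy, _, _ = centered_sums(xs, ys)
    return sum_xy / (len(xs) - 1)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(pearson(xs, ys))
print(sample_covariance(xs, ys))
```

The two functions differ only in the denominator: dividing by `sqrt(sumX2 * sumY2)` normalizes the result into [-1, 1], while dividing by `N - 1` leaves it in the units of the data, which is why the covariance alone is not a correlation.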