I'm not so familiar with this formula, but you seem to be missing something in the denominator... it has to normalize somehow. I think I said to divide by the standard deviation, but that's not quite it. What you are really summing are the products of z-scores. See http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

But I think you should just use the formulation given in the code? It should give the same result. At least I hope these aren't different definitions of Pearson!

On Fri, Nov 27, 2009 at 10:20 AM, jamborta wrote:
>
> Thank you, much clearer now.
>
> So for my purpose this will do:
>
> sumXY / (N - 1)
>
> given that the data is 'centered'?
>
>
> On Fri, Nov 27, 2009 at 1:41 AM, jamborta wrote:
>>
>> Hi. I tried to figure out how you calculate Pearson correlation, but it
>> looks like you use this formula:
>>
>> sumXY / sqrt(sumX2 * sumY2)
>
> Yes, that's right: this is what Pearson reduces to when the means of X
> and Y are 0. And they are here, because the implementation 'centers' the
> data.
>
>> where sumXY = sumXY - meanY * sumX;
>> sumX2 = sumX2 - meanX * sumX;
>> sumY2 = sumY2 - meanY * sumY;
>
> You see the lines commented out there? Those are the full forms of the
> expressions, which may make more sense. This is centering the data,
> making the mean 0.
>
> This is a simplification based on the observation that, for example,
> sumX * meanY = sumY * meanX = n * meanY * meanX.
>
>> I don't really understand how you got these equations. Could you explain
>> it to me?
>> I thought Pearson correlation would be like this:
>>
>> E[(x_i - meanX)(y_i - meanY)] / (sdX * sdY)
>
> That's right, that's the expression for a population correlation, but
> we can really only compute a sample Pearson correlation coefficient,
> yes:
>
>> For my project I would need the sample correlation coefficient, which
>> would be something like this:
>>
>> sum((x_i - meanX)(y_i - meanY)) / (N - 1)
>
> Yeah, that's fine too. This is another way of expressing the formula,
> though you're missing the two standard deviations in the denominator.
> It'll be clearer if I note that the means of X and Y are 0.
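
For anyone following along, here is a minimal sketch in Java of the two formulations discussed in this thread. It is only an illustration, not the Mahout implementation: the class and method names (PearsonSketch, pearson, sampleCovariance, sampleStdDev) are made up for this example. It assumes two equal-length arrays with at least two values each. The first method uses the centering shortcut quoted above; the other two compute the sample covariance with the N - 1 divisor and then divide by the two sample standard deviations, which is the normalization that was missing.

```java
public class PearsonSketch {

    /** Pearson correlation from centered sums: sumXY / sqrt(sumX2 * sumY2). */
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        // Centering shortcut from the thread: since sumX = n * meanX,
        // sum((x_i - meanX)(y_i - meanY)) collapses to sumXY - meanY * sumX.
        double centeredSumXY = sumXY - meanY * sumX;
        double centeredSumX2 = sumX2 - meanX * sumX;
        double centeredSumY2 = sumY2 - meanY * sumY;
        double denominator = Math.sqrt(centeredSumX2 * centeredSumY2);
        // Correlation is undefined when either series has zero variance.
        return denominator == 0.0 ? Double.NaN : centeredSumXY / denominator;
    }

    /** Sample covariance: sum((x_i - meanX)(y_i - meanY)) / (N - 1). */
    public static double sampleCovariance(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (n - 1);
    }

    /** Sample standard deviation, i.e. sqrt of the covariance of x with itself. */
    public static double sampleStdDev(double[] x) {
        return Math.sqrt(sampleCovariance(x, x));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 1, 4, 3, 5};
        double r = pearson(x, y);
        double viaCovariance =
            sampleCovariance(x, y) / (sampleStdDev(x) * sampleStdDev(y));
        // Both routes should agree up to floating-point rounding.
        System.out.println("centered sums:       " + r);
        System.out.println("covariance / sd*sd:  " + viaCovariance);
    }
}
```

The (N - 1) factors cancel between the covariance and the two standard deviations, which is why the centered-sum form needs no explicit N at all. If you only want the covariance between two items, sampleCovariance alone is enough; it becomes a correlation only once you divide by both standard deviations.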