mahout-user mailing list archives

From Sean Owen <>
Subject Re: Questions about PearsonCorrelation on a example
Date Tue, 23 Jun 2009 23:20:24 GMT
Agreed. Again, that is why there is a weighting option in the Pearson
implementation: to de-emphasize computations based on small counts.

Do any of the approaches you cite take the value of the rating itself
into account? I agree, it seems like there should be some
alternative to Pearson / cosine-measure to offer, but right now it's
the only similarity metric that cares about the rating.
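
To make the weighting idea concrete, here is a minimal sketch in Python. It
is not the formula the Pearson implementation discussed here uses; it assumes
a simple "significance weighting" that scales the correlation by the number
of co-rated items. The names pearson, weighted_pearson and cutoff are
hypothetical, chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length rating lists."""
    n = len(xs)
    if n < 2:
        return 0.0  # undefined for fewer than two points; treat as no signal
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variance in one user's ratings; correlation undefined
    return sxy / sqrt(sxx * syy)


def weighted_pearson(ratings_a, ratings_b, cutoff=50):
    """Pearson over co-rated items, shrunk toward 0 when the overlap is small.

    ratings_a, ratings_b: dicts mapping item id -> rating.
    cutoff: assumed overlap size at which the correlation is fully trusted.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    return pearson(xs, ys) * min(len(common), cutoff) / cutoff
```

With only two co-rated items, a "perfect" correlation of 1.0 is scaled down
to 2/50 under the assumed cutoff, so it no longer dominates neighbors that
share many items.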

On Tue, Jun 23, 2009 at 7:17 PM, Ted Dunning<> wrote:
> To beat a very tired horse, I think that all squared error correlation
> measures (Pearson's chi-squared, Pearson's correlation, squared deviation
> and so on) are completely suspect for small count data.  Furthermore, any
> reasonable sample of truly long-tail phenomena includes great numbers of
> small counts.  Furtherfurthermore, long-tail phenomena are the rule rather
> than the exception.
> Thus, I almost never like these measures and would have a hard time arguing
> that there is anything good about this kind of measure.  The only exception
> would be in a pub where I would take any side of any debate for the
> amusement of the crowd.
> Try mutual information or multinomial likelihood ratios instead.
> On Tue, Jun 23, 2009 at 3:48 PM, Sean Owen <> wrote:
>> One could argue that this behavior is actually a good thing -- basing
>> an estimate of similarity on one data point could be very
>> unreliable.
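
For reference, the likelihood ratio suggestion quoted above can be sketched
as follows. This is a minimal Python illustration of the log-likelihood ratio
(G^2) statistic for a 2x2 table of co-occurrence counts, written in its
entropy form; the function names and the k11..k22 parameter names are
assumptions made for this example, not taken from the thread.

```python
from math import log


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts (N times H)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: count where both events occur (e.g. users who rated both items)
    k12: count where only the first occurs
    k21: count where only the second occurs
    k22: count where neither occurs
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb tiny negative values from rounding.
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))
```

The statistic is zero when the two events are independent and grows with
both the strength of the association and the amount of data behind it, so a
single co-occurrence scores low. Note that it works on counts only: it does
not use the rating values themselves.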
