mahout-dev mailing list archives

From Sebastian Schelter <>
Subject Re: Bug in similarity computation
Date Wed, 06 Apr 2011 15:43:45 GMT
IIRC Sarwar's "Item-Based Collaborative Filtering Recommendation 
Algorithms" explicitly says to use only the co-rated cases for 
Pearson correlation.
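A minimal sketch (plain Python, not Mahout's actual code) of what restricting Pearson to the co-rated cases looks like; the user and item names are made up for illustration. Note that with only two co-rated items the correlation degenerates to ±1, which is part of what the thread below discusses:

```python
from math import sqrt

def pearson_corated(ratings_a, ratings_b):
    """Pearson correlation over only the items both users rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return None  # undefined with fewer than two co-rated items
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mx = sum(xs) / len(xs)  # means taken over the co-rated items only
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * \
          sqrt(sum((y - my) ** 2 for y in ys))
    return num / den if den else None

a = {"i1": 5, "i2": 3, "i3": 4}
b = {"i1": 4, "i2": 1, "i4": 5}
print(pearson_corated(a, b))  # 1.0 -- only i1 and i2 are co-rated
```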


On 06.04.2011 17:33, Sean Owen wrote:
> It's a good question.
> The Pearson correlation of two series does not change if the series
> means change. That is, subtracting the same value from all elements of
> one series (or scaling the values) doesn't change the correlation. In
> that sense, I would not say you must center the series to make either
> one's mean 0. It wouldn't make a difference, no matter what number you
> subtracted, even if it were the mean of all ratings by the user.
> The code you see in the project *does* center the data, because *if*
> the means are 0, then the computation result is the same as the cosine
> measure, and that seems nice. (There's also an uncentered cosine
> measure version.)
> What I think you're really getting at is, can't we expand the series
> to include all items that either one or the other user rated? Then the
> question is, what are the missing values you want to fill in? There's
> not a great answer to that, since any answer is artificial, but
> picking the user's mean rating is a decent choice. This is not quite
> the same as centering.
> You can do that in Mahout -- use AveragingPreferenceInferrer to do
> exactly this with these similarity metrics. It will slow things down
> and anecdotally I don't think it's worth it, but it's certainly there.
> I don't think the normal version, without a PreferenceInferrer, is
> "wrong". It is just implementing the Pearson correlation on all data
> available, and you have to add a setting to tell it to make up data.
> On Wed, Apr 6, 2011 at 3:13 PM, Alejandro Bellogin Kouki
> <>  wrote:
>> Hi all,
>> I've been using Mahout for many years now, mainly for my Master's thesis,
>> and now for my PhD thesis. First of all, I want to congratulate you on the
>> effort of releasing such a library as open source.
>> At this point, my main concern is recommendation, and because of that, I
>> have been using the different recommenders, evaluators and similarities
>> implemented in the library. However, today, after inspecting your code many
>> times, I have found what is, IMHO, a relevant bug with further implications.
>> It is related to the computation of the similarity. Although it is not
>> the only similarity implemented, Pearson's correlation is one of the most
>> popular ones. This similarity requires normalising (or "centering") the data
>> using the user's mean, in order to be able to distinguish a user who usually
>> rates items with 5's from a user who usually rates them with 3's, even
>> though both may have rated a particular item with a 5. The problem is that the
>> users' means are being calculated using ONLY the items in common between the
>> two users, leading to strange similarity computations (or worse, to no
>> similarity at all!). It is not difficult to find small examples showing this
>> behaviour; besides, seminal papers assume the overall mean rating is used
>> [1, 2].
>> Since I am a newbie with this patch and bug/fix terminology, I would like to
>> know the best (or the only?) way of reporting this finding. I have
>> to say that I have already fixed the code (it affects the
>> AbstractSimilarity class, and would therefore have an impact on other
>> similarity functions too).
>> Best regards,
>> Alejandro
>> [1] M. J. Pazzani: "A framework for collaborative, content-based and
>> demographic filtering". Artificial Intelligence Review 13, pp. 393-408. 1999
>> [2] C. Desrosiers, G. Karypis: "A comprehensive survey of neighborhood-based
>> recommendation methods". Recommender Systems Handbook, chapter 4. 2010
>> --
>>   Alejandro Bellogin Kouki
