mahout-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Bug in similarity computation
Date Wed, 06 Apr 2011 15:33:11 GMT
It's a good question.

The Pearson correlation of two series does not change if the series
means change. That is, subtracting the same value from all elements of
one series (or scaling it by a positive constant) doesn't change the
correlation. In that sense, I would not say you must center the series
to make either one's mean 0. It wouldn't make a difference, no matter
what number you subtracted, even if it were the mean of all ratings by
the user.
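
To spell that out, for two length-n series x and y the correlation is

    r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
                   {\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}

and for any constants a > 0 and b, r(ax + b, y) = r(x, y): the shift b
cancels inside each x_i - \bar{x}, and the factor a cancels between the
numerator and the denominator.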

The code you see in the project *does* center the data, because *if*
the means are 0, then the computation result is the same as the cosine
measure, and that seems nice. (There's also an uncentered cosine
measure version.)
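
A quick toy check of both points, in plain Java with made-up ratings
(no Mahout classes involved):

    // Pearson is unchanged when one series is shifted, and on mean-centered
    // data it coincides with the plain cosine measure.
    public class PearsonDemo {

      static double mean(double[] v) {
        double s = 0.0;
        for (double x : v) {
          s += x;
        }
        return s / v.length;
      }

      static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          na += a[i] * a[i];
          nb += b[i] * b[i];
        }
        return dot / Math.sqrt(na * nb);
      }

      static double pearson(double[] a, double[] b) {
        double ma = mean(a), mb = mean(b);
        double num = 0.0, da = 0.0, db = 0.0;
        for (int i = 0; i < a.length; i++) {
          num += (a[i] - ma) * (b[i] - mb);
          da += (a[i] - ma) * (a[i] - ma);
          db += (b[i] - mb) * (b[i] - mb);
        }
        return num / Math.sqrt(da * db);
      }

      public static void main(String[] args) {
        double[] x = {5.0, 3.0, 4.0, 1.0};
        double[] y = {4.0, 2.0, 5.0, 2.0};
        double[] xShifted = {7.0, 5.0, 6.0, 3.0};   // x with 2 added everywhere

        System.out.println(pearson(x, y));          // some value r
        System.out.println(pearson(xShifted, y));   // the same value r

        // Center both series and take the plain cosine: the same number again.
        double mx = mean(x), my = mean(y);
        double[] cx = new double[x.length], cy = new double[y.length];
        for (int i = 0; i < x.length; i++) {
          cx[i] = x[i] - mx;
          cy[i] = y[i] - my;
        }
        System.out.println(cosine(cx, cy));
      }
    }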


What I think you're really getting at is, can't we expand the series
to include all items that either one or the other user rated? Then the
question is, what are the missing values you want to fill in? There's
not a great answer to that, since any answer is artificial, but
picking the user's mean rating is a decent choice. This is not quite
the same as centering.

You can do that in Mahout -- use AveragingPreferenceInferrer to do
exactly this with these similarity metrics. It will slow things down
and anecdotally I don't think it's worth it, but it's certainly there.
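
For reference, wiring that up looks roughly like this (a minimal sketch
against the Taste classes named above; the ratings file and the user IDs
are placeholders):

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class InferrerExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv holds userID,itemID,preference lines; the path is a placeholder
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Infer a preference (the user's average rating) for items a user hasn't
        // rated, so the correlation runs over the union of the two users' items.
        similarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));

        System.out.println(similarity.userSimilarity(1L, 2L));  // placeholder IDs
      }
    }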

I don't think the normal version, without a PreferenceInferrer, is
"wrong". It is just implementing the Pearson correlation over the data
that actually exists for both users, and you have to add a setting to
tell it to make up data.



On Wed, Apr 6, 2011 at 3:13 PM, Alejandro Bellogin Kouki
<alejandro.bellogin@uam.es> wrote:
> Hi all,
>
> I've been using Mahout for many years now, mainly for my Master's thesis,
> and now for my PhD thesis. Because of that, I first want to congratulate you
> on the effort of releasing such a library as open source.
>
> At this point, my main focus is recommendation, and because of that I
> have been using the different recommenders, evaluators and similarities
> implemented in the library. However, today, after inspecting your code
> many times, I have found what is, IMHO, a relevant bug with further
> implications.
>
> It is related to the computation of the similarity. Although it is not the
> only similarity implemented, Pearson's correlation is one of the most
> popular ones. This similarity requires normalising (or "centering") the data
> using the user's mean, in order to distinguish a user who usually rates
> items with 5's from a user who usually rates them with 3's, even though
> both may have rated a particular item with a 5. The problem is that the
> users' means are being calculated using ONLY the items in common between
> the two users, leading to strange similarity computations (or worse, to no
> similarity at all!). It is not difficult to find small examples showing
> this behaviour; besides, seminal papers assume the user's mean over all of
> his or her ratings is used [1, 2].
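> For instance, suppose users A and B have only two items in common and both
> rated each of them with a 5. Centering with the mean over the common items
> alone turns both profiles into (0, 0), so the correlation is 0/0 and no
> similarity is computed at all, even though the two users agree perfectly
> on every item they share.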
>
> Since I am a newbie with the patch and bug-fixing process, I would like to
> know the best (or the only?) way of contributing this finding. I should say
> that I have already fixed the code (the change affects the
> AbstractSimilarity class, and therefore it would have an impact on other
> similarity functions too).
>
> Best regards,
> Alejandro
>
> [1] M. J. Pazzani: "A framework for collaborative, content-based and
> demographic filtering". Artificial Intelligence Review 13, pp. 393-408. 1999
> [2] C. Desrosiers, G. Karypis: "A comprehensive survey of neighborhood-based
> recommendation methods". Recommender Systems Handbook, chapter 4. 2010
>
> --
>  Alejandro Bellogin Kouki
>  http://rincon.uam.es/dir?cw=435275268554687
>
>
