mahout-dev mailing list archives

From Sean Owen <>
Subject Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Date Wed, 28 Apr 2010 08:55:45 GMT
Well, it's not hard to add something like UncenteredCosineSimilarity
for sure; I don't mind. It's really just a matter of configuring the
superclass to center or not.

But it's also easy to center the data in the M/R. I agree it makes
little difference in your case, and the effect is subtle. I can add
centering, to at least have it in the code for consistency -- in
addition to adding the implementation above for completeness.
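Centering here just means subtracting each user's mean rating from that user's prefs before computing similarities. A minimal sketch of that step in plain Java (the class and method names are mine for illustration, not anything in Mahout):

```java
import java.util.Arrays;

/** Illustrative only: "centering" a user's prefs means subtracting
    that user's mean rating from every pref value. */
public class CenterPrefs {

    static double[] center(double[] prefs) {
        double sum = 0.0;
        for (double p : prefs) sum += p;
        double mean = sum / prefs.length;  // this user's average rating
        double[] centered = new double[prefs.length];
        for (int i = 0; i < prefs.length; i++) {
            centered[i] = prefs[i] - mean;  // adjust for high/low raters
        }
        return centered;
    }

    public static void main(String[] args) {
        // a user who rates everything high: 4, 5, 3 -> mean is 4
        System.out.println(Arrays.toString(center(new double[] {4.0, 5.0, 3.0})));
        // prints [0.0, 1.0, -1.0]
    }
}
```

In the M/R this would run per user before any item-item pairs are emitted, so all downstream similarity code sees zero-mean data.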

While I'm at it I think we might be able to simplify this item-item
computation. A straightforward alternative is something like this:

1. Compute item vectors (using Vector output, ideally) in one M/R
2M. For each item-item pair output both vectors
2R. With both Vectors in hand easily compute the cosine measure

It's quite simple, and lends itself to dropping in a different reducer
stage to implement different similarities, which is great.
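Step 2R, for instance, could be as simple as the following plain-Java sketch (no Hadoop plumbing shown; the class and method names are illustrative, not anything in Mahout — each item vector is modeled as a sparse map from user ID to pref value):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of step 2R above: with both item vectors in hand,
    computing the (uncentered) cosine measure is trivial. */
public class CosineSketch {

    /** Cosine between two sparse item vectors keyed by user ID. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            double va = e.getValue();
            normA += va * va;
            Double vb = b.get(e.getKey());
            if (vb != null) {
                dot += va * vb;  // only users who rated both items contribute
            }
        }
        for (double vb : b.values()) {
            normB += vb * vb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;  // no information: treat as no similarity
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<Long, Double> itemX = new HashMap<>();
        Map<Long, Double> itemY = new HashMap<>();
        itemX.put(1L, 3.0); itemX.put(2L, 5.0);
        itemY.put(1L, 3.0); itemY.put(2L, 5.0);
        System.out.println(cosine(itemX, itemY)); // identical vectors -> 1.0
    }
}
```

Swapping in Pearson, Tanimoto, etc. would only mean replacing this one method in the reducer, which is the appeal of the design.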

The question is performance. Say I have P prefs, U users and I items.
Assume P/I prefs per item. The bottleneck of this stage is outputting
I^2 vectors of size P/I, or about P*I output.

What you're doing now bottlenecks in the last step, where you output,
for each user, some data for every pair of items that user has rated.
That's coarsely U * (P/U)^2 = P^2/U output. I think that P^2 factor is killer.

Double-check my thinking, but if you like it I think I can
significantly simplify this job. I can maybe make a patch for you to

On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter
<> wrote:
> However, I'm a little bit confused now; let me explain why I thought
> that this additional similarity implementation would be necessary: Three
> weeks ago, I submitted a patch computing the cosine item similarities
> via map-reduce (MAHOUT-362), which is how I currently fill my database
> table of precomputed item-item-similarities. Some days ago I needed to
> precompute lots of recommendations offline via
> and didn't want
> to connect my map-reduce code to the database. So I was in need of a
> similarity implementation that would give me the same results on the fly
> from a FileDataModel, which is how I came to create the
> CosineItemSimilarity implementation. So my point here would be that if
> the code in does not
> center the data, a non-distributed implementation of that way of
> computing the similarity should be available too, or the code in
> should be changed to
> center the data.
> You stated that centering the data "adjusts for a user's tendency to
> rate high or low on average" which is certainly necessary when you
> collect explicit ratings from your users. However, in my use case (a big
> online-shopping platform) I unfortunately do not have explicit ratings
> from the users, so I chose to interpret certain actions as ratings (I
> recall this is called implicit ratings in the literature), e.g. a user
> putting an item into their shopping basket as a rating of 3 or a user
> purchasing an item as a rating of 5, as suggested in the "Making
> Recommendations" chapter of "Programming Collective Intelligence" by T.
> Segaran. As far as I can judge (to be honest my mathematical knowledge
> is kinda limited) there is no room for different interpretations of the
> rating scale here, since the values are fixed, so I thought that centering
> the data would not be necessary.
> Regards,
> Sebastian
> Sean Owen (JIRA) schrieb:
>>      [
>> Sean Owen updated MAHOUT-387:
>> -----------------------------
>>            Status: Resolved  (was: Patch Available)
>>          Assignee: Sean Owen
>>     Fix Version/s: 0.3
>>        Resolution: Won't Fix
>> Yes, like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the
>> case where the mean of each series is 0, the result is the same. The fastest way
>> I know to see this is to just look at this form of the sample correlation:
>> ... and note that sum(xi) = sum(yi) = 0 when the means of xi and yi are 0. You're
>> left with sum(xi*yi) in the numerator, which is the dot product, and
>> sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which are the vector
>> norms. This is just the cosine of the angle between x and y.
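(For reference, since the formula itself was elided above: the computational form of the sample correlation being described is presumably the standard one,

```latex
r = \frac{\sum_i x_i y_i \;-\; \tfrac{1}{n}\sum_i x_i \sum_i y_i}
         {\sqrt{\sum_i x_i^2 - \tfrac{1}{n}\left(\sum_i x_i\right)^2}\,
          \sqrt{\sum_i y_i^2 - \tfrac{1}{n}\left(\sum_i y_i\right)^2}}
```

and with $\sum_i x_i = \sum_i y_i = 0$ every correction term vanishes, leaving

```latex
r = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}
  = \cos(x, y),
```

which matches the argument: dot product over the product of the vector norms.)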
>> One can argue whether forcing the data to be centered is right. I think it's a
>> good thing in all cases. It adjusts for a user's tendency to rate high or low on
>> average. It also makes the computation simpler and more consistent with Pearson
>> (well, it makes it identical!). This has a good treatment:
>> For this reason alone I'd mark this as won't-fix for the moment; the patch is
>> otherwise nice. I'd personally like to hear more about why not to center, if
>> there's an argument for it.
>>> Cosine item similarity implementation
>>> -------------------------------------
>>>                 Key: MAHOUT-387
>>>                 URL:
>>>             Project: Mahout
>>>          Issue Type: New Feature
>>>          Components: Collaborative Filtering
>>>            Reporter: Sebastian Schelter
>>>            Assignee: Sean Owen
>>>             Fix For: 0.3
>>>         Attachments: MAHOUT-387.patch
>>> I needed to compute the cosine similarity between two items when running,
>>> and I couldn't find an implementation (did I overlook it, maybe?), so I
>>> created my own. I want to share it here in case you find it useful.
