Mailing-List: contact mahout-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mahout-dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of srowen@gmail.com designates
 72.14.220.156 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type:content-transfer-encoding;
        b=SNR2hnBPq3Yuy3oX0NdkVgPnJYvfF6+XR47elRvJXJu81Q3mDjV7ClogHyFKuy8+eZ
         D2qK+LGX5OW5rn02mKVG9YDNxsCBNcTztw48doA5rVjDhhC7xtW+m0ByyZbdNKEzoYZ6
         wa9IeIWWsbPbvxPpoIduOdNJo4QsbkwT/Wlpw=
MIME-Version: 1.0
In-Reply-To: <4BD7EC4F.5050800@zalando.de>
References: <7270125.49731272428911412.JavaMail.jira@thor>
	 <4BD7EC4F.5050800@zalando.de>
Date: Wed, 28 Apr 2010 09:55:45 +0100
Message-ID: <y2me2e029611004280155ge2aac79bz95b9bf48ed0ae9cc@mail.gmail.com>
Subject: Re: [jira] Updated: (MAHOUT-387) Cosine item similarity
	implementation
From: Sean Owen <srowen@gmail.com>
To: mahout-dev@lucene.apache.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Well it's not hard to add something like UncenteredCosineSimilarity
for sure, I don't mind. It's actually a matter of configuring the
superclass to center or not.

But it's also easy to center the data in the M/R. I agree it makes
little difference in your case, and the effect is subtle. I can add
centering, to at least have it in the code for consistency -- in
addition to adding the implementation above for completeness.

While I'm at it I think we might be able to simplify this item-item
computation. A straightforward alternative is something like this:

1. Compute item vectors (using Vector output, ideally) in one M/R
2M. For each item-item pair output both vectors
2R. With both Vectors in hand easily compute the cosine measure

It's quite simple, and lends itself to dropping in a different reducer
stage to implement different similarities, which is great.

The question is performance. Say I have P prefs, U users and I items.
Assume P/I prefs per item. The bottleneck of this stage is outputting
I^2 vectors of size P/I, or about P*I output.

What you're doing now bottlenecks in the last bit where you output for
each user some data for every pair of items. That's coarsely U *
(P/U)^2 =3D U*P^2 output. I think that P^2 is killer.

Double-check my thinking but if you like it I think I can
significantly simplify this job. I can maybe make a patch for you to
try.

On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter
<sebastian.schelter@zalando.de> wrote:
> However, I'm a little bit confused now, let me explain why I thought
> that this additional similarity implementation would be necessary: Three
> weeks ago, I submitted a patch computing the cosine item similarities
> via map-reduce (MAHOUT-362), which is how I currently fill my database
> table of precomputed item-item-similarities. Some days ago I needed to
> precompute lots of recommendations offline via
> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
> to connect my map-reduce code to the database. So I was in need of a
> similarity implementation that would give me the same results on the fly
> from a FileDataModel, which is how I came to create the
> CosineItemSimilarity implementation. So my point here would be that if
> the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not
> center the data, a non-distributed implementation of that way of
> computing the similarity should be available too, or the code in
> org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
> center the data.
>
> You stated that centering the data "adjusts for a user's tendency to
> rate high or low on average" which is certainly necessary when you
> collect explicit ratings from your users. However in my usecase (a big
> online-shopping platform) I unfortunately do not have explicit ratings
> from the users, so I chose to interpret certain actions as ratings (I
> recall this is called implicit ratings in the literature), e.g. a user
> putting an item into their shopping basket as a rating of 3 or a user
> purchasing an item as a rating of 5, like suggested in the "Making
> Recommendations" chapter" of "Programming Collective Intelligence" by T.
> Searagan. As far as I can judge (to be honest my mathematical knowledge
> is kinda limited) there are no different interpretations of the rating
> scala here as the values are fixed, so I thought that a centering of the
> data would not be necessary.
>
> Regards,
> Sebastian
>
> Sean Owen (JIRA) schrieb:
>> =C2=A0 =C2=A0 =C2=A0[ https://issues.apache.org/jira/browse/MAHOUT-387?p=
age=3Dcom.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>
>> Sean Owen updated MAHOUT-387:
>> -----------------------------
>>
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Status: Resolved =C2=A0(was: Pa=
tch Available)
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Assignee: Sean Owen
>> =C2=A0 =C2=A0 Fix Version/s: 0.3
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0Resolution: Won't Fix
>>
>> Yes like Jeff said, this actually exists as PearsonCorrelationSimilarity=
. In the case where the mean of each series is 0, the result is the same. T=
he fastest way I know to see this is to just look at this form of the sampl=
e correlation: http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b=
380c9bc4e27.png ... and note that sum(xi) =3D sum (yi) =3D 0 when the mean =
of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is =
the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, =
which are the vector sizes. This is just the cosine of the angle between x =
and y.
>>
>> One can argue whether forcing the data to be centered is right. I think =
it's a good thing in all cases. It adjusts for a user's tendency to rate hi=
gh or low on average. It also makes the computation simpler, and more consi=
stent with Pearson (well, it makes it identical!). This has a good treatmen=
t:
>> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coeffici=
ent#Geometric_interpretation
>>
>> Only for this reason I'd mark this as won't-fix for the moment; the patc=
h is otherwise nice. I'd personally like to hear more about why to not cent=
er if there's an argument for it.
>>
>>
>>> Cosine item similarity implementation
>>> -------------------------------------
>>>
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Key: MAHOUT-387
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 URL: https://is=
sues.apache.org/jira/browse/MAHOUT-387
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Project: Mahout
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Issue Type: New Feature
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Components: Collaborative Filtering
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Reporter: Sebastian Schelter
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Assignee: Sean Owen
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Fix For: 0.3
>>>
>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 Attachments: MAHOUT-387.patch
>>>
>>>
>>> I needed to compute the cosine similarity between two items when runnin=
g org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find =
an implementation (did I overlook it maybe?) so I created my own. I want to=
 share it here, in case you find it useful.
>>>
>>
>>
>
>