Return-Path: Delivered-To: apmail-lucene-mahout-dev-archive@minotaur.apache.org Received: (qmail 98880 invoked from network); 28 Apr 2010 08:56:14 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 28 Apr 2010 08:56:14 -0000 Received: (qmail 24297 invoked by uid 500); 28 Apr 2010 08:56:14 -0000 Delivered-To: apmail-lucene-mahout-dev-archive@lucene.apache.org Received: (qmail 24155 invoked by uid 500); 28 Apr 2010 08:56:12 -0000 Mailing-List: contact mahout-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mahout-dev@lucene.apache.org Delivered-To: mailing list mahout-dev@lucene.apache.org Received: (qmail 24142 invoked by uid 99); 28 Apr 2010 08:56:11 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Apr 2010 08:56:11 +0000 X-ASF-Spam-Status: No, hits=-0.6 required=10.0 tests=AWL,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of srowen@gmail.com designates 72.14.220.156 as permitted sender) Received: from [72.14.220.156] (HELO fg-out-1718.google.com) (72.14.220.156) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Apr 2010 08:56:07 +0000 Received: by fg-out-1718.google.com with SMTP id l26so1818856fgb.5 for ; Wed, 28 Apr 2010 01:55:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:received:in-reply-to :references:date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=xeNG5YgU2UE0Z40p+9+W/bIakXSDeg5pVogNqLxeA/g=; b=VY5VbYq6+75cS5zUbFtA/lR3mDatGLwDWHwXBj4G1YgfLWOunV5WOsyfFGG13A/dHF l78LITXdFhq5mRzUJi0ytWy8S/rgAwPWYAbohqO1O1ajz46nwh02mYnII5wiJPys+nDY ZA1l2uHcQ7c1P8myzUuosjUF2ztbHe/zlG3w4= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=SNR2hnBPq3Yuy3oX0NdkVgPnJYvfF6+XR47elRvJXJu81Q3mDjV7ClogHyFKuy8+eZ D2qK+LGX5OW5rn02mKVG9YDNxsCBNcTztw48doA5rVjDhhC7xtW+m0ByyZbdNKEzoYZ6 wa9IeIWWsbPbvxPpoIduOdNJo4QsbkwT/Wlpw= MIME-Version: 1.0 Received: by 10.239.189.134 with SMTP id t6mr675695hbh.73.1272444945523; Wed, 28 Apr 2010 01:55:45 -0700 (PDT) Received: by 10.239.188.148 with HTTP; Wed, 28 Apr 2010 01:55:45 -0700 (PDT) In-Reply-To: <4BD7EC4F.5050800@zalando.de> References: <7270125.49731272428911412.JavaMail.jira@thor> <4BD7EC4F.5050800@zalando.de> Date: Wed, 28 Apr 2010 09:55:45 +0100 Message-ID: Subject: Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation From: Sean Owen To: mahout-dev@lucene.apache.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Well it's not hard to add something like UncenteredCosineSimilarity for sure, I don't mind. It's actually a matter of configuring the superclass to center or not. But it's also easy to center the data in the M/R. I agree it makes little difference in your case, and the effect is subtle. I can add centering, to at least have it in the code for consistency -- in addition to adding the implementation above for completeness. While I'm at it I think we might be able to simplify this item-item computation. A straightforward alternative is something like this: 1. Compute item vectors (using Vector output, ideally) in one M/R 2M. For each item-item pair output both vectors 2R. With both Vectors in hand easily compute the cosine measure It's quite simple, and lends itself to dropping in a different reducer stage to implement different similarities, which is great. The question is performance. Say I have P prefs, U users and I items. Assume P/I prefs per item. The bottleneck of this stage is outputting I^2 vectors of size P/I, or about P*I output. What you're doing now bottlenecks in the last bit where you output for each user some data for every pair of items. That's coarsely U * (P/U)^2 =3D U*P^2 output. I think that P^2 is killer. Double-check my thinking but if you like it I think I can significantly simplify this job. I can maybe make a patch for you to try. On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter wrote: > However, I'm a little bit confused now, let me explain why I thought > that this additional similarity implementation would be necessary: Three > weeks ago, I submitted a patch computing the cosine item similarities > via map-reduce (MAHOUT-362), which is how I currently fill my database > table of precomputed item-item-similarities. Some days ago I needed to > precompute lots of recommendations offline via > org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want > to connect my map-reduce code to the database. So I was in need of a > similarity implementation that would give me the same results on the fly > from a FileDataModel, which is how I came to create the > CosineItemSimilarity implementation. So my point here would be that if > the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not > center the data, a non-distributed implementation of that way of > computing the similarity should be available too, or the code in > org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to > center the data. > > You stated that centering the data "adjusts for a user's tendency to > rate high or low on average" which is certainly necessary when you > collect explicit ratings from your users. However in my usecase (a big > online-shopping platform) I unfortunately do not have explicit ratings > from the users, so I chose to interpret certain actions as ratings (I > recall this is called implicit ratings in the literature), e.g. a user > putting an item into their shopping basket as a rating of 3 or a user > purchasing an item as a rating of 5, like suggested in the "Making > Recommendations" chapter" of "Programming Collective Intelligence" by T. > Searagan. As far as I can judge (to be honest my mathematical knowledge > is kinda limited) there are no different interpretations of the rating > scala here as the values are fixed, so I thought that a centering of the > data would not be necessary. > > Regards, > Sebastian > > Sean Owen (JIRA) schrieb: >> =C2=A0 =C2=A0 =C2=A0[ https://issues.apache.org/jira/browse/MAHOUT-387?p= age=3Dcom.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] >> >> Sean Owen updated MAHOUT-387: >> ----------------------------- >> >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Status: Resolved =C2=A0(was: Pa= tch Available) >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Assignee: Sean Owen >> =C2=A0 =C2=A0 Fix Version/s: 0.3 >> =C2=A0 =C2=A0 =C2=A0 =C2=A0Resolution: Won't Fix >> >> Yes like Jeff said, this actually exists as PearsonCorrelationSimilarity= . In the case where the mean of each series is 0, the result is the same. T= he fastest way I know to see this is to just look at this form of the sampl= e correlation: http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b= 380c9bc4e27.png ... and note that sum(xi) =3D sum (yi) =3D 0 when the mean = of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is = the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, = which are the vector sizes. This is just the cosine of the angle between x = and y. >> >> One can argue whether forcing the data to be centered is right. I think = it's a good thing in all cases. It adjusts for a user's tendency to rate hi= gh or low on average. It also makes the computation simpler, and more consi= stent with Pearson (well, it makes it identical!). This has a good treatmen= t: >> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coeffici= ent#Geometric_interpretation >> >> Only for this reason I'd mark this as won't-fix for the moment; the patc= h is otherwise nice. I'd personally like to hear more about why to not cent= er if there's an argument for it. >> >> >>> Cosine item similarity implementation >>> ------------------------------------- >>> >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Key: MAHOUT-387 >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 URL: https://is= sues.apache.org/jira/browse/MAHOUT-387 >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Project: Mahout >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Issue Type: New Feature >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Components: Collaborative Filtering >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Reporter: Sebastian Schelter >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Assignee: Sean Owen >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Fix For: 0.3 >>> >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 Attachments: MAHOUT-387.patch >>> >>> >>> I needed to compute the cosine similarity between two items when runnin= g org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find = an implementation (did I overlook it maybe?) so I created my own. I want to= share it here, in case you find it useful. >>> >> >> > >