Mailing-List: user@mahout.apache.org
From: Ted Dunning
Date: Sat, 25 Jan 2014 15:56:40 -0800
Subject: Re: generic latent variable recommender question
To: user@mahout.apache.org

Dithering is commonly done by re-ranking results using a noisy score. Take
r to be the original rank (starting with 1). Then compute a score as

    s = log r + N(0, log epsilon)

and sort by this new score in ascending order. Items will be shuffled by
this method in such a way that the probability that item 2k will appear
before item k is nearly invariant with respect to k. Thus, item 3 will
appear before item 1 about as often as item 30 will appear before item 10.
The major effect here is to dredge deep results up onto the first page
(occasionally) so that the recommender has broader training data.

You can seed this with time in order to get the appearance of changing
recommendations even when no change in history is recorded. Moreover, the
time-varying seed can be held constant for a short period (a few minutes to
an hour or so) so that you also give the appearance of short-term
stability. Both of these effects seem to entice users back to a
recommendation page. Ironically, people seem more willing to return to the
first recommendation page than they are to click through to the second page.

This addition of random noise obviously makes your best recommendation
results worse. The penalty is worthwhile to the extent that your
recommender learns enough to make results better tomorrow. This has been my
universal experience for reasonable levels of dithering.
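The scheme described above can be sketched in a few lines. This is a minimal illustration, not Mahout code; the function names and the 15-minute seed window are assumptions chosen for the example.

```python
import math
import random
import time

def dither(ranked_items, epsilon=2.0, seed=None):
    """Re-rank items with a noisy score s = log(r) + N(0, log(epsilon)).

    ranked_items: list in original recommendation order (best first).
    epsilon > 1 controls mixing; the noise scale is log(epsilon),
    so epsilon = 1 means no dithering at all.
    """
    rng = random.Random(seed)
    sigma = math.log(epsilon)
    scored = [
        (math.log(r) + rng.gauss(0.0, sigma), item)
        for r, item in enumerate(ranked_items, start=1)
    ]
    scored.sort(key=lambda pair: pair[0])  # sort by noisy score, ascending
    return [item for _, item in scored]

def quarter_hour_seed(user_id):
    """Time-based seed held constant for 15 minutes (window size is an
    arbitrary choice here), giving changing recommendations between
    sessions with short-term stability within a session."""
    return hash((user_id, int(time.time() // 900)))
```

With a fixed seed the same shuffle is reproduced, which is exactly what the constant-for-a-short-period seed exploits.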
Anti-flood is quite a bit more heuristic and can be motivated by the idea
that recommenders are recommending individual items, but users are being
shown an entire portfolio of items on the first page. The probability of
making the user happy with any given page of recommendations is not
increased if you show items which are nearly identical, because if they
like one item, they will very, very likely like the others, and if they
don't like one, they likely won't like the others. On the other hand, if
you were to split the page between two groups of very distinctly different
kinds of items, then if you miss on one group, you don't have a guaranteed
miss on the second group; you have hedged your bets and will have better
user satisfaction.

How you accomplish this is largely a UI question. You could cluster the
items and show the users 1-2 items from each cluster with an option for
seeing the full cluster. You can also use a synthetic score approach where
you penalize items that are too similar to items higher in the results
list. The meaning of "too similar" is typically hand-crafted to your
domain. It might be a test for the same author, or the same genre, or
whatever you have handy.

On Sat, Jan 25, 2014 at 1:42 PM, Tevfik Aytekin wrote:

> Hi Ted,
> Could you explain what you mean by a "dithering step" and an
> "anti-flood step"?
> By dithering I guess you mean adding some sort of noise in order not
> to show the same results every time.
> But I have no clue about the anti-flood step.
>
> Tevfik
>
> On Sat, Jan 25, 2014 at 11:05 PM, Koobas wrote:
> > On Sat, Jan 25, 2014 at 3:51 PM, Tevfik Aytekin <
> tevfik.aytekin@gmail.com> wrote:
> >
> >> Case 1 is fine. In case 2, I don't think that a dot product (without
> >> normalization) will yield a meaningful distance measure. Cosine
> >> distance or a Pearson correlation would be better.
> >> The situation is
> >> similar to Latent Semantic Indexing, in which documents are represented
> >> by their low-rank approximations and similarities between them (that
> >> is, the approximations) are computed using cosine similarity.
> >> There is no need to do any normalization in case 1 since the values
> >> in the feature vectors are formed to approximate the rating values.
> >>
> > That's exactly what I was thinking.
> > Thanks for your reply.
> >
> >> On Sat, Jan 25, 2014 at 5:08 AM, Koobas wrote:
> >> > A generic latent variable recommender question.
> >> > I passed the user-item matrix through a low rank approximation,
> >> > with either something like ALS or SVD, and now I have the feature
> >> > vectors for all users and all items.
> >> >
> >> > Case 1:
> >> > I want to recommend items to a user.
> >> > I compute a dot product of the user's feature vector with all
> >> > feature vectors of all the items.
> >> > I eliminate the ones that the user already has, and find the
> >> > largest value among the others, right?
> >> >
> >> > Case 2:
> >> > I want to find similar items for an item.
> >> > Should I compute the dot product of the item's feature vector
> >> > against the feature vectors of all the other items?
> >> > OR
> >> > Should I compute the ANGLE between each pair of feature vectors?
> >> > I.e., compute the cosine similarity?
> >> > I.e., normalize the vectors before computing the dot products?
> >> >
> >> > If "yes" for case 2, is that something I should also do for case 1?
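The synthetic-score anti-flood approach described earlier in the thread can be sketched as a greedy re-ranking. This is an illustrative sketch, not a Mahout API; the function names, the 0.3 penalty, and the 0.8 threshold are arbitrary assumptions, and the similarity function stands in for whatever hand-crafted test (same author, same genre, ...) your domain suggests.

```python
def anti_flood(candidates, similarity, penalty=0.3, threshold=0.8):
    """Greedily build a result list, penalizing items that are too
    similar to items already placed higher in the list.

    candidates: list of (item, score) pairs, best score first.
    similarity(a, b): domain-specific similarity in [0, 1].
    """
    remaining = list(candidates)
    result = []
    while remaining:
        def adjusted(pair):
            item, score = pair
            # Count near-duplicates already chosen and penalize each one.
            hits = sum(1 for chosen, _ in result
                       if similarity(item, chosen) >= threshold)
            return score - penalty * hits
        best = max(remaining, key=adjusted)
        remaining.remove(best)
        result.append(best)
    return result
```

With a same-genre similarity test, a second rock track scored 0.95 would fall below a jazz track scored 0.90 once one rock track is already on the page, which is exactly the hedging-across-groups effect described above.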
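The two cases discussed in the thread (raw dot products for user-to-item recommendation, cosine similarity for item-to-item) can be illustrated with toy latent factors. The random matrices below merely stand in for the output of ALS or SVD; shapes and names are assumptions for the example.

```python
import numpy as np

# Toy stand-ins for factor matrices produced by ALS/SVD.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # 5 users x 3 latent factors
V = rng.normal(size=(8, 3))   # 8 items x 3 latent factors

# Case 1: recommend items to user u. Raw dot products approximate the
# ratings directly, so no normalization is needed.
u = 0
scores = V @ U[u]             # one predicted-rating score per item
ranked = np.argsort(-scores)  # best items first

# Case 2: items similar to item i. Normalize the item vectors first,
# i.e. compare by angle (cosine similarity), as in LSI.
i = 0
Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
sims = Vn @ Vn[i]
similar = np.argsort(-sims)
similar = similar[similar != i]  # drop the query item itself
```

In a real system you would still filter `ranked` against the user's history before presenting it, as described in the original question.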