Subject: Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!
From: Parimi Rohit
To: user@mahout.apache.org
Date: Wed, 1 Oct 2014 08:11:56 -0500

Thanks Ted! Will look into it.

Rohit

On Wed, Oct 1, 2014 at 1:04 AM, Ted Dunning wrote:

> Here is a paper that includes an analysis of voting patterns using LDA.
>
> http://arxiv.org/pdf/math/0604410.pdf
>
> On Tue, Sep 30, 2014 at 7:04 PM, Parimi Rohit wrote:
>
> > Ted,
> >
> > I know LDA can be used to model text data, but I have never used it
> > in this setting. Can you please give me some pointers on how I can
> > apply it here?
> >
> > Thanks,
> > Rohit
> >
> > On Tue, Sep 30, 2014 at 4:33 PM, Ted Dunning wrote:
> >
> > > This is an incredibly tiny dataset. If you delete singletons, it
> > > is likely to get significantly smaller.
> > >
> > > I think that something like LDA might work much better for you.
> > > It was designed to work on small data like this.
> > >
> > > On Tue, Sep 30, 2014 at 11:13 AM, Parimi Rohit wrote:
> > >
> > > > Ted, thanks for your response. Here is the information about
> > > > the approach and the datasets:
> > > >
> > > > I am using the ItemSimilarityJob and passing it
> > > > "itemID, userID, prefCount" tuples as input to compute
> > > > user-user similarity using LLR. I read about this approach in a
> > > > response to one of the Stack Overflow questions on calculating
> > > > user similarity using Mahout.
> > > >
> > > > Following are the stats for the datasets:
> > > >
> > > > Coauthor dataset:
> > > >
> > > > users = 29189
> > > > items = 140091
> > > > averageItemsClicked = 15.808660796875536
> > > >
> > > > Conference dataset:
> > > >
> > > > users = 29189
> > > > items = 2393
> > > > averageItemsClicked = 7.265099866388023
> > > >
> > > > Reference dataset:
> > > >
> > > > users = 29189
> > > > items = 201570
> > > > averageItemsClicked = 61.08564870327863
> > > >
> > > > By scale, did you mean the rating scale? If so, I am using
> > > > preference counts, not ratings.
> > > >
> > > > Thanks,
> > > > Rohit
> > > >
> > > > On Tue, Sep 30, 2014 at 12:08 AM, Ted Dunning wrote:
> > > >
> > > > > How are you using LLR to compute user similarity? It is
> > > > > normally used to compute item similarity.
> > > > >
> > > > > Also, what is your scale? How many users? How many items?
> > > > > How many actions per user?
> > > > >
> > > > > On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit
> > > > > <rohit.parimi@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am exploring a random-walk based algorithm for
> > > > > > recommender systems which works by propagating the item
> > > > > > preferences for users on the user-user graph. To do this,
> > > > > > I have to compute user-user similarity and form a
> > > > > > neighborhood. I have tried the following three simple
> > > > > > techniques to compute the score between two users and find
> > > > > > the neighborhood.
> > > > > >
> > > > > > 1. Score = (common items between users A and B) /
> > > > > >    (items preferred by A + items preferred by B)
> > > > > > 2. Scoring based on Mahout's Cosine Similarity
> > > > > > 3. Scoring based on Mahout's LogLikelihood similarity.
> > > > > >
> > > > > > My understanding is that similarity based on LogLikelihood
> > > > > > is more robust; however, I get better results using the
> > > > > > naive approach (technique 1 from the above list). The
> > > > > > problems I am addressing are collaborator recommendation,
> > > > > > conference recommendation and reference recommendation,
> > > > > > and the data has implicit feedback.
> > > > > >
> > > > > > So, my question is: are there any cases where the cosine
> > > > > > similarity and loglikelihood metrics fail to capture
> > > > > > similarity? For example, in the problems stated above,
> > > > > > users collaborate with only a few other users (based on
> > > > > > area of interest), publish in only a few conferences
> > > > > > (again based on area of interest) and refer to
> > > > > > publications in a specific domain. So, the preference
> > > > > > counts are fairly small compared to other domains
> > > > > > (music/video etc.).
> > > > > >
> > > > > > Secondly, for CosineSimilarity, should I treat the
> > > > > > preferences as boolean or use the counts? (I think the
> > > > > > loglikelihood metric does not take the preference counts
> > > > > > into account; correct me if I am wrong.)
> > > > > >
> > > > > > Any insight into this is much appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > Rohit
> > > > > >
> > > > > > p.s. Ted, Pat: I am following the discussion on the thread
> > > > > > "LogLikelihoodSimilarity Calculation". Your answers helped
> > > > > > me a lot to understand how it works, and made me wonder
> > > > > > why things are different in my case.
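
For readers comparing the three scores discussed in this thread, here is a minimal Python sketch. It is not Mahout's code: the function names (overlap_score, cosine, llr, user_llr) are made up for illustration, and the 2x2 table in user_llr is an assumption about how two users' item sets would be compared. The llr function is the standard log-likelihood ratio (G^2) statistic on a 2x2 contingency table, written in the entropy form. How Mahout turns that statistic into a similarity value is not shown here.

```python
import math


def overlap_score(items_a, items_b):
    """Technique 1 from the thread: |A & B| / (|A| + |B|)."""
    a, b = set(items_a), set(items_b)
    denom = len(a) + len(b)
    return len(a & b) / denom if denom else 0.0


def cosine(counts_a, counts_b, boolean=False):
    """Cosine between two users given as {item: preference count} dicts.

    With boolean=True every observed item gets weight 1, so the
    counts are ignored.
    """
    if boolean:
        counts_a = {item: 1.0 for item in counts_a}
        counts_b = {item: 1.0 for item in counts_b}
    dot = sum(v * counts_b[item] for item, v in counts_a.items()
              if item in counts_b)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def _x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0


def _entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)).
    return _x_log_x(sum(counts)) - sum(_x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def user_llr(items_a, items_b, num_items):
    """LLR between two users from their item sets only.

    Assumes num_items >= |A union B|. Preference counts never enter
    the table, only whether an item was seen by each user.
    """
    a, b = set(items_a), set(items_b)
    k11 = len(a & b)            # items both users have
    k12 = len(a) - k11          # items only A has
    k21 = len(b) - k11          # items only B has
    k22 = num_items - len(a) - len(b) + k11  # items neither has
    return llr(k11, k12, k21, k22)
```

Calling cosine(a, b) and cosine(a, b, boolean=True) on the same pair of users shows how much the counts alone move the score, while user_llr only ever sees the item sets.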