From: Ted Dunning
Date: Fri, 7 Feb 2014 01:35:28 +0100
Subject: Re: Popularity of recommender items
To: "user@mahout.apache.org"

Rising popularity is often a better match to what people want to see on a
"most popular" page.

The best measure for that in my experience is

    log((new_count + offset) / (old_count + offset))

where new_count and old_count are the number of views during the periods in
question. The offset is used partly to avoid log(0) or x/0 problems, but also
to give a Bayesian grounding to the measure.

On Thu, Feb 6, 2014 at 5:33 PM, Sean Owen wrote:

> Agree - I thought by asking for most popular you meant to look for apple
> pie.
>
> Agree with you and Ted that the sum of similarity says something
> interesting even if it is not popularity exactly.
> On Feb 6, 2014 11:16 AM, "Pat Ferrel" wrote:
>
> > The problem with the usual preference count is that big hit items can be
> > overwhelmingly popular. If you want to know which ones the most people
> > saw and are likely to have an opinion about then this seems a good
> > measure. But these hugely popular items may not differentiate taste.
> >
> > So we calculate the “important” taste indicators with LLR. The benefit of
> > the similarity matrix is that it attempts to model the “important”
> > cooccurrences.
> >
> > There is an effect of hugely popular items where they really say nothing
> > about similarity of taste.
> > Everyone likes motherhood and apple pie, so it
> > doesn’t say much about us if we both do too. This is usually accounted for
> > with something like TFIDF, so I suppose another weighted popularity
> > measure would be to run the preference matrix through TFIDF to de-weight
> > non-differentiating preferences.
> >
> > On Feb 6, 2014, at 7:14 AM, Ted Dunning wrote:
> >
> > If you look at the indicator matrix (cooccurrence reduced by LLR), you
> > will usually have asymmetry due to limitations on the number of indicators
> > per row.
> >
> > This will give you some interesting results when you look at the column
> > sums. I wouldn't call it popularity, but it is an interesting measure.
> >
> > On Thu, Feb 6, 2014 at 2:15 PM, Sean Owen wrote:
> >
> > > I have always defined popularity as just the number of ratings/prefs,
> > > yes. You could rank on some kind of 'net promoter score' -- good
> > > ratings minus bad ratings -- though that becomes more like 'most
> > > liked'.
> > >
> > > How do you get popularity from similarity -- similarity to what?
> > > Ranking by sum of similarities seems more like a measure of how much
> > > the item is the 'centroid' of all items. Not necessarily most popular
> > > but 'least eccentric'.
> > >
> > > On Thu, Feb 6, 2014 at 7:41 AM, Tevfik Aytekin <tevfik.aytekin@gmail.com>
> > > wrote:
> > >> Well, I think what you are suggesting is to define popularity as being
> > >> similar to other items. So in this way most popular items will be
> > >> those which are most similar to all other items, like the centroids in
> > >> K-means.
> > >>
> > >> I would first check the correlation between this definition and the
> > >> standard one (that is, the definition of popularity as having the
> > >> highest number of ratings). But my intuition is that they are
> > >> different things. For example, an item might lie at the center in the
> > >> similarity space but it might not be a popular item.
> > >> However, there
> > >> might still be some correlation; it would be interesting to check it.
> > >>
> > >> hope it helps
> > >>
> > >> On Wed, Feb 5, 2014 at 3:27 AM, Pat Ferrel wrote:
> > >>> Trying to come up with a relative measure of popularity for items in a
> > >>> recommender. Something that could be used to rank items.
> > >>>
> > >>> The user-item preference matrix would be the obvious thought. Just
> > >>> add the number of preferences per item. Maybe transpose the preference
> > >>> matrix (the temp DRM created by the recommender), then for each row
> > >>> vector (now that a row = item) grab the number of non-zero preferences.
> > >>> This corresponds to the number of preferences, and would give one
> > >>> measure of popularity. In the case where the items are not boolean
> > >>> you'd sum the weights.
> > >>>
> > >>> However it might be a better idea to look at the item-item similarity
> > >>> matrix. It doesn't need to be transposed and contains the "important"
> > >>> similarities--as calculated by LLR for example. Here similarity means
> > >>> similarity in which users preferred an item. So summing the non-zero
> > >>> weights would give perhaps an even better relative "popularity"
> > >>> measure. For the same reason clustering the similarity matrix would
> > >>> yield "important" clusters.
> > >>>
> > >>> Anyone have intuition about this?
> > >>>
> > >>> I started to think about this because transposing the user-item
> > >>> matrix seems to yield a format that cannot be sent directly into
> > >>> clustering.
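
For anyone who wants to try the log-ratio measure from the reply at the top of this message, here is a minimal sketch in Python. The function names, the sample view counts, and the default offset of 10 are illustrative assumptions and do not come from the thread. The offset acts like a pseudo-count added to both periods, so items with very few views are pulled toward a score of zero.

```python
import math


def rising_popularity(new_count, old_count, offset=10.0):
    """Log ratio of smoothed view counts for a recent and an earlier period.

    offset is a hypothetical smoothing constant; it must be positive so that
    neither log(0) nor division by zero can occur.
    """
    if offset <= 0:
        raise ValueError("offset must be positive")
    return math.log((new_count + offset) / (old_count + offset))


def rank_rising(counts, offset=10.0):
    """Return item ids sorted from most rising to most falling.

    counts maps item id -> (old_count, new_count).
    """
    return sorted(
        counts,
        key=lambda item: rising_popularity(counts[item][1], counts[item][0], offset),
        reverse=True,
    )


# Made-up view counts: (views in the earlier period, views in the recent period).
views = {
    "big_hit": (10000, 11000),
    "newcomer": (5, 60),
    "fading": (400, 100),
    "noise": (0, 2),
}

ranking = rank_rising(views)
for item in ranking:
    old, new = views[item]
    print(f"{item:10s} {rising_popularity(new, old):+.3f}")
```

With these sample numbers the small but fast-growing item ranks first, while the established hit scores close to zero because its views grew by only about ten percent. A larger offset would damp the low-count items further; the right value depends on the traffic volume and is a tuning choice.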