Subject: Re: MinHash/ItemBased
From: Vishal Santoshi <vishal.santoshi@gmail.com>
To: user@mahout.apache.org
Date: Tue, 25 Oct 2011 12:41:35 -0400

>> Minhash finds very, very similar documents. It is usually considered for
>> tasks such as duplicate detection. For recommendation, I would think that
>> you would like something broader.

That is essentially my grouse with MinHash. Google has enough history on a
user (or, might I say, the ability to browse through a year's worth of a
user's click patterns). That makes their recommendations broader: the more
data there is, the more likely it is to yield good, well-defined clusters.

The exactness of MinHash is what worries me. It does not seem to create
"realms", or a sense of what a user wants, so much as a sense of what the
user ends up seeing. They do say, though, that they MinHash within the
broad universe an article falls under (science, etc.).

I have not looked at PLSI, and so am not sure how that combination with
MinHash modifies the recommendations.

On Tue, Oct 25, 2011 at 12:22 PM, Ted Dunning wrote:

> My own preference for this kind of recommendation would be to recommend
> words and phrases and then use a search engine to find the articles that
> have those words and phrases in them. Engaging with an article would be
> tantamount to showing interest in all the words and phrases associated
> with the article. To avoid floods of data there, I would sparsify that by
> using LLR to find characteristic terms and phrases for each article.
>
> Item-based recommendation would only require recent history for new
> users. Their first page view would not be very informative, but after
> their first search or document view, they would be good to go.
> The virtue of this approach is that the set of words and phrases is
> fairly static, and thus the recommendations would not need frequent
> updates.
>
> A slightly simpler approach would be to simply search for words and
> phrases that occur anomalously often in the documents the user has
> engaged with. That can work, but it will not exhibit any spreading of
> terms to related terms, and thus will present only very similar
> documents.
>
> With either of these approaches, your data volumes would be fairly
> modest.
>
> Minhash finds very, very similar documents. It is usually considered for
> tasks such as duplicate detection. For recommendation, I would think
> that you would like something broader.
>
> On Tue, Oct 25, 2011 at 9:07 AM, Vishal Santoshi wrote:
>
> > Yep, please keep me posted.
> > BTW, this is exactly why MinHash piqued my curiosity, and that seems
> > to be affirmed by
> >
> > http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce
> >
> > MinHash scales, such that the offline periodic component (based on
> > Hadoop/Mahout; yes, Mahout has a MinHash-based clustering driver)
> > seems promising.
> > Again, please keep the forum posted on how you go about doing this.
> >
> > Regards,
> >
> > Vishal.
> >
> > On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen wrote:
> >
> > > Oh I see, right.
> > >
> > > Well, one general strategy is to use Hadoop to compute the
> > > recommendations regularly, but not nearly in real time. Then, use
> > > the latest data to imperfectly update the recommendations in real
> > > time. So, you always have slightly stale recommendations, and
> > > item-item similarities to fall back on, and are reloading those
> > > periodically. Then you're trying to update any recently changed item
> > > or user in real time using item-based recommendation, which can be
> > > fast.
> > > It's a really big topic in its own right, and there's no complete
> > > answer for you here, but you can piece this together from Mahout
> > > rather than build it from scratch.
> > >
> > > (This is more or less exactly what I have been working on
> > > separately: a hybrid Hadoop-based / real-time recommender that can
> > > handle this scale but also respond reasonably to new data.)
> > >
> > > On Tue, Oct 25, 2011 at 4:44 PM, Vishal Santoshi wrote:
> > >
> > > > They are all active in a day. I am talking about 8.3 million
> > > > active users a day.
> > > > A significant fraction of them will be new users (say about 2-3
> > > > million of them).
> > > > Further, the churn on items is likely to make historical
> > > > recommendations obsolete.
> > > > Thus, if I have recommendations that were good for user A
> > > > yesterday, they are likely to carry far less weight today.
> > > >
> > > > On Tue, Oct 25, 2011 at 11:32 AM, Sean Owen wrote:
> > > >
> > > >> On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi wrote:
> > > >>
> > > >> > In our case, a preference is a user clicking on an article
> > > >> > (which doubles as an item).
> > > >> > And these articles are introduced at a frequent rate. Thus the
> > > >> > set of new items appearing in the dataset churns very
> > > >> > frequently, and new items do not necessarily have any history.
> > > >> > Of course, we need to recommend the latest items.
> > > >>
> > > >> OK, but I'm still not seeing why all users need an update every
> > > >> time. Surely most of the 8.3M users aren't even active in a
> > > >> given day.
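[Editor's illustration] A toy sketch of why MinHash surfaces only very similar documents, as Ted notes in the thread: a signature takes the minimum of each of k hash functions over a document's token set, and the fraction of agreeing positions estimates Jaccard similarity, so only near-duplicates agree on many positions. The hash scheme and token sets below are invented for illustration; this is not Mahout's MinHash clustering code.

```python
import hashlib

def _h(seed, token):
    # Deterministic stand-in for a family of independent hash functions.
    return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)

def minhash_signature(tokens, num_hashes=256):
    # One minimum per hash function; similar sets share many minima.
    return [min(_h(seed, t) for t in tokens) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of positions where the signatures agree: an unbiased
    # estimate of the Jaccard similarity of the underlying sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = {"mahout", "minhash", "cluster", "hadoop", "reco", "scale"}
doc2 = {"mahout", "minhash", "cluster", "hadoop", "reco", "search"}  # near-duplicate
doc3 = {"football", "league", "season", "scores", "goals", "coach"}  # unrelated

sig1 = minhash_signature(doc1)
print(estimated_jaccard(sig1, minhash_signature(doc2)))  # high: true Jaccard is 5/7
print(estimated_jaccard(sig1, minhash_signature(doc3)))  # essentially 0.0
```

Only documents already sharing most of their tokens get similar signatures, which is why MinHash works for duplicate detection but gives little "broadening" for recommendation.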
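[Editor's illustration] Ted's "sparsify ... by using LLR" refers to his log-likelihood ratio (G-squared) test on a 2x2 contingency table of term counts. Mahout ships a version of this computation (the `LogLikelihood` utility), but the sketch below is written from the formula rather than from that code, and the counts are invented.

```python
import math

def _x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def _entropy(*counts):
    # Unnormalised entropy over raw counts.
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Dunning's G^2 for a 2x2 contingency table:
    #   k11: term in the user's engaged docs    k12: term elsewhere
    #   k21: other terms in engaged docs        k22: other terms elsewhere
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# A term seen 100 times among 1,000 engaged-document tokens but only 10
# times among 100,000 background tokens scores very high ...
print(llr(100, 900, 10, 99000))  # large: characteristic term, keep it
# ... while a term occurring at exactly the background rate scores ~0.
print(llr(10, 990, 10, 990))     # ~0.0: not characteristic, drop it
```

Keeping only the high-LLR term/article pairs is what trims the "flood" of term associations down to the characteristic few.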
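[Editor's illustration] Sean's stale-batch-plus-fresh-deltas pattern can be sketched as below. The class, method names, and data shapes are hypothetical; the item-item similarity table stands in for the output of the periodic offline Hadoop job, and the serving layer folds each user's recent clicks in at request time.

```python
from collections import defaultdict

class HybridRecommender:
    """Serve from a stale, batch-computed item-item similarity table,
    folding in each user's recent clicks at request time."""

    def __init__(self, item_similarity):
        # item_similarity: {item: {other_item: score}}, the output of the
        # periodic offline job; swap in a fresh table when the job finishes.
        self.sim = item_similarity
        self.recent = defaultdict(list)  # user -> clicks since last batch

    def record_click(self, user, item):
        self.recent[user].append(item)

    def recommend(self, user, n=3):
        # Item-based scoring over only the recent history, so a new click
        # changes recommendations without waiting for the next Hadoop run.
        seen = set(self.recent[user])
        scores = defaultdict(float)
        for clicked in seen:
            for other, s in self.sim.get(clicked, {}).items():
                if other not in seen:
                    scores[other] += s
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        return [item for item, _ in ranked[:n]]

sim = {"a1": {"a2": 0.9, "a3": 0.4}, "a2": {"a1": 0.9, "a4": 0.7}}
rec = HybridRecommender(sim)
rec.record_click("u1", "a1")
rec.record_click("u1", "a2")
print(rec.recommend("u1"))  # -> ['a4', 'a3']
```

Because scoring touches only a user's recent items, each request is cheap even with millions of daily actives; the expensive pairwise similarity work stays offline.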