Subject: Re: MinHash/ItemBased
From: Vishal Santoshi <vishal.santoshi@gmail.com>
To: user@mahout.apache.org
Date: Tue, 25 Oct 2011 12:41:35 -0400

>> Minhash finds very, very similar documents. It is usually considered for
>> tasks such as duplicate detection. For recommendation, I would think that
>> you would like something broader.

That is essentially my grouse with MinHash. Google has enough history on a
user (or, might I say, the ability to browse through a year's worth of a
user's click patterns). That makes their recommendations broader: the more
data there is, the more likely it is to yield good, well-defined clusters.

The exactness of MinHash is what worries me. It does not seem to create
"realms", or a sense of what a user wants, so much as a sense of what the
user ends up seeing. They do say, though, that they MinHash within the
broad universe an article falls under (science, etc.).

I have not looked at PLSI, and so am not sure how that combination with
MinHash modifies the recommendations.

On Tue, Oct 25, 2011 at 12:22 PM, Ted Dunning wrote:

> My own preference for this kind of recommendation would be to recommend
> words and phrases and then use a search engine to find the articles that
> have those words and phrases in them. Engaging with an article would be
> tantamount to showing interest in all the words and phrases associated
> with the article. To avoid floods of data there, I would sparsify that by
> using LLR to find characteristic terms and phrases for each article.
>
> Item-based recommendation would only require recent history for new
> users. Their first page view would not be very informative, but after
> their first search or document view, they would be good to go.
> The virtue of this approach is that the set of words and phrases is
> fairly static, and thus the recommendations would not need frequent
> updates.
>
> A slightly simpler approach would be to simply search for words and
> phrases that occur anomalously often in the documents the user has
> engaged with. That can work, but it will not exhibit any spreading of
> terms to related terms, and thus will present only very similar
> documents.
>
> With either of these approaches, your data volumes would be fairly
> modest.
>
> Minhash finds very, very similar documents. It is usually considered for
> tasks such as duplicate detection. For recommendation, I would think
> that you would like something broader.
>
> On Tue, Oct 25, 2011 at 9:07 AM, Vishal Santoshi wrote:
>
> > Yep, please keep me posted.
> > BTW, this is exactly why MinHash piqued my curiosity, and that seems
> > to be affirmed by
> >
> > http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce
> >
> > MinHash scales, such that the offline periodic component (based on
> > Hadoop/Mahout; yes, Mahout has a MinHash-based clustering driver)
> > seems promising.
> > Again, please keep the forum posted on how you go about doing this.
> >
> > Regards,
> >
> > Vishal.
> >
> > On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen wrote:
> >
> > > Oh I see, right.
> > >
> > > Well, one general strategy is to use Hadoop to compute the
> > > recommendations regularly, but not nearly in real time. Then, use
> > > the latest data to imperfectly update the recommendations in real
> > > time. So, you always have slightly stale recommendations, and
> > > item-item similarities to fall back on, and are reloading those
> > > periodically. Then you're trying to update any recently changed item
> > > or user in real time using item-based recommendation, which can be
> > > fast.
> > > It's a really big topic in its own right, and there's no complete
> > > answer for you here, but you can piece this together from Mahout
> > > rather than build it from scratch.
> > >
> > > (This is more or less exactly what I have been working on
> > > separately: a hybrid Hadoop-based / real-time recommender that can
> > > handle this scale but also respond reasonably to new data.)
> > >
> > > On Tue, Oct 25, 2011 at 4:44 PM, Vishal Santoshi wrote:
> > >
> > > > They are all active in a day. I am talking about 8.3 million
> > > > active users a day.
> > > > A significant fraction of them will be new users (say about 2-3
> > > > million of them).
> > > > Further, the churn on items is likely to make historical
> > > > recommendations obsolete.
> > > > Thus, if I have recommendations that were good for user A
> > > > yesterday, they are likely to carry far less weight today.
> > > >
> > > > On Tue, Oct 25, 2011 at 11:32 AM, Sean Owen wrote:
> > > >
> > > >> On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi wrote:
> > > >>
> > > >> > In our case, a preference is a user clicking on an article
> > > >> > (which doubles as an item).
> > > >> > And these articles are introduced at a frequent rate. Thus the
> > > >> > set of new items appearing in the dataset churns very
> > > >> > frequently, and new items do not necessarily have any history.
> > > >> > Of course, we need to recommend the latest items.
> > > >>
> > > >> OK, but I'm still not seeing why all users need an update every
> > > >> time. Surely most of the 8.3M users aren't even active in a
> > > >> given day.
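[Editor's illustration] A toy sketch of why MinHash surfaces only very similar documents, as Ted notes in the thread: a signature takes the minimum of each of k hash functions over a document's token set, and the fraction of agreeing positions estimates Jaccard similarity, so only near-duplicates agree on many positions. The hash scheme and token sets below are invented for illustration; this is not Mahout's MinHash clustering code.

```python
import hashlib

def _h(seed, token):
    # Deterministic stand-in for a family of independent hash functions.
    return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)

def minhash_signature(tokens, num_hashes=256):
    # One minimum per hash function; similar sets share many minima.
    return [min(_h(seed, t) for t in tokens) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of positions where the signatures agree: an unbiased
    # estimate of the Jaccard similarity of the underlying sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = {"mahout", "minhash", "cluster", "hadoop", "reco", "scale"}
doc2 = {"mahout", "minhash", "cluster", "hadoop", "reco", "search"}  # near-duplicate
doc3 = {"football", "league", "season", "scores", "goals", "coach"}  # unrelated

sig1 = minhash_signature(doc1)
print(estimated_jaccard(sig1, minhash_signature(doc2)))  # high: true Jaccard is 5/7
print(estimated_jaccard(sig1, minhash_signature(doc3)))  # essentially 0.0
```

Only documents already sharing most of their tokens get similar signatures, which is why MinHash works for duplicate detection but gives little "broadening" for recommendation.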
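[Editor's illustration] Ted's "sparsify ... by using LLR" refers to his log-likelihood ratio (G-squared) test on a 2x2 contingency table of term counts. Mahout ships a version of this computation (the `LogLikelihood` utility), but the sketch below is written from the formula rather than from that code, and the counts are invented.

```python
import math

def _x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def _entropy(*counts):
    # Unnormalised entropy over raw counts.
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Dunning's G^2 for a 2x2 contingency table:
    #   k11: term in the user's engaged docs    k12: term elsewhere
    #   k21: other terms in engaged docs        k22: other terms elsewhere
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# A term seen 100 times among 1,000 engaged-document tokens but only 10
# times among 100,000 background tokens scores very high ...
print(llr(100, 900, 10, 99000))  # large: characteristic term, keep it
# ... while a term occurring at exactly the background rate scores ~0.
print(llr(10, 990, 10, 990))     # ~0.0: not characteristic, drop it
```

Keeping only the high-LLR term/article pairs is what trims the "flood" of term associations down to the characteristic few.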
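[Editor's illustration] Sean's stale-batch-plus-fresh-deltas pattern can be sketched as below. The class, method names, and data shapes are hypothetical; the item-item similarity table stands in for the output of the periodic offline Hadoop job, and the serving layer folds each user's recent clicks in at request time.

```python
from collections import defaultdict

class HybridRecommender:
    """Serve from a stale, batch-computed item-item similarity table,
    folding in each user's recent clicks at request time."""

    def __init__(self, item_similarity):
        # item_similarity: {item: {other_item: score}}, the output of the
        # periodic offline job; swap in a fresh table when the job finishes.
        self.sim = item_similarity
        self.recent = defaultdict(list)  # user -> clicks since last batch

    def record_click(self, user, item):
        self.recent[user].append(item)

    def recommend(self, user, n=3):
        # Item-based scoring over only the recent history, so a new click
        # changes recommendations without waiting for the next Hadoop run.
        seen = set(self.recent[user])
        scores = defaultdict(float)
        for clicked in seen:
            for other, s in self.sim.get(clicked, {}).items():
                if other not in seen:
                    scores[other] += s
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        return [item for item, _ in ranked[:n]]

sim = {"a1": {"a2": 0.9, "a3": 0.4}, "a2": {"a1": 0.9, "a4": 0.7}}
rec = HybridRecommender(sim)
rec.record_click("u1", "a1")
rec.record_click("u1", "a2")
print(rec.recommend("u1"))  # -> ['a4', 'a3']
```

Because scoring touches only a user's recent items, each request is cheap even with millions of daily actives; the expensive pairwise similarity work stays offline.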