mahout-dev mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: Taste on Mahout
Date Thu, 22 May 2008 12:30:49 GMT
Hey Ted,
        I read the paper on LDA
(http://citeseer.ist.psu.edu/blei03latent.html) and I have to admit I
could not understand how LDA would be any different from PLSI for the
problem setting that I have (user-click history for various users and
URLs). Maybe it's my limited statistics and ML background, but I am
making my best effort to learn things as they come along.

I found the notation quite complex, and it would be nice if you could
point me to a source offering a simpler explanation of the LDA model
parameters and their estimation methods, as after reading the paper I
could not map those methods onto my problem setting.

Since I already have some understanding of PLSI and Expectation
Maximization, an explanation describing the role of the additional
model parameters and their estimation method would suffice. Maybe
that's something you could help me with offline.

Thanks
-Ankur



-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Wednesday, May 21, 2008 10:24 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Taste on Mahout

My suggestion is to build a class of probabilistic models of what people
click on.  You can build as many models as necessary to describe your
users' histories well.

These models will give you the answers you need.

I can talk this evening a bit about how to do this.  If you want to read
up on it ahead of time, take a look at
http://citeseer.ist.psu.edu/750239.html and
http://citeseer.ist.psu.edu/blei03latent.html

(hint: consider each person a document and a thing to be clicked as a
word)
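To make the hint concrete, here is a minimal sketch (with made-up users and URLs) of shaping a click log into the bag-of-words input a topic model like LDA or PLSI expects -- each user becomes a "document" and each clicked URL a "word", with repeat clicks acting as word counts:

```python
from collections import Counter

# Hypothetical click log: (user, url) pairs.
clicks = [
    ("alice", "news.example.com"), ("alice", "mail.example.com"),
    ("alice", "news.example.com"), ("bob", "sports.example.com"),
    ("bob", "news.example.com"),
]

# One bag-of-words "document" per user: exactly the term-count input
# a topic model consumes.
docs = {}
for user, url in clicks:
    docs.setdefault(user, Counter())[url] += 1

print(docs["alice"]["news.example.com"])  # alice clicked news twice -> 2
```

From here the same corpus feeds either PLSI or LDA; the difference is only in how the per-user topic mixtures are treated (fitted parameters vs. draws from a Dirichlet prior).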

On Wed, May 21, 2008 at 4:36 AM, Goel, Ankur <Ankur.Goel@corp.aol.com>
wrote:

> Hey Sean,
>          Thanks for the suggestions. In my case the data-set os only 
> going to tell me if the useer clicked on a particualar item. So lets 
> say there are 10,000 items a user might only have clicked 20 - 30 
> items. I was thinking more on the lines of building an item similarity

> table by comparing each item with every other item and retaining only 
> 100 top items decayed by time.
>
> So a recommender for a user would use his recent browsing history to 
> figure out top 10 or 20 most similar items.
>
> The approach is documented in Toby Segaran's "Collective Intelligence"
> book and looks simple to implement even though it is costly since 
> every item needs to be compared with every other item. This can be 
> parallelized in such a way that, for M items in a cluster of N
> machines, each node has to compare M/N items to all M items. Since
> the data set is going to be sparse (in the no. of items having common
> users), I believe this wouldn't be overwhelming for the cluster.
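The M/N partitioning described above can be sketched as follows; the item names and cluster size are made up for the example, and a real job would ship each slice to a separate node rather than loop in-process:

```python
def partition(items, n_nodes):
    """Split M items into N roughly equal slices, one per node."""
    m = len(items)
    size = (m + n_nodes - 1) // n_nodes  # ceiling division
    return [items[i:i + size] for i in range(0, m, size)]

items = [f"item{i}" for i in range(10)]   # M = 10 hypothetical items
slices = partition(items, 3)              # N = 3 "machines"

# Each node compares its slice (about M/N items) against all M items:
work = [(a, b) for sl in slices for a in sl for b in items if a != b]
print(len(slices), len(work))  # 3 slices; 10*9 = 90 comparisons total
```

The total work is still O(M^2) pairs; the partitioning only spreads it evenly, which matches the cost caveat above.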
>
> The other approach that I am thinking to reduce the computation cost 
> is to use a clustering algorithm like K-Means that's available in 
> Mahout to cluster similar user/items together and then use clustering 
> information to make recommendations.
>
> Any suggestions?
>
> Thanks
> -Ankur
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Tuesday, May 20, 2008 9:37 PM
> To: mahout-dev@lucene.apache.org; Goel, Ankur
> Subject: Re: Taste on Mahout
>
> + Ankur directly, since I am not sure you are on the dev list.
>
> On Tue, May 20, 2008 at 12:06 PM, Sean Owen <srowen@gmail.com> wrote:
> > All of the algorithms assume a world where you have a continuous
> > range of ratings from users for items. Obviously a binary yes/no
> > rating can be mapped into that trivially -- 1 and -1 for example.
> > This causes some issues, most notably for correlation-based
> > recommenders, where the correlation can be undefined between two
> > items/users in special cases that arise from this kind of input --
> > for example, if we overlap in rating 3 items and I voted "yes" for
> > all 3, then no correlation can be defined.
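The undefined-correlation case quoted above is easy to see numerically: when one user's overlapping ratings are all identical, the variance term in the Pearson formula is zero and the denominator vanishes. A small illustrative sketch (not Taste's actual implementation):

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns None when either series has zero
    variance, which is the undefined case from the binary-input example."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None  # all ratings identical -> denominator is zero
    return cov / math.sqrt(vx * vy)

# Two users overlapping on 3 items, one voting "yes" (1) every time:
print(pearson([1, 1, 1], [1, -1, 1]))  # None: no correlation defined
```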
> >
> > Slope one doesn't run into this particular mathematical wrinkle.
> >
> > Also, methods like estimatePreference() are not going to give you
> > estimates that are always 1 or -1. Again, you could map this back
> > onto 1 / -1 by rounding or something; just something to note.
> >
> > So, in general it will be better if you can map whatever input you
> > have onto a larger range of values. You will feed in more
> > information this way as well. For example, maybe you call a recent
> > "yes" rating a +2, a recent "no" a -2, and older ones +1 and -1.
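One possible reading of that mapping, as a sketch -- the seven-day "recent" cutoff is an assumption introduced here, not something from the thread:

```python
from datetime import datetime, timedelta

RECENT = timedelta(days=7)  # assumed cutoff for "recent"

def to_rating(clicked, when, now):
    """Map a binary click signal onto a wider range: recent yes -> +2,
    recent no -> -2, otherwise +1 / -1."""
    base = 1 if clicked else -1
    return base * (2 if now - when <= RECENT else 1)

now = datetime(2008, 5, 21)
print(to_rating(True, datetime(2008, 5, 20), now))   # recent yes -> 2
print(to_rating(False, datetime(2008, 4, 1), now))   # old no -> -1
```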
> >
> >
> > The part of slope one that parallelizes very well is the computing 
> > of the item-item diffs. No I have not written this yet.
> >
> >
> > I have committed a first cut at a framework for computing 
> > recommendations in parallel for any recommender. Dig in to 
> > org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
> > existing recommenders can be parallelized, because they generally
> > need access to all the data to produce any recommendation.
> >
> > But, we can take partial advantage of Hadoop by simply
> > parallelizing the computation of recommendations for many users
> > across multiple identical recommender instances. Better than
> > nothing. In this situation, one of the map or reduce phases is
> > trivial.
> >
> > That is what I have committed so far and it works, locally. I am in 
> > the middle of figuring out how to write it for real use on a remote 
> > Hadoop cluster, and how I would go about testing that!
> >
> > Do we have any test bed available?
> >
> >
> >
> > On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur 
> > <Ankur.Goel@corp.aol.com>
> wrote:
> >> I just realized after going through Wikipedia that slope one is
> >> applicable when you have ratings for the items.
> >> In my case, I would simply be working with binary data (item was
> >> clicked or not clicked by the user), using the Tanimoto coefficient
> >> to calculate item similarity.
> >> The idea is to capture the simple intuition "What items have been
> >> visited most along with this item?"
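The Tanimoto coefficient mentioned above is well defined for exactly this binary case: for two items it is the number of users who clicked both, divided by the number who clicked either. A minimal sketch with hypothetical user IDs:

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two click sets:
    |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Users who clicked item X vs. users who clicked item Y:
clicked_x = {"u1", "u2", "u3"}
clicked_y = {"u2", "u3", "u4"}
print(tanimoto(clicked_x, clicked_y))  # 2 common / 4 total = 0.5
```

Because it ignores rating magnitudes entirely, it sidesteps the undefined-correlation problem Sean describes earlier in the thread.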
> >>
> >>
> >> -----Original Message-----
> >> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> >> Sent: Tuesday, May 20, 2008 2:51 PM
> >> To: mahout-dev@lucene.apache.org
> >> Subject: RE: Taste on Mahout
> >>
> >>
> >> Hey Sean,
> >>       I actually plan to use slope-one to start with since
> >> - It's simple and known to work well.
> >> - It can be parallelized nicely in the MapReduce style.
> >> I also plan to use the Tanimoto coefficient for item-item diffs.
> >>
> >> Do we have something on slope-one already in Taste as a part of
> Mahout ?
> >>
> >> At the moment I am going through the available documentation on
> >> Taste and code that's present in Mahout.
> >>
> >> Your suggestions would be greatly appreciated.
> >>
> >> Thanks
> >> -Ankur
> >>
> >> -----Original Message-----
> >> From: Sean Owen [mailto:srowen@gmail.com]
> >> Sent: Tuesday, April 29, 2008 11:09 PM
> >> To: mahout-dev@lucene.apache.org; Goel, Ankur
> >> Subject: Re: Taste on Mahout
> >>
> >> I have some Hadoop code mostly ready to go for Taste.
> >>
> >> The first thing to do is let you generate recommendations for all
> >> your users via Hadoop. Unfortunately none of the recommenders
> >> truly parallelize in the way that MapReduce needs them to -- you
> >> really need all the data to compute any recommendation -- but you
> >> can at least get parallelization out of this. You can use the
> >> framework to run n recommenders, each computing 1/nth of all
> >> recommendations.
> >>
> >> The next application is specific to slope-one. Computing the
> >> item-item diffs is exactly the kind of thing that MapReduce is
> >> good for, so writing a Hadoop job to do this seems like a
> >> no-brainer.
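A toy in-process sketch of that slope-one diff job (made-up sample ratings; a real Hadoop job would emit the pairs from a Mapper and average them in a Reducer): the "map" phase emits a rating difference for every item pair within one user's ratings, and the "reduce" phase averages the differences per pair.

```python
from collections import defaultdict
from itertools import permutations

ratings = {  # user -> {item: rating}; hypothetical sample data
    "u1": {"A": 5, "B": 3, "C": 2},
    "u2": {"A": 3, "B": 4},
}

# "Map": emit (item_a, item_b) -> (rating_a - rating_b) per user.
emitted = defaultdict(list)
for user, prefs in ratings.items():
    for a, b in permutations(prefs, 2):
        emitted[(a, b)].append(prefs[a] - prefs[b])

# "Reduce": average the diffs for each item pair.
diffs = {pair: sum(v) / len(v) for pair, v in emitted.items()}
print(diffs[("A", "B")])  # ((5-3) + (3-4)) / 2 = 0.5
```

Each user's ratings fit in one map call and each item pair reduces independently, which is why the job partitions so cleanly.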
> >>
> >> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur 
> >> <Ankur.Goel@corp.aol.com>
> >> wrote:
> >>> Hi Folks,
> >>>       What's the status of Hadoopifying Taste on Mahout?
> >>>  What's been done and what is in progress/pending?
> >>>
> >>>  I am looking at using a scalable version of Taste for my project,
> >>>  so basically trying to figure out what's already done and where
> >>>  I can pitch in.
> >>>
> >>>  Thanks
> >>>  -Ankur
> >>>
> >>
> >
>



--
ted
