mahout-dev mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: Taste on Mahout
Date Wed, 21 May 2008 16:58:57 GMT
Correlation (per se) between such sparse binary vectors can be very
problematic.

This is a general problem with this kind of data and really needs to be
handled directly.  Not clicking on an item is much less informative than
clicking on an item (so little time, so much to click).  Any system you
build has to deal with that and with coincidence.  For instance, raw
correlation gives 100% match for two people who happen to have clicked on
the same single item.  If that item is a very popular one, however, this is
not a very interesting fact.

One very simple way of dealing with this was described in
http://citeseer.ist.psu.edu/29096.html .  Since then, I have found other,
more comprehensive techniques but they are considerably more complex.
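For concreteness, the simple method in that paper is a log-likelihood ratio (G²) test on the 2x2 co-occurrence counts for an item pair. A minimal sketch, using plain counts rather than any Mahout API (all names below are illustrative):

```java
public class LogLikelihood {

    private static double xLogX(double x) {
        return x == 0.0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: xLogX(sum) - sum of xLogX(count).
    private static double entropy(double... counts) {
        double sum = 0.0;
        double sumXLogX = 0.0;
        for (double c : counts) {
            sum += c;
            sumXLogX += xLogX(c);
        }
        return xLogX(sum) - sumXLogX;
    }

    // G^2 statistic for a 2x2 contingency table:
    // k11 = users who clicked both items, k12/k21 = users who clicked only one,
    // k22 = users who clicked neither.
    public static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + colEntropy - matEntropy);
    }

    public static void main(String[] args) {
        // Counts sitting exactly at independence: co-clicks are pure chance,
        // so the score is ~0 even though the raw co-click count (100) is high.
        System.out.println(llr(100, 900, 900, 8100));
        // Co-clicks far above chance score high.
        System.out.println(llr(100, 10, 10, 10000));
    }
}
```

This is why the score, unlike raw correlation, does not reward two items merely for both being popular.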

On Wed, May 21, 2008 at 9:29 AM, Sean Owen <srowen@gmail.com> wrote:

> Interesting, I was just having a separate conversation with another
> developer about best practices when your preferences are binary yes/no
> values rather than a number within a range.
>
> The standard techniques (well, that I have implemented) can have a bit
> of trouble with this sort of input, though it works. Already I wonder
> if we should put in place some new components that address this
> situation:
>
> - a new implementation of Preference that doesn't store a preference
> value, since that is wasteful; its mere existence is a "true"
>
> - a new implementation of UserCorrelation/ItemCorrelation that is more
> appropriate to binary data. It would probably report the ratio of the
> intersection to union of the two users' or items' preference sets
>
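The intersection-over-union idea in the second bullet is the Tanimoto (Jaccard) coefficient. A minimal sketch in plain Java sets, not the Taste API (class and method names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Similarity of two binary preference sets as |A ∩ B| / |A ∪ B|.
public class TanimotoSketch {

    public static double tanimoto(long[] itemsA, long[] itemsB) {
        Set<Long> a = new HashSet<>();
        for (long x : itemsA) {
            a.add(x);
        }
        Set<Long> b = new HashSet<>();
        for (long y : itemsB) {
            b.add(y);
        }
        int intersection = 0;
        for (long y : b) {
            if (a.contains(y)) {
                intersection++;
            }
        }
        int union = a.size() + b.size() - intersection;
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    public static void main(String[] args) {
        // Two users share items 3 and 4 out of five distinct items: 2/5.
        System.out.println(tanimoto(new long[]{1, 2, 3, 4}, new long[]{3, 4, 5}));
    }
}
```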
> I will pick up a copy of this book as it sounds like a good read and
> may refresh and expand my knowledge of techniques in this area.
>
>
> Yes comparing every item to every other is expensive. The ways to
> mitigate this are:
>
> - Do it offline, in batch, once a day or hour or something
>  - ... and maybe use Hadoop to do that, yes
>
> - Use sampling -- compare each item to only a random 10% of the other
> items to find items that are among the most similar, if not *the* most
> similar
>
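The sampling idea above might look like this; a sketch assuming items are numbered 0..M-1, with hypothetical names rather than Taste API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Instead of comparing an item to all M-1 others, compare it to a random
// fraction of them and keep the best matches found in that sample.
public class SampledNeighbors {

    public static List<Integer> sampleCandidates(int item, int numItems,
                                                 double fraction, Random rng) {
        List<Integer> others = new ArrayList<>();
        for (int i = 0; i < numItems; i++) {
            if (i != item) {
                others.add(i);
            }
        }
        Collections.shuffle(others, rng);
        return others.subList(0, (int) (others.size() * fraction));
    }

    public static void main(String[] args) {
        List<Integer> candidates = sampleCandidates(0, 10000, 0.1, new Random(42));
        // ~10% of the 9,999 other items; similarity would then be computed
        // only against these candidates.
        System.out.println(candidates.size());
    }
}
```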
> 10,000 items sounds reasonably small though; even doing it online may
> not be so bad, as long as you provide enough memory to the library for
> caching.
>
>
> Taste also has something like k-means clustering, in the
> "TreeClusteringRecommender". You would ask it to cluster to some
> number of clusters rather than until the cluster similarity reaches
> some threshold.
>
>
>
> On Wed, May 21, 2008 at 7:36 AM, Goel, Ankur <Ankur.Goel@corp.aol.com>
> wrote:
> > Hey Sean,
> >          Thanks for the suggestions. In my case the data-set is only
> > going to tell me if the user clicked on a particular item. So let's say
> > there are 10,000 items; a user might only have clicked 20-30 items. I
> > was thinking more on the lines of building an item similarity table by
> > comparing each item with every other item and retaining only 100 top
> > items decayed by time.
> >
> > So a recommender for a user would use his recent browsing history to
> > figure out top 10 or 20 most similar items.
> >
> > The approach is documented in Toby Segaran's "Collective Intelligence"
> > book and looks simple to implement even though it is costly since every
> > item needs to be compared with every other item. This can be
> > parallelized in such a way that for M items in a cluster of N machines,
> > each node has to compare M/N items to all M items. Since the data-set is
> > going to be sparse (in the no. of items having common users), I believe
> > this wouldn't be overwhelming for the cluster.
> >
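The M/N split described above can be sketched as a simple round-robin partition (illustrative names, not Mahout code):

```java
import java.util.ArrayList;
import java.util.List;

// For M items on N nodes, node n takes every N-th item; that node then
// compares only its ~M/N items against all M items.
public class PartitionSketch {

    public static List<Integer> itemsForNode(int node, int numNodes, int numItems) {
        List<Integer> mine = new ArrayList<>();
        for (int item = node; item < numItems; item += numNodes) {
            mine.add(item);
        }
        return mine;
    }

    public static void main(String[] args) {
        int numItems = 10;
        int numNodes = 4;
        for (int node = 0; node < numNodes; node++) {
            // Each node would compare these items against all numItems items.
            System.out.println("node " + node + " -> "
                + itemsForNode(node, numNodes, numItems));
        }
    }
}
```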
> > The other approach that I am thinking to reduce the computation cost is
> > to use a clustering algorithm like K-Means that's available in Mahout to
> > cluster similar user/items together and then use clustering information
> > to make recommendations.
> >
> > Any suggestions?
> >
> > Thanks
> > -Ankur
> >
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:srowen@gmail.com]
> > Sent: Tuesday, May 20, 2008 9:37 PM
> > To: mahout-dev@lucene.apache.org; Goel, Ankur
> > Subject: Re: Taste on Mahout
> >
> > + Ankur directly, since I am not sure you are on the dev list.
> >
> > On Tue, May 20, 2008 at 12:06 PM, Sean Owen <srowen@gmail.com> wrote:
> >> All of the algorithms assume a world where you have a continuous range
> >> of ratings from users for items. Obviously a binary yes/no rating can
> >> be mapped into that trivially -- 1 and -1 for example. This causes
> >> some issues, most notably for correlation-based recommenders where the
> >> correlation can be undefined between two items/users in special cases
> >> that arise from this kind of input -- for example if we overlap in
> >> rating 3 items and I voted "yes" for all 3, then no correlation can be
> >> defined.
> >>
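To see why: with identical ratings on the overlap, one user's variance is zero, so the Pearson correlation is 0/0. A minimal sketch in plain Java, not the Taste implementation:

```java
// Pearson correlation of two equal-length rating vectors.
public class PearsonDegenerate {

    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0.0;
        double sy = 0.0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
        }
        double mx = sx / n;
        double my = sy / n;
        double num = 0.0;
        double dx2 = 0.0;
        double dy2 = 0.0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx2 += (x[i] - mx) * (x[i] - mx);
            dy2 += (y[i] - my) * (y[i] - my);
        }
        // 0/0 -> NaN when either vector has zero variance.
        return num / Math.sqrt(dx2 * dy2);
    }

    public static void main(String[] args) {
        double[] mine = {1, 1, 1};     // "yes" on all three overlapping items
        double[] yours = {1, -1, 1};
        System.out.println(pearson(mine, yours)); // NaN: undefined
    }
}
```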
> >> Slope one doesn't run into this particular mathematical wrinkle.
> >>
> >> Also, methods like estimatePreference() are not going to give you
> >> estimates that are always 1 or -1. Again, you could map this back onto
> >> 1 / -1 by rounding or something, just something to note.
> >>
> >> So, in general it will be better if you can map whatever input you
> >> have onto a larger range of input. You will feed more information in,
> >> in this way, as well. For example, maybe you call a recent "yes"
> >> rating a +2, and a recent "no" a -2, and others +1 and -1.
> >>
> >>
> >> The part of slope one that parallelizes very well is the computing of
> >> the item-item diffs. No, I have not written this yet.
> >>
> >>
> >> I have committed a first cut at a framework for computing
> >> recommendations in parallel for any recommender. Dig in to
> >> org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
> >> existing recommenders can be parallelized, because they generally need
> >> access to all the data to produce any recommendation.
> >>
> >> But, we can take partial advantage of Hadoop by simply parallelizing
> >> the computation of recommendations for many users across multiple
> >> identical recommender instances. Better than nothing. In this
> >> situation, one of the map or reduce phases is trivial.
> >>
> >> That is what I have committed so far and it works, locally. I am in
> >> the middle of figuring out how to write it for real use on a remote
> >> Hadoop cluster, and how I would go about testing that!
> >>
> >> Do we have any test bed available?
> >>
> >>
> >>
> >> On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur <Ankur.Goel@corp.aol.com>
> >> wrote:
> >>> I just realized after going through the wikipedia that slope one is
> >>> applicable when you have ratings for the items.
> >>> In my case, I would be simply working with binary data (Item was
> >>> clicked or not-clicked by user) using Tanimoto coefficient to
> >>> calculate item similarity.
> >>> The idea is to capture the simple intuition "What items have been
> >>> visited most along with this item".
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> >>> Sent: Tuesday, May 20, 2008 2:51 PM
> >>> To: mahout-dev@lucene.apache.org
> >>> Subject: RE: Taste on Mahout
> >>>
> >>>
> >>> Hey Sean,
> >>>       I actually plan to use slope-one to start with since
> >>> - It's simple and known to work well.
> >>> - It can be parallelized nicely in the Map-Reduce style.
> >>> I also plan to use Tanimoto coefficient for item-item diffs.
> >>>
> >>> Do we have something on slope-one already in Taste as a part of
> >>> Mahout?
> >>>
> >>> At the moment I am going through the available documentation on Taste
> >>> and code that's present in Mahout.
> >>>
> >>> Your suggestions would be greatly appreciated.
> >>>
> >>> Thanks
> >>> -Ankur
> >>>
> >>> -----Original Message-----
> >>> From: Sean Owen [mailto:srowen@gmail.com]
> >>> Sent: Tuesday, April 29, 2008 11:09 PM
> >>> To: mahout-dev@lucene.apache.org; Goel, Ankur
> >>> Subject: Re: Taste on Mahout
> >>>
> >>> I have some Hadoop code mostly ready to go for Taste.
> >>>
> >>> The first thing to do is let you generate recommendations for all
> >>> your users via Hadoop. Unfortunately none of the recommenders truly
> >>> parallelize in the way that MapReduce needs it to -- you need all
> >>> data to compute any recommendation really -- but you can at least get
> >>> parallelization out of this. You can use the framework to run n
> >>> recommenders, each computing 1/nth of all recommendations.
> >>> recommenders, each computing 1/nth of all recommendations.
> >>>
> >>> The next application is specific to slope-one. Computing the
> >>> item-item diffs is exactly the kind of thing that MapReduce is good
> >>> for, so, writing a Hadoop job to do this seems like a no-brainer.
> >>>
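The item-item diff step referred to above can be sketched as follows; plain Java with a hypothetical NaN-for-missing convention, not the Taste or Hadoop code (a MapReduce version would emit per-pair diffs in map() and average them in reduce()):

```java
// Average rating difference between two items over users who rated both,
// the core statistic that slope one precomputes per item pair.
public class SlopeOneDiffs {

    // ratingsA[i] and ratingsB[i] are user i's ratings for items A and B;
    // NaN means that user did not rate the item.
    public static double averageDiff(double[] ratingsA, double[] ratingsB) {
        double sum = 0.0;
        int count = 0;
        for (int i = 0; i < ratingsA.length; i++) {
            if (!Double.isNaN(ratingsA[i]) && !Double.isNaN(ratingsB[i])) {
                sum += ratingsA[i] - ratingsB[i];
                count++;
            }
        }
        // NaN marks a pair no user has co-rated.
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        double[] itemA = {5.0, 4.0, Double.NaN};  // three users' ratings of A
        double[] itemB = {3.0, 2.0, 4.0};
        // Both co-raters rate A two higher, so slope one would predict
        // a user's rating of B plus this diff as their rating of A.
        System.out.println(averageDiff(itemA, itemB));
    }
}
```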
> >>> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur
> >>> <Ankur.Goel@corp.aol.com>
> >>> wrote:
> >>>> Hi Folks,
> >>>>       What's the status of hadoopifying Taste on Mahout?
> >>>>  What's been done and what is in progress/pending ?
> >>>>
> >>>>  I am looking at using a scalable version of Taste for my project.
> >>>>  So basically trying to figure out what's already done and where  I
> >>>> can pitch in.
> >>>>
> >>>>  Thanks
> >>>>  -Ankur
> >>>>
> >>>
> >>
> >
>



-- 
ted
