mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
Date Tue, 20 Jan 2009 22:09:38 GMT
User-based recommendations are very close to item-based recommendations.

If you view all of your user histories as a sparse matrix H that is users x
items, then the users whose histories are similar to a particular history
vector x can be computed as (roughly) H x.

Then if you want to look at what items those users have interacted with, you
multiply again to get H' (H x).

Because multiplication is associative, this is the same as (H' H) x, which
is just item-level recommendation.
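
To make that concrete, here is a minimal, self-contained sketch with tiny
dense arrays (the class and helper names are made up for illustration; a
real system would use sparse structures):

    // Tiny dense illustration that H' (H x) equals (H' H) x.
    // H is users x items; x is one user's item-history vector.
    public class AssociativityDemo {

      // matrix-vector product
      static double[] times(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++)
          for (int j = 0; j < v.length; j++)
            r[i] += m[i][j] * v[j];
        return r;
      }

      // matrix transpose
      static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
          for (int j = 0; j < m[0].length; j++)
            t[j][i] = m[i][j];
        return t;
      }

      // matrix-matrix product
      static double[][] times(double[][] a, double[][] b) {
        double[][] r = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
          for (int k = 0; k < b.length; k++)
            for (int j = 0; j < b[0].length; j++)
              r[i][j] += a[i][k] * b[k][j];
        return r;
      }

      public static void main(String[] args) {
        double[][] H = {{1, 0, 1}, {0, 1, 1}, {1, 1, 0}}; // 3 users x 3 items
        double[] x = {1, 0, 1};                           // one user's history
        // user-based: find similar users (H x), then their items (H' ...)
        double[] userBased = times(transpose(H), times(H, x));
        // item-based: item-item matrix (H' H) first, then one product
        double[] itemBased = times(times(transpose(H), H), x);
        System.out.println(java.util.Arrays.toString(userBased));
        System.out.println(java.util.Arrays.toString(itemBased)); // identical
      }
    }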

Now, it is true that H shouldn't really be exactly the user histories
(sampling and scaling will intervene), it is true that H x isn't really
quite multiplication (you will probably want to trim the number of non-zero
elements, for example, as in the sketch after this paragraph), and it is
true that, because of all these imprecisions, H' (H x) isn't quite the same
as (H' H) x.  It is pretty darned close, however, and the pattern of
computation is substantially the same no matter what you do.
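
For instance, the trimming might look something like the following
hypothetical helper (not Mahout API; it could sit in the sketch class
above):

    // Keep only the k largest scores, zeroing the rest; ties at the
    // threshold may keep a few extra.  Pruning like this is what makes
    // "H x" not quite an exact matrix-vector product.
    static double[] trimTopK(double[] scores, int k) {
      double[] sorted = scores.clone();
      java.util.Arrays.sort(sorted);                       // ascending
      double threshold = sorted[Math.max(0, sorted.length - k)];
      double[] trimmed = new double[scores.length];
      for (int i = 0; i < scores.length; i++) {
        trimmed[i] = scores[i] >= threshold ? scores[i] : 0.0;
      }
      return trimmed;
    }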

If you look at the cost, you naturally start by dividing it into on-line
and off-line components.  The on-line cost is what you pay to serve each
recommendation.  For user-based recommendation, this consists of two
matrix-vector multiplications; for item-based recommendation, it is one
matrix-vector multiplication.  The off-line costs are zero and one
matrix-matrix product respectively.  Since off-line computation is vastly
cheaper than on-line computation, it usually behooves one to move
computation off-line, as the sketch below illustrates.  Also, note that
user-based recommendation involves multiplication by both H' and H.
Unless you keep both versions in memory, you will pay a terrible cost.
That makes user-based recommendations even more expensive than they first
appear.
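
In code, the difference in where the work lands is easy to see.  This
sketch reuses the hypothetical times/transpose helpers from above:

    class CostShape {
      private final double[][] itemItem;  // H' H, built off-line, once

      CostShape(double[][] H) {
        this.itemItem =
            AssociativityDemo.times(AssociativityDemo.transpose(H), H);
      }

      // on-line: one matrix-vector product per request
      double[] recommendItemBased(double[] x) {
        return AssociativityDemo.times(itemItem, x);
      }

      // on-line: two matrix-vector products per request, touching
      // both H and H', so both orientations must be cheap to access
      double[] recommendUserBased(double[][] H, double[] x) {
        return AssociativityDemo.times(
            AssociativityDemo.transpose(H), AssociativityDemo.times(H, x));
      }
    }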

Latent variable techniques change the structure of H to be something like a
product decomposition.  For SVD, we have H = U S V' where S is a small
diagonal matrix and U' U = I and V' V = I.  This means that H' H = V S^2 V',
which can make computing H' H x more or less expensive depending on the
implementation, but, more importantly, it smooths H' H substantially so that
elements that would have been zero are no longer zero (in a good way).  Other
latent variable techniques like LDA use different terminology and may not
actually do multiplication, but the pattern of computation is still the same
and many of the conclusions still hold at least approximately.
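
Spelled out, with the same prime notation for transpose, the orthonormality
of U is exactly what collapses the product:

    H' H = (U S V')' (U S V')
         = V S (U' U) S V'
         = V S^2 V'

    so  H' H x = V S^2 (V' x)

Evaluated right to left, V' x, the scaling by S^2, and the final
multiplication by V are all small dense operations, which is where the
implementation-dependent cost comes from.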

Almost all of the work in recommendations centers around small changes in
exactly what the matrix multiplication really means and how many non-zero
elements are retained.  Looking at the problem in this unified way helps
you understand how to make trade-offs in the architecture.

On Tue, Jan 20, 2009 at 11:37 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hello,
>
> Boolean* classes sped things up for me:
>
>    UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
>    hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity, model);
>    recommender = new BooleanUserGenericUserBasedRecommender(model, hood, similarity);
>
> Sean did recommend using an item-based recommender when the number of items
> is relatively low compared to the number of users, but we only have a
> Boolean flavour of the user-based recommender in svn for now.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Sean Owen <srowen@gmail.com>
> > To: mahout-dev@lucene.apache.org
> > Sent: Tuesday, January 20, 2009 7:26:07 AM
> > Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
> >
> > Yes, log-likelihood is already implemented as a similarity metric as
> > well, no problem.
> >
> > What I need to do is cook up a quick DataModel that reads your file
> > format. Is it really like this: a user ID followed by the item IDs it is
> > associated with? How can I tell when a line specifies the opposite, an
> > item followed by user IDs? The former is easier, BTW.
> >
> > [1234: 324, 555, 333]
> >
> > Then the code to build a recommender is basically:
> >
> > DataModel model = ...; // the custom DataModel I'll make
> > ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
> > // can also try ... = new TanimotoCoefficientSimilarity(model);
> > similarity = new CachingItemSimilarity(similarity); // for speed
> > Recommender recommender = new GenericItemBasedRecommender(model, similarity);
> > List<RecommendedItem> recommendations = recommender.recommend("1234", 10);
> >
> > Your data set is big but not so large that it couldn't fit in memory,
> > I think. For now I think this easily runs on a reasonably beefy
> > machine -- it's going to need a lot of RAM to cache lots of item-item
> > similarities or else that computation will slow things down a lot.
> >
> > Easy enough to try and see how it flies.
> >
> > Sean
> >
> > On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur wrote:
> > > Hi Sean,
> > >        Thanks for helping out here. The data can be assumed to be in
> > > either of the forms mentioned below, since the two are interchangeable:
> > >
> > > [userA: item1, item2, item3 ... ]
> > > OR
> > > [item1: userA, userB, userC ..]
> > >
> > > Each user and each item is assigned a unique identifier. The ratings
> > > can be considered binary: 1 if a user clicked on an item and 0
> > > otherwise. The thing to note here is that in the case of a 0, the
> > > item does not exist in the user's history. So what we have,
> > > essentially, is a sparse representation where 0's are not stored at
> > > all.
> > >
> > > As for which is more numerous (users or items): our dataset has a
> > > relatively high number of users and fewer items. There are around
> > > 200 - 300 thousand unique items, but that is expected to grow to
> > > 1 - 2 million. So I think an item-based recommender sounds like
> > > something we can try out.
> > >
> > > About the Tanimoto measure, I thought of using it in hierarchical
> > > clustering, but Ted suggested it might not serve the purpose. He
> > > suggested that we could try computing the log-likelihood of
> > > co-occurrence of items.
> > >
> > > I would like to try out both the item-based recommender you suggested
> > > and the log-likelihood approach. Do we have a map-reduce version of
> > > the log-likelihood code in Mahout?
> > >
> > > Ted, any thoughts?
> > >
> > > Regards
> > > -Ankur
>
>


-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
