User-based recommendations are very close to item-based recommendations.
If you view all of your user history as a sparse matrix H of users x
items, then the users whose histories are similar to a particular history x
can be computed by (roughly) H x.
Then, if you want to look at what items those users have interacted with,
you multiply again to get H' (H x).
Because matrix multiplication is associative, this is the same as (H' H) x,
which is just item-level recommendation.
Now, it is true that H shouldn't really be exactly the user histories
(sampling and scaling will intervene), it is true that H x isn't really
quite multiplication (you will probably want to trim the number of nonzero
elements, for example), and it is true that, because of all these
imprecisions, H' (H x) isn't quite the same as (H' H) x. It is pretty
darned close, however, and the pattern of computation is substantially the
same no matter what you do.
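To make the equivalence concrete, here is a toy sketch with dense arrays (my own illustration with made-up names, not Mahout code) showing that H' (H x) and (H' H) x give the same item scores for a small binary history matrix:

```java
// Toy check that user-based scoring H'(H x) equals item-based scoring
// (H'H) x. H is users x items; x is one user's item history.
public class RecEquivalence {
    // 4 users, 3 items; row i is user i's click history.
    static final double[][] H = {
        {1, 1, 0},
        {1, 0, 1},
        {0, 1, 1},
        {1, 1, 1}
    };

    static double[] multiply(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                r[i] += m[i][j] * v[j];
        return r;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }

    static double[][] matmul(double[][] a, double[][] b) {
        double[][] r = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int k = 0; k < b.length; k++)
                for (int j = 0; j < b[0].length; j++)
                    r[i][j] += a[i][k] * b[k][j];
        return r;
    }

    public static void main(String[] args) {
        double[] x = {1, 1, 0};  // the active user's history
        double[][] Ht = transpose(H);
        double[] userBased = multiply(Ht, multiply(H, x)); // H' (H x)
        double[] itemBased = multiply(matmul(Ht, H), x);   // (H' H) x
        for (int j = 0; j < x.length; j++) {
            System.out.println(userBased[j] + " == " + itemBased[j]);
            assert Math.abs(userBased[j] - itemBased[j]) < 1e-9;
        }
    }
}
```

In practice H is sparse and huge, so you would never materialize these products densely; the point is only the shape of the computation.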
If you look at the cost you naturally start by dividing the cost into
online and offline components. The online cost is what it costs you to
do each recommendation. For user-based recommendation, this consists of two
matrix-vector multiplications. For item-based recommendation, it is one
matrix-vector multiplication. The offline costs are zero and one
matrix-matrix product, respectively. Since offline computation is vastly
cheaper than online computation, it usually behooves one to move
computation offline. Also, note that user-based recommendation involves
multiplication by both H' and H. Unless you keep both versions in memory,
you will pay a terrible cost. That makes user-based recs even more
expensive than they first appear.
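The offline/online split, together with the nonzero trimming mentioned earlier, can be sketched as follows. This is a hypothetical toy example (class and method names are mine), not Mahout's actual implementation:

```java
import java.util.Arrays;

// Offline: compute the item-item matrix C = H'H once and trim each row to
// its top-k entries. Online: each request is one matrix-vector product C x.
public class OfflineItemRecs {
    static double[][] cooccurrence(double[][] h, int k) {
        int items = h[0].length;
        double[][] c = new double[items][items];
        for (double[] row : h)
            for (int i = 0; i < items; i++)
                for (int j = 0; j < items; j++)
                    c[i][j] += row[i] * row[j];
        // crude trim: keep only entries at least as large as the k-th
        // biggest in each row (ties may keep a few more)
        for (double[] row : c) {
            double[] sorted = row.clone();
            Arrays.sort(sorted);
            double cutoff = sorted[Math.max(0, sorted.length - k)];
            for (int j = 0; j < row.length; j++)
                if (row[j] < cutoff) row[j] = 0;
        }
        return c;
    }

    // Online cost per recommendation: one matrix-vector product.
    static double[] recommend(double[][] c, double[] x) {
        double[] scores = new double[c.length];
        for (int i = 0; i < c.length; i++)
            for (int j = 0; j < x.length; j++)
                scores[i] += c[i][j] * x[j];
        return scores;
    }

    public static void main(String[] args) {
        double[][] h = {{1, 1, 0}, {1, 0, 1}, {0, 1, 1}, {1, 1, 1}};
        double[][] c = cooccurrence(h, 1);      // offline, done once
        double[] scores = recommend(c, new double[]{1, 1, 0}); // per request
        // with k=1 only the diagonal survives the trim here,
        // so scores come out [3.0, 3.0, 0.0]
        System.out.println(Arrays.toString(scores));
    }
}
```

Note how aggressive trimming (k=1 here) visibly changes the scores, which is exactly why the trimmed (H' H) x is only approximately equal to the untrimmed product.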
Latent variable techniques change the structure of H to be something like a
product decomposition. For SVD, we have H = U S V' where S is a small
diagonal matrix and U' U = I and V' V = I. This means that H' H = V S^2 V'
which can make computing H' H x more or less expensive depending on
implementation, but, more importantly, it smooths H' H substantially so that
elements that would have been zero are not zero (in a good way). Other
latent variable techniques like LDA use different terminology and may not
actually do multiplication, but the pattern of computation is still the same
and many of the conclusions still hold at least approximately.
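As a sanity check on the SVD identity, here is a small self-contained numeric example using hand-built 2x2 rotation matrices for U and V (all names are mine, not from any library); it confirms that with H = U S V' and orthonormal U and V, H' H collapses to V S^2 V':

```java
// Verify H'H = V S^2 V' for a toy SVD built from rotation matrices,
// which are orthonormal by construction.
public class SvdCheck {
    static double[][] rot(double t) {
        return new double[][]{{Math.cos(t), -Math.sin(t)},
                              {Math.sin(t),  Math.cos(t)}};
    }

    static double[][] transpose(double[][] m) {
        double[][] r = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                r[j][i] = m[i][j];
        return r;
    }

    static double[][] matmul(double[][] a, double[][] b) {
        double[][] r = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int k = 0; k < b.length; k++)
                for (int j = 0; j < b[0].length; j++)
                    r[i][j] += a[i][k] * b[k][j];
        return r;
    }

    public static void main(String[] args) {
        double[][] u = rot(0.3), v = rot(1.1);
        double[][] s = {{3, 0}, {0, 1}};                // small diagonal S
        double[][] h = matmul(matmul(u, s), transpose(v));   // H = U S V'
        double[][] lhs = matmul(transpose(h), h);            // H' H
        double[][] s2 = {{9, 0}, {0, 1}};                    // S^2
        double[][] rhs = matmul(matmul(v, s2), transpose(v)); // V S^2 V'
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                assert Math.abs(lhs[i][j] - rhs[i][j]) < 1e-9;
        System.out.println("H'H matches V S^2 V' to rounding error");
    }
}
```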
Almost all of the work in recommendations centers around small changes in
exactly what the matrix multiplication really means and how many nonzero
elements are retained. Looking at the problem in this unified way helps in
understanding how to make tradeoffs in the architecture.
On Tue, Jan 20, 2009 at 11:37 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:
> Hello,
>
> Boolean* classes sped up things for me:
>
> UserSimilarity similarity = new
> BooleanTanimotoCoefficientSimilarity(model);
> hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY,
> similarity, model);
> recommender = new BooleanUserGenericUserBasedRecommender(model, hood,
> similarity);
>
> Sean did recommend using the Item-based recommender when the number of
> items is relatively low compared to the number of users, but we only have a
> Boolean flavour of the User-based recommender in svn for now.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Sean Owen <srowen@gmail.com>
> > To: mahout-dev@lucene.apache.org
> > Sent: Tuesday, January 20, 2009 7:26:07 AM
> > Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
> >
> > Yes, log-likelihood is already implemented as a similarity metric as
> > well, no problem.
> >
> > What I need to do is cook up a quick DataModel that reads your file
> > format. Is it really like this: user ID followed by item IDs that are
> > associated? How can I tell when a line specifies the opposite, an item
> > followed by user IDs? The former is easier, BTW.
> >
> > [1234: 324, 555, 333]
> >
> > Then the code to build a recommender is basically:
> >
> > DataModel model = ...; // the custom DataModel I'll make
> > ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
> > // can also try ... = new TanimotoCoefficientSimilarity(model);
> > similarity = new CachingItemSimilarity(similarity); // for speed
> > Recommender recommender = new GenericItemBasedRecommender(model,
> > similarity);
> > List<RecommendedItem> recommendations = recommender.recommend("1234", 10);
> >
> > Your data set is big but not so large that it couldn't fit in memory,
> > I think. For now I think this easily runs on a reasonably beefy
> > machine -- it's going to need a lot of RAM to cache lots of item-item
> > similarities or else that computation will slow things down a lot.
> >
> > Easy enough to try and see how it flies.
> >
> > Sean
> >
> > On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur wrote:
> > > Hi Sean,
> > > Thanks for helping out here. The data can be assumed to be in
> > > either form mentioned below, since both forms are interchangeable:
> > >
> > > [userA: item1, item2, item3 ... ]
> > > OR
> > > [item1: userA, userB, userC ..]
> > >
> > > Each user and each item is assigned a unique identifier. The ratings
> > > can be considered binary: 1 if the user clicked on an item and 0
> > > otherwise.
> > > The thing to note here is that in the case of 0, the item does not
> > > appear in the user history. So what we have essentially is a sparse
> > > representation where 0's are not stored at all.
> > >
> > > As for which of the two (users/items) is larger: in our dataset we
> > > have a relatively high number of users and fewer items. There are
> > > around 200-300 thousand unique items, but this is expected to grow to
> > > 1-2 million. So I think an item-based recommender sounds like
> > > something we can try out.
> > >
> > > About the Tanimoto measure, I thought of using it in hierarchical
> > > clustering, but Ted suggested it might not serve the purpose. He
> > > suggested that we
> > > can try computing the log-likelihood of co-occurrence of items.
> > >
> > > I would like to try out both the item-based recommender you suggested
> > > and also the log-likelihood approach. Do we have the mapred version of
> > > the log-likelihood code in Mahout?
> > >
> > > Ted, any thoughts?
> > >
> > > Regards
> > > Ankur
>
>
--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
