mahout-user mailing list archives

From Matthew Runo <matthew.r...@gmail.com>
Subject Item Similarity Calculations
Date Mon, 07 Feb 2011 18:33:54 GMT
Hello folks -

It's time for another "question that shouldn't be a question" from Matthew!

I saw in the javadoc that, when implementing item-based
recommenders with GenericItemBasedRecommender, we're supposed
to precompute all the item-item similarities rather than using
something like the UncenteredCosineSimilarity class on the fly.

While I can appreciate the time savings of using precomputed
similarities, and I can appreciate the customizability of
precomputing them, wouldn't this lead to an upkeep nightmare in the
long run? Every time I add something, I have to compare the
new item to everything else. Removing, I suppose, is somewhat easier,
since we just delete the entries and refresh the DataModel.

Every time a new item is added, or an old item is removed, everything
needs to be adjusted - right? Say we have a system with 150,000 items
- that's 22,500,000,000 rows that go into the similarity table
(assuming a dumb loop within a loop style calculator). That would take
a beefy database server just to serve that one table quickly for the
system.
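
To put numbers on the table size, here is a minimal sketch in plain
Java (not Mahout code; the class and method names are made up for
illustration). It assumes the similarity measure is symmetric, as
uncentered cosine is, so each unordered pair only needs to be stored
once and an item is never paired with itself.

```java
// Hypothetical helper, not part of Mahout: counts similarity rows
// for a catalog of n items.
final class PairCount {

    // Naive loop-within-a-loop: every item against every item,
    // including itself and both orderings of each pair.
    static long allOrderedPairs(long n) {
        return n * n;
    }

    // Assumes sim(a, b) == sim(b, a): store each unordered pair once
    // and skip the item-with-itself rows.
    static long uniquePairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long n = 150_000L;
        System.out.println(allOrderedPairs(n)); // 22500000000
        System.out.println(uniquePairs(n));     // 11249925000
    }
}
```

Under that assumption the table is roughly half the naive size, which
is still more than 11 billion rows.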

Does it make sense to leave out entries for items that are totally
dissimilar? I assume that might hurt the "long tail" of
recommendations... Are there other optimizations that I'm just not
seeing?
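
One way to picture the "leave out dissimilar entries" idea is to keep
only the k most similar neighbors of each item. The sketch below is
plain Java under that assumption; TopNeighbors, Neighbor and keepTop
are hypothetical names, not Mahout classes, and whether this hurts
long-tail recommendations would depend on the choice of k.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not a Mahout API: prune one item's candidate
// similarities down to its k strongest neighbors.
final class TopNeighbors {

    static final class Neighbor {
        final long itemId;
        final double similarity;

        Neighbor(long itemId, double similarity) {
            this.itemId = itemId;
            this.similarity = similarity;
        }
    }

    static List<Neighbor> keepTop(List<Neighbor> candidates, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on similarity: the weakest kept neighbor sits on top.
        PriorityQueue<Neighbor> heap = new PriorityQueue<>(
                k, Comparator.comparingDouble((Neighbor n) -> n.similarity));
        for (Neighbor c : candidates) {
            if (heap.size() < k) {
                heap.add(c);
            } else if (c.similarity > heap.peek().similarity) {
                heap.poll(); // evict the current weakest
                heap.add(c);
            }
        }
        // Return the survivors, most similar first.
        List<Neighbor> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(
                (Neighbor n) -> n.similarity).reversed());
        return result;
    }
}
```

With 150,000 items and, say, k = 50, this would store at most
7,500,000 rows instead of billions.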

When computing item similarity, it seems there are two schools
of thought: one uses users' preferences for the item and for other
items, and the other uses item attributes. I assume that if I
wanted to go the item-attributes route, I'd be getting
into the clustering algorithms, and that in that case I would build
clusters using the item attributes and go from there?
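
For concreteness, an attribute-based similarity could be as simple as
an uncentered cosine over per-item attribute vectors. This is a
minimal sketch in plain Java. The class name and the idea of encoding
attributes as a fixed-length double[] are assumptions made for
illustration, not something taken from Mahout.

```java
// Hypothetical sketch, not a Mahout class: uncentered cosine
// similarity between two items described by numeric attributes.
final class AttributeCosine {

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                    "attribute vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumption: an item with no attributes set is treated as
        // similar to nothing, rather than dividing by zero.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Items whose attribute vectors point the same way score near 1, and
items with no attributes in common score 0.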

I'm sorry for the basic questions. Hopefully they have basic answers,
and hopefully this will all help other people who, like me, are not
experts in the fields of statistics and linear equations.

I really appreciate any and all help,

--Matthew Runo
