mahout-user mailing list archives

From Sean Owen <>
Subject Re: Item Similarity Calculations
Date Mon, 07 Feb 2011 22:49:02 GMT
That recommendation in the javadoc may be a little too strong. As you say,
it's really a trade-off between memory consumption, run-time speed, and cost
of updates. At a certain scale, the precomputed similarities simply won't fit
in memory.

You can completely pre-compute with GenericItemSimilarity.
You can wrap an implementation in CachingItemSimilarity so that it at least
remembers what it has already computed, rather than precomputing everything.
Or you can use an ItemSimilarity directly -- slowest in theory, but with no
memory overhead.
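Against the taste API (class names as of roughly Mahout 0.4; verify them against your version), the three options look something like this -- a sketch, not the canonical setup, and the data file name is made up:

```java
// Sketch of the three options above, against Mahout's taste API.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.CachingItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarityOptions {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // hypothetical data file

    // Option 1: fully pre-computed. Feed GenericItemSimilarity a collection of
    // precomputed ItemItemSimilarity values (just one here, for illustration).
    List<GenericItemSimilarity.ItemItemSimilarity> precomputed =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    precomputed.add(new GenericItemSimilarity.ItemItemSimilarity(1L, 2L, 0.9));
    ItemSimilarity option1 = new GenericItemSimilarity(precomputed);

    // Option 2: compute lazily, but remember every value once computed.
    ItemSimilarity option2 =
        new CachingItemSimilarity(new PearsonCorrelationSimilarity(model), model);

    // Option 3: compute on the fly every time -- slowest, but no memory overhead.
    ItemSimilarity option3 = new PearsonCorrelationSimilarity(model);

    // Plug whichever option fits your scale into the recommender.
    new GenericItemBasedRecommender(model, option2);
  }
}
```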

Pruning can be a great thing. You want to prune the item-item similarities
that carry the least information first. That is not necessarily a function of
the similarity value itself, though for certain algorithms it can be. For
example, with a similarity based on the Pearson correlation (which is the
cosine similarity computed over mean-centered data), a value near 0.0 does
suggest the pair carries no information. But a 0.0 does not always mean this
-- it depends on the computation. More generally, item-item similarities
based on fewer data points are probably less reliable and therefore better
candidates for pruning.
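To illustrate the "fewer data points" heuristic concretely -- plain Java, not a Mahout API; the Entry class and threshold are invented for this sketch -- you could keep a co-rating count alongside each precomputed pair and drop the weakly supported ones, regardless of their similarity value:

```java
import java.util.ArrayList;
import java.util.List;

public class SimilarityPruner {

  /** One precomputed item-item similarity, plus how many users co-rated the pair. */
  public static final class Entry {
    public final long item1;
    public final long item2;
    public final double similarity;
    public final int supportCount; // number of users who rated both items

    public Entry(long item1, long item2, double similarity, int supportCount) {
      this.item1 = item1;
      this.item2 = item2;
      this.similarity = similarity;
      this.supportCount = supportCount;
    }
  }

  /**
   * Keep only pairs backed by at least minSupport co-rating users. Pairs with
   * little data behind them are the least reliable, so they go first -- even
   * when their similarity value happens to be high.
   */
  public static List<Entry> prune(List<Entry> all, int minSupport) {
    List<Entry> kept = new ArrayList<Entry>();
    for (Entry e : all) {
      if (e.supportCount >= minSupport) {
        kept.add(e);
      }
    }
    return kept;
  }
}
```

A pair rated by two users is pruned before a weak-but-well-supported pair, which is the opposite of what a pure similarity-value cutoff would do.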

You don't need clustering to produce an attribute- or content-based notion
of item similarity. Broadly, this notion of similarity is an input to these
processes, not an output.

As far as Mahout goes, the world of recommendations treats content-based
item similarity as out of scope. It's something quite domain-specific. You
can plug in your own notion of ItemSimilarity, but it's necessarily your own,
for your domain. Clustering goes a little farther: if you can somehow map
those attributes to numerical values in a vector, it can compute similarity.
But the selection of features is often, again, domain-specific and up to you.
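Plugging in your own notion might look like the sketch below. The attribute store and the Jaccard computation are stand-ins for your own domain logic, and depending on your Mahout version the ItemSimilarity interface may declare additional methods beyond these:

```java
// Hypothetical content-based plug-in: implements Mahout's ItemSimilarity
// extension point over item attributes instead of user preferences.
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class AttributeItemSimilarity implements ItemSimilarity {

  // e.g. itemID -> {"shoe", "red", "running"}; loaded from your own catalog
  private final Map<Long, Set<String>> attributesByItem;

  public AttributeItemSimilarity(Map<Long, Set<String>> attributesByItem) {
    this.attributesByItem = attributesByItem;
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) {
    Set<String> a = attributesByItem.get(itemID1);
    Set<String> b = attributesByItem.get(itemID2);
    if (a == null || b == null) {
      return Double.NaN; // unknown item: no opinion
    }
    // Jaccard overlap of attribute sets, rescaled to [-1, 1] as Mahout expects.
    Set<String> intersection = new HashSet<String>(a);
    intersection.retainAll(b);
    Set<String> union = new HashSet<String>(a);
    union.addAll(b);
    double jaccard = union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    return 2.0 * jaccard - 1.0;
  }

  @Override
  public double[] itemSimilarities(long itemID1, long[] itemID2s) {
    double[] result = new double[itemID2s.length];
    for (int i = 0; i < itemID2s.length; i++) {
      result[i] = itemSimilarity(itemID1, itemID2s[i]);
    }
    return result;
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // Attributes are static in this sketch; nothing to reload.
  }
}
```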

On Mon, Feb 7, 2011 at 6:33 PM, Matthew Runo <> wrote:

> Hello folks -
> It's time for another "question that shouldn't be a question" from Matthew!
> I saw by reading javadoc that when implementing item based
> recommenders using the GenericItemBasedRecommender that we're supposed
> to precompute all the item-item similarities rather than using
> something like the UncenteredCosineSimilarity class on-the-fly.
> While I can appreciate the time savings of using precomputed
> similarities, and I can appreciate the customizability of
> pre-computing these.. wouldn't this lead to an upkeep nightmare in the
> long run? Every time I add or remove something, I have to compare the
> new item to everything else... removing I suppose is somewhat easier
> since we just delete the entries and refresh the DataModel.
> Every time a new item is added, or an old item is removed, everything
> needs to be adjusted - right? Say we have a system with 150,000 items
> - that's 22,500,000,000 rows that go into the similarity table
> (assuming a dumb loop within a loop style calculator). That would take
> a beefy database server just to serve that one table quickly for the
> system.
> Does it make sense to leave out entries for items that are totally
> dissimilar? I assume that might hurt the "long tail" of
> recommendations... Are there other optimizations that I'm just not
> seeing?
> When computing an item similarity it seems that there are two schools
> of thought - one using user preferences towards it and other items to
> compute, and one using item attributes to compute. I assume that if I
> wanted to go towards the item attributes computation then I'm getting
> into the clustering algorithms, and that in that case I would build
> clusters using the item attributes and go from there?
> I'm sorry for the basic questions. Hopefully they have basic answers,
> and hopefully this will all help other people who, like me, are not
> experts in the fields of statistics and linear equations..
> I really appreciate any and all help,
> --Matthew Runo
