mahout-user mailing list archives

From Sebastian Schelter <>
Subject Re: Item Similarity Calculations
Date Mon, 07 Feb 2011 18:45:02 GMT
If you can live with precomputing the similarities only every now and
then, you can run the computation in Hadoop.

The result of this job can be put in a .txt file and loaded into an
ItembasedRecommender. The Hadoop job also offers an option to keep only
the top n similar items per item, which gives you control over the
result size.
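
To illustrate what keeping only the top n similar items per item means,
here is a minimal sketch in plain Java. It is not the Hadoop job and it
does not use any Mahout API: the class TopNSimilarItems, its nested
types and the method topN are hypothetical names made up for this
example. It assumes the similarities are already available in memory as
(itemA, itemB, similarity) entries, one per unordered pair, and that
the measure is symmetric.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper (not part of Mahout): keeps the n most similar items per item. */
public class TopNSimilarItems {

    /** One precomputed similarity between two items (assumed symmetric). */
    public static final class Pair {
        public final long itemA;
        public final long itemB;
        public final double similarity;

        public Pair(long itemA, long itemB, double similarity) {
            this.itemA = itemA;
            this.itemB = itemB;
            this.similarity = similarity;
        }
    }

    /** A neighbor of some item together with its similarity value. */
    public static final class Neighbor {
        public final long itemId;
        public final double similarity;

        public Neighbor(long itemId, double similarity) {
            this.itemId = itemId;
            this.similarity = similarity;
        }
    }

    // Orders neighbors from least to most similar, so the heap head is the weakest one.
    private static final Comparator<Neighbor> BY_SIMILARITY =
            Comparator.comparingDouble((Neighbor nb) -> nb.similarity);

    /** Returns, for every item, at most n neighbors sorted by descending similarity. */
    public static Map<Long, List<Neighbor>> topN(List<Pair> pairs, int n) {
        Map<Long, PriorityQueue<Neighbor>> heaps = new HashMap<>();
        for (Pair p : pairs) {
            // Each pair contributes a neighbor to both of its items.
            offer(heaps, p.itemA, new Neighbor(p.itemB, p.similarity), n);
            offer(heaps, p.itemB, new Neighbor(p.itemA, p.similarity), n);
        }
        Map<Long, List<Neighbor>> result = new HashMap<>();
        for (Map.Entry<Long, PriorityQueue<Neighbor>> e : heaps.entrySet()) {
            List<Neighbor> sorted = new ArrayList<>(e.getValue());
            sorted.sort(BY_SIMILARITY.reversed());
            result.put(e.getKey(), sorted);
        }
        return result;
    }

    private static void offer(Map<Long, PriorityQueue<Neighbor>> heaps,
                              long item, Neighbor neighbor, int n) {
        PriorityQueue<Neighbor> heap = heaps.computeIfAbsent(
                item, k -> new PriorityQueue<Neighbor>(BY_SIMILARITY));
        heap.offer(neighbor);
        if (heap.size() > n) {
            heap.poll(); // drop the weakest neighbor once more than n are held
        }
    }
}
```

With a bounded heap per item, memory stays at roughly n entries per
item no matter how many pairs are scanned, which is the same kind of
control over the result size described above.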


On 07.02.2011 19:33, Matthew Runo wrote:
> Hello folks -
> It's time for another "question that shouldn't be a question" from Matthew!
> I saw by reading javadoc that when implementing item based
> recommenders using the GenericItemBasedRecommender that we're supposed
> to precompute all the item-item similarities rather than using
> something like the UncenteredCosineSimilarity class on-the-fly.
> While I can appreciate the time savings of using precomputed
> similarities, and I can appreciate the customizability of
> pre-computing these... wouldn't this lead to an upkeep nightmare in the
> long run? Every time I add or remove something, I have to compare the
> new item to everything else... removing I suppose is somewhat easier
> since we just delete the entries and refresh the DataModel.
> Every time a new item is added, or an old item is removed, everything
> needs to be adjusted - right? Say we have a system with 150,000 items
> - that's 22,500,000,000 rows that go into the similarity table
> (assuming a dumb loop within a loop style calculator). That would take
> a beefy database server just to serve that one table quickly for the
> system.
> Does it make sense to leave out entries for items that are totally
> dissimilar? I assume that might hurt the "long tail" of
> recommendations... Are there other optimizations that I'm just not
> seeing?
> When computing an item similarity it seems that there are two schools
> of thought - one using user preferences towards it and other items to
> compute, and one using item attributes to compute. I assume that if I
> wanted to go towards the item attributes computation then I'm getting
> into the clustering algorithms, and that in that case I would build
> clusters using the item attributes and go from there?
> I'm sorry for the basic questions. Hopefully they have basic answers,
> and hopefully this will all help other people who, like me, are not
> experts in the fields of statistics and linear equations.
> I really appreciate any and all help,
> --Matthew Runo
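
As a rough check of the 150,000-item estimate in the quoted message,
the following sketch compares three table sizes: the naive loop within
a loop, storing each unordered pair once (which assumes a symmetric
similarity measure), and keeping only the top k similar items per item.
The class SimilarityTableSize and the value k = 50 are assumptions
chosen purely for illustration; neither comes from the thread or from
Mahout.

```java
/** Hypothetical back-of-the-envelope calculator for similarity table sizes. */
public class SimilarityTableSize {

    /** Rows produced by a naive loop within a loop over all items: n * n. */
    public static long allOrderedPairs(long n) {
        return n * n;
    }

    /** Rows when each unordered pair of distinct items is stored once: n * (n - 1) / 2. */
    public static long unorderedPairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Upper bound on rows when only the top k similar items are kept per item. */
    public static long topKRows(long n, long k) {
        return n * Math.min(k, n - 1);
    }

    public static void main(String[] args) {
        long items = 150_000L; // item count from the quoted message
        long k = 50L;          // assumed cutoff, for illustration only
        System.out.println("naive n*n rows:       " + allOrderedPairs(items));
        System.out.println("unordered pairs:      " + unorderedPairs(items));
        System.out.println("top-" + k + " rows (at most): " + topKRows(items, k));
    }
}
```

For 150,000 items this gives 22,500,000,000 rows for the naive loop,
11,249,925,000 for unordered pairs, and at most 7,500,000 rows with the
assumed cutoff of 50.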
