mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Item Similarity Calculations
Date Mon, 07 Feb 2011 18:45:02 GMT
If you can live with only precomputing the similarities every once in
a while, you can use
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob to run the
computation in Hadoop.
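
A minimal sketch of kicking off that job from Java (the flag names
--input/--output/--similarityClassname/--maxSimilaritiesPerItem, the
cosine similarity choice and all paths are only assumptions for
illustration - please check them against the options of your Mahout
version):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

public class RunItemSimilarity {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths and options, shown only to illustrate the call.
    ToolRunner.run(new ItemSimilarityJob(), new String[] {
        "--input", "/user/me/preferences.csv",        // userID,itemID,preference per line
        "--output", "/user/me/item-similarities",     // where the pairwise similarities land
        "--similarityClassname", "SIMILARITY_COSINE", // assumed similarity measure name
        "--maxSimilaritiesPerItem", "50"              // keep only the top n similar items per item
    });
  }
}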

The result of this job can be written to a .txt file and loaded into
an ItemBasedRecommender via
o.a.m.cf.taste.impl.similarity.file.FileItemSimilarity. The Hadoop job
also offers an option to keep only the top n similar items per item,
which gives you control over the result size.
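
To illustrate the loading step, a minimal sketch (the file names and
the FileDataModel-based preference source are assumptions; the
similarity file is simply whatever you exported from the Hadoop job's
output):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class PrecomputedItemRecommender {
  public static void main(String[] args) throws Exception {
    // Preferences for the DataModel and the precomputed similarities from
    // the Hadoop job; both file names are hypothetical.
    DataModel model = new FileDataModel(new File("preferences.csv"));
    ItemSimilarity similarity =
        new FileItemSimilarity(new File("item-similarities.txt"));

    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    // Top-5 recommendations for user 42, backed entirely by the
    // precomputed similarities.
    System.out.println(recommender.recommend(42L, 5));
  }
}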

--sebastian


On 07.02.2011 19:33, Matthew Runo wrote:
> Hello folks -
>
> It's time for another "question that shouldn't be a question" from Matthew!
>
> I saw, reading the javadoc, that when implementing item-based
> recommenders using the GenericItemBasedRecommender, we're supposed
> to precompute all the item-item similarities rather than using
> something like the UncenteredCosineSimilarity class on the fly.
>
> While I can appreciate the time savings of using precomputed
> similarities, and I can appreciate the customizability of
> pre-computing these... wouldn't this lead to an upkeep nightmare in the
> long run? Every time I add or remove something, I have to compare the
> new item to everything else... removing, I suppose, is somewhat easier
> since we just delete the entries and refresh the DataModel.
>
> Every time a new item is added, or an old item is removed, everything
> needs to be adjusted - right? Say we have a system with 150,000 items
> - that's 22,500,000,000 rows that go into the similarity table
> (assuming a dumb loop within a loop style calculator). That would take
> a beefy database server just to serve that one table quickly for the
> system.
>
> Does it make sense to leave out entries for items that are totally
> dissimilar? I assume that might hurt the "long tail" of
> recommendations... Are there other optimizations that I'm just not
> seeing?
>
> When computing an item similarity, it seems that there are two schools
> of thought - one using users' preferences for the item and for other
> items, and one using the items' attributes. I assume that if I wanted
> to go the item-attributes route, I'd be getting into the clustering
> algorithms, and in that case I would build clusters from the item
> attributes and go from there?
>
> I'm sorry for the basic questions. Hopefully they have basic answers,
> and hopefully this will all help other people who, like me, are not
> experts in the fields of statistics and linear equations...
>
> I really appreciate any and all help,
>
> --Matthew Runo

