[ https://issues.apache.org/jira/browse/MAHOUT420?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
Sebastian Schelter updated MAHOUT420:

Attachment: MAHOUT4203.patch
> Improving the distributed item-based recommender
> 
>
> Key: MAHOUT420
> URL: https://issues.apache.org/jira/browse/MAHOUT420
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Attachments: MAHOUT4202.patch, MAHOUT4202a.patch, MAHOUT4203.patch, MAHOUT420.patch
>
>
> A summary of the discussion on the mailing list:
> Extend the distributed item-based recommender from using only simple cooccurrence counts
to using the standard computations of an item-based recommender as defined in Sarwar et al.,
"Item-Based Collaborative Filtering Recommendation Algorithms" (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf).
> The distributed recommender computes prediction values for every user towards every item
that the user has not rated yet. The computation is done in the following way:
> u = a user
> i = an item not yet rated by u
> N = all items cooccurring with i
> Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
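> The cooccurrence-based formula above can be sketched in plain Java as follows. This is only an illustration, not the code of the distributed job: the class name, method name and the map-based representation are hypothetical, and it assumes that items the user has not rated simply contribute nothing to the sum.

```java
import java.util.Map;

// Hypothetical helper, not part of Mahout: evaluates the cooccurrence-based
// prediction for one user u and one unrated item i.
final class CooccurrencePrediction {

    /**
     * @param cooccurrencesWithI item id n -> cooccurrences(i, n)
     * @param ratingsOfU         item id n -> rating(u, n), only for items u has rated
     * @return sum over n of cooccurrences(i, n) * rating(u, n)
     */
    static double predict(Map<Long, Integer> cooccurrencesWithI,
                          Map<Long, Double> ratingsOfU) {
        double sum = 0.0;
        for (Map.Entry<Long, Integer> entry : cooccurrencesWithI.entrySet()) {
            Double rating = ratingsOfU.get(entry.getKey());
            if (rating != null) {
                // Assumption: items without a rating from u are skipped.
                sum += entry.getValue() * rating;
            }
        }
        return sum;
    }
}
```

> Note that the result is an unnormalized score, so it is only useful for ranking items, not as an estimate on the rating scale.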
> The formula from the paper, which GenericItemBasedRecommender.doEstimatePreference(...)
uses too, looks very similar to the one above:
> u = a user
> i = an item not yet rated by u
> N = all items similar to i (where similarity is usually computed by pairwise comparison of
the item vectors of the user-item matrix)
> Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all n from
N: abs(similarity(i,n)))
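> As a rough sketch, the normalized formula could look like this in plain Java. The names are hypothetical and this is not the body of GenericItemBasedRecommender.doEstimatePreference(...). It assumes that only items rated by u enter both sums, and it returns NaN when no such item exists.

```java
import java.util.Map;

// Hypothetical helper, not part of Mahout: similarity-weighted prediction
// normalized by the sum of the absolute similarities.
final class SimilarityPrediction {

    /**
     * @param similaritiesToI item id n -> similarity(i, n)
     * @param ratingsOfU      item id n -> rating(u, n), only for items u has rated
     * @return the normalized prediction, or NaN if the denominator is zero
     */
    static double predict(Map<Long, Double> similaritiesToI,
                          Map<Long, Double> ratingsOfU) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (Map.Entry<Long, Double> entry : ratingsOfU.entrySet()) {
            Double similarity = similaritiesToI.get(entry.getKey());
            if (similarity == null) {
                continue; // rated item is not in N, so it is ignored
            }
            numerator += similarity * entry.getValue();
            denominator += Math.abs(similarity);
        }
        return denominator == 0.0 ? Double.NaN : numerator / denominator;
    }
}
```

> With similarities 0.5 and -0.25 and ratings 4.0 and 2.0, the numerator is 1.5 and the denominator is 0.75, giving a prediction of 2.0.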
> There are only 2 differences:
> a) instead of the cooccurrence count, similarity measures such as Pearson or cosine
can be used
> b) the resulting value is normalized by the sum of the similarities
> To overcome difference a), we would only need to replace the part that computes the cooccurrence
matrix with the code from ItemSimilarityJob or the code introduced in MAHOUT418. We could then
compute arbitrary similarity matrices and use them in the same way the cooccurrence
matrix is currently used. We just need to separate the steps up to creating the cooccurrence
matrix from the rest, which is simple since they are already different MR jobs.
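> To illustrate the idea of swapping the matrix computation, here is a small sketch in plain Java where the cooccurrence count and the cosine are two implementations of the same interface. The interface and class names are hypothetical and are not the API of ItemSimilarityJob. Item vectors are assumed to be sparse maps from user id to rating.

```java
import java.util.Map;

// Hypothetical interface, not part of Mahout: any pairwise measure over two
// sparse item vectors (user id -> rating) can fill the item-item matrix.
interface ItemVectorSimilarity {
    double similarity(Map<Long, Double> itemA, Map<Long, Double> itemB);
}

final class ItemSimilarities {

    // Number of users who rated both items (the current behaviour).
    static final ItemVectorSimilarity COOCCURRENCE = (a, b) -> {
        double count = 0.0;
        for (Long user : a.keySet()) {
            if (b.containsKey(user)) {
                count++;
            }
        }
        return count;
    };

    // Cosine of the angle between the two item vectors.
    static final ItemVectorSimilarity COSINE = (a, b) -> {
        double dot = 0.0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // Assumption: an all-zero or empty vector is treated as similarity 0.
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    };

    private static double norm(Map<Long, Double> vector) {
        double sumOfSquares = 0.0;
        for (double value : vector.values()) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

> The rest of the pipeline only sees matrix entries, so it does not need to know which of the two measures produced them.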
> Regarding difference b), from a first look at the implementation I think it should be
possible to pass the necessary similarity matrix entries from the PartialMultiplyMapper
to the AggregateAndRecommendReducer, so that the reducer can compute the normalization value in the denominator
of the formula. This will take a little work, yes, but is still straightforward. It can be
in the "common" part of the process, done after the similarity matrix is generated.
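> One possible shape for this, sketched in plain Java without any Hadoop types: each partial result carries the absolute similarity next to the weighted rating, so the aggregating side can build the denominator. The class names are hypothetical and do not reflect how PartialMultiplyMapper or AggregateAndRecommendReducer are actually written.

```java
import java.util.List;

// Hypothetical value emitted per (user u, candidate item i, rated item n).
final class PartialContribution {
    final double weightedRating; // similarity(i, n) * rating(u, n)
    final double absSimilarity;  // abs(similarity(i, n)), needed for normalization

    PartialContribution(double similarity, double rating) {
        this.weightedRating = similarity * rating;
        this.absSimilarity = Math.abs(similarity);
    }
}

// Hypothetical stand-in for the aggregation done on the reduce side.
final class NormalizingAggregator {

    /** Sums all partial contributions for one (u, i) pair and normalizes. */
    static double aggregate(List<PartialContribution> partials) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (PartialContribution partial : partials) {
            numerator += partial.weightedRating;
            denominator += partial.absSimilarity;
        }
        // Assumption: no usable similarity means no prediction (NaN).
        return denominator == 0.0 ? Double.NaN : numerator / denominator;
    }
}
```

> Because both sums are plain additions, the partial values can arrive in any order and from any number of mappers.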
> I think work on this issue should wait until MAHOUT418 is resolved as the implementation
here depends on how the pairwise similarities will be computed in the future.

