mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Itembased Collaborative Filtering
Date Thu, 14 Oct 2010 20:01:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Itembased Collaborative Filtering (https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering)


Edited by Sebastian Schelter:
---------------------------------------------------------------------
Itembased Collaborative Filtering is a popular way of doing Recommendation Mining.

h3. Terminology

We have *users* that interact with *items* (which can be almost anything: books, videos, news,
other users, ...). Those users express *preferences* towards the items, which can either be
boolean (just modelling that a user likes an item) or numeric (by having a rating value assigned
to the preference). Typically only a small number of preferences is known for each individual
user.

h3. Algorithmic problems

Collaborative Filtering algorithms aim to solve the *prediction* problem, where the task is
to estimate the preference of a user towards an item which he/she has not yet seen (see Sarwar
et al., "Item-Based Collaborative Filtering Recommendation Algorithms", under Resources).

Once an algorithm can predict preferences, it can also be used to do *Top-N-Recommendation*, where
the task is to find the N items a given user might like best. This is usually done by isolating
a set of candidate items, computing the predicted preference of the given user towards them
and returning the highest scoring ones.
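
The Top-N procedure described above can be sketched in a few lines of Python. This is only an
illustration, not Mahout code: the function {{top_n}}, the stand-in predictor {{toy_predict}} and
the item names are hypothetical, and the candidate set is assumed to be simply "all items the
user has not rated yet".

```python
import heapq


def top_n(user_prefs, all_items, predict, n):
    """Return the n highest-scoring (item, score) pairs for one user.

    user_prefs: dict mapping itemID -> known preference of the user
    all_items:  iterable of all known itemIDs
    predict:    function (user_prefs, item) -> predicted preference
    """
    # Candidate items: assumed here to be every item the user has not rated.
    candidates = [item for item in all_items if item not in user_prefs]
    # Score each candidate with the supplied prediction function.
    scored = [(item, predict(user_prefs, item)) for item in candidates]
    # Keep only the n best-scoring candidates, best first.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Hypothetical stand-in predictor that looks up made-up scores.
toy_scores = {"book_c": 4.5, "book_d": 2.0, "book_e": 3.5}


def toy_predict(user_prefs, item):
    return toy_scores[item]


user_prefs = {"book_a": 5.0, "book_b": 3.0}
all_items = ["book_a", "book_b", "book_c", "book_d", "book_e"]
recommendations = top_n(user_prefs, all_items, toy_predict, 2)
```

Any real prediction function can be passed in place of {{toy_predict}}; the selection logic stays
the same.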

If we look at the problem from a mathematical perspective, a *user-item-matrix* is created
from the preference data and the task is to predict the missing entries by finding patterns
in the known entries.
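
As a small illustration of the matrix view, the following Python sketch builds a sparse
user-item-matrix from preference lines of the form _userID,itemID,value_. The function name
{{build_user_item_matrix}} and the sample data are made up for this example, and IDs are kept as
plain strings for simplicity.

```python
from collections import defaultdict


def build_user_item_matrix(lines):
    """Build {userID: {itemID: value}} from 'userID,itemID,value' lines."""
    matrix = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        user_id, item_id, value = line.split(",")
        matrix[user_id][item_id] = float(value)
    return dict(matrix)


# Made-up sample preferences.
sample = ["1,101,5.0", "1,102,3.0", "2,101,4.0", "2,103,2.0"]
matrix = build_user_item_matrix(sample)

# Dense view: one row per user, one column per item,
# with None marking the missing entries that have to be predicted.
items = sorted({item for prefs in matrix.values() for item in prefs})
dense = {user: [prefs.get(item) for item in items] for user, prefs in matrix.items()}
```

The {{None}} entries in the dense view are exactly the missing preferences a Collaborative
Filtering algorithm tries to estimate.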

h3. Itembased Collaborative Filtering

A popular approach called "Itembased Collaborative Filtering" estimates a user's preference
towards an item by looking at his/her preferences towards similar items. Be aware that, in this
context, similarity means similarity of rating behaviour, not similarity of content.

The standard procedure is to compare the columns of the user-item-matrix (the item-vectors)
pairwise, using a similarity measure like pearson-correlation, cosine or loglikelihood, to obtain
similar items and to use those together with the user's ratings to predict his/her preference
towards unknown items.
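
The procedure can be illustrated with a minimal, non-distributed Python sketch. It is not
Mahout's implementation: it assumes plain cosine similarity with missing ratings treated as zero,
and it predicts a preference as the similarity-weighted average of the user's known ratings. All
function names and the toy data are made up for this example.

```python
import math


def item_vectors(matrix):
    """Transpose {user: {item: rating}} into {item: {user: rating}}."""
    vectors = {}
    for user, prefs in matrix.items():
        for item, rating in prefs.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse item-vectors (missing ratings count as zero)."""
    dot = sum(vec_a[user] * vec_b[user] for user in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(r * r for r in vec_a.values()))
    norm_b = math.sqrt(sum(r * r for r in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(matrix, vectors, user, target_item):
    """Similarity-weighted average of the user's ratings for the other items."""
    numerator = 0.0
    denominator = 0.0
    for item, rating in matrix[user].items():
        if item == target_item:
            continue
        similarity = cosine(vectors[target_item], vectors[item])
        numerator += similarity * rating
        denominator += abs(similarity)
    # No similar item was rated by this user: no prediction possible.
    return numerator / denominator if denominator else None


# Made-up toy data: u1 has not rated item "c" yet.
toy_matrix = {
    "u1": {"a": 5.0, "b": 4.0},
    "u2": {"a": 4.0, "b": 5.0, "c": 4.0},
    "u3": {"b": 2.0, "c": 5.0},
}
toy_vectors = item_vectors(toy_matrix)
predicted = predict(toy_matrix, toy_vectors, "u1", "c")
```

Because the prediction is a weighted average, it always lies between the lowest and highest
rating the user has given to the items that are similar to the target item.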


h3. Map/Reduce implementations

Mahout offers two Map/Reduce jobs that support Itembased Collaborative Filtering.

*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob* computes the similarities
between items. It expects a .csv file with the preference data as input, where each line represents
a single preference in the form _userID,itemID,value_, and outputs pairs of itemIDs with their
associated similarity value.

{code}
  --input (-i) input                                        Path to job input directory.
  --output (-o) output                                      The directory pathname for output.
  --similarityClassname (-s) similarityClassname            Name of distributed similarity class to instantiate,
                                                            alternatively use one of the predefined similarities
                                                            (SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
                                                            SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
                                                            SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
                                                            SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
  --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem      try to cap the number of similar items per item to this
                                                            number (default: 100)
  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem    try to cap the number of cooccurrences per item to this
                                                            number (default: 100)
  --booleanData (-b) booleanData                            Treat input as without pref values
{code}

*org.apache.mahout.cf.taste.hadoop.item.RecommenderJob* is a completely distributed itembased
recommender. It expects a .csv file with the preference data as input, where each line represents
a single preference in the form _userID,itemID,value_ and outputs userIDs with associated
recommended itemIDs and their scores.

{code}
  --input (-i) input                                        Path to job input directory.
  --output (-o) output                                      The directory pathname for output.
  --numRecommendations (-n) numRecommendations              Number of recommendations per user
  --usersFile (-u) usersFile                                File of users to recommend for
  --itemsFile (-i) itemsFile                                File of items to recommend for
  --filterFile (-f) filterFile                              File containing comma-separated userID,itemID pairs. Used to
                                                            exclude the item from the recommendations for that user
                                                            (optional)
  --booleanData (-b) booleanData                            Treat input as without pref values
  --maxPrefsPerUser maxPrefsPerUser                         Maximum number of preferences considered per user in final
                                                            recommendation phase
  --maxSimilaritiesPerItem maxSimilaritiesPerItem           Maximum number of similarities considered per item
  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem    try to cap the number of cooccurrences per item to this
                                                            number (default: 100)
  --similarityClassname (-s) similarityClassname            Name of distributed similarity class to instantiate,
                                                            alternatively use one of the predefined similarities
                                                            (SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
                                                            SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
                                                            SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
                                                            SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
{code}

TODO: add more details

h3. Resources

* [Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf]
* [Slides: Distributed Itembased Collaborative Filtering with Apache Mahout|http://www.slideshare.net/sscdotopen/mahoutcf]
