mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Recommender Documentation
Date Fri, 26 Aug 2011 16:11:01 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Recommender Documentation (https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation)


Edited by Sean Owen:
---------------------------------------------------------------------
h2. Overview

_This documentation concerns the non-distributed, non-Hadoop-based recommender engine / collaborative
filtering code inside Mahout. It was formerly a separate project called "Taste" and has continued
development inside Mahout alongside other Hadoop-based code. It may be viewed as a somewhat
separate, older, more comprehensive and more mature aspect of this code, compared to current
development efforts focusing on Hadoop-based distributed recommenders. This remains the best
entry point into Mahout recommender engines of all kinds._

A Mahout-based collaborative filtering engine takes users' preferences for items ("tastes")
and returns estimated preferences for other items. For example, a site that sells books or
CDs could easily use Mahout to figure out, from past purchase data, which CDs a customer might
be interested in listening to.

Mahout provides a rich set of components from which you can construct a customized recommender
system from a selection of algorithms. Mahout is designed to be enterprise-ready, with a focus
on performance, scalability and flexibility.

Mahout recommenders are not just for Java; a recommender can be run as an external server which
exposes recommendation logic to your application via web services and HTTP.

Top-level packages define the Mahout interfaces to these key abstractions:
* DataModel
* UserSimilarity
* ItemSimilarity
* UserNeighborhood
* Recommender

Subpackages of org.apache.mahout.cf.taste.impl hold implementations of these interfaces. These
are the pieces from which you will build your own recommendation engine. That's it! For the
academically inclined, Mahout supports *memory-based* and *item-based* recommender systems,
*slope one* recommenders, and a couple of other experimental implementations. It does not currently
support *model-based* recommenders.

h2. Architecture

!https://cwiki.apache.org/confluence/download/attachments/22872433/taste-architecture.png!

This diagram shows the relationship between various Mahout components in a user-based recommender.
An item-based recommender system is similar except that there are no PreferenceInferrers or
Neighborhood algorithms involved.

h3. Recommender
A Recommender is the core abstraction in Mahout. Given a DataModel, it can produce recommendations.
Applications will most likely use the GenericUserBasedRecommender implementation or GenericItemBasedRecommender,
possibly decorated by CachingRecommender.

h3. DataModel
A DataModel is the interface to information about user preferences. An implementation might
draw this data from any source, but a database is the most likely source. Mahout provides
MySQLJDBCDataModel, for example, to access preference data from a database via JDBC and MySQL.
Another exists for PostgreSQL. Mahout also provides a FileDataModel.

There are no abstractions for a user or item in the object model (not anymore). Users and
items are identified solely by an ID value in the framework. Further, this ID value must be
numeric; it is a Java long type through the APIs. A Preference object or PreferenceArray object
encapsulates the relation between user and preferred items (or items and users preferring
them).

Finally, Mahout supports, in various ways, a so-called "boolean" data model in which users
do not express preferences of varying strengths for items, but simply express an association
or none at all. For example, while users might express a preference from 1 to 5 in the context
of a movie recommender site, there may be no notion of a preference value between users and
pages in the context of recommending pages on a web site: there is only a notion of an association,
or none, between a user and pages that have been visited.

h3. UserSimilarity
A UserSimilarity defines a notion of similarity between two Users. This is a crucial part
of a recommendation engine. These are attached to a Neighborhood implementation. ItemSimilarities
are analogous, but find similarity between Items.

h3. UserNeighborhood
In a user-based recommender, recommendations are produced by finding a "neighborhood" of similar
users near a given user. A UserNeighborhood defines a means of determining that neighborhood
— for example, nearest 10 users. Implementations typically need a UserSimilarity
to operate.

h2. Requirements
h3. Required

* [Java/ J2SE 6.0|http://www.java.com/getjava/index.jsp]

h3. Optional
* [Apache Maven|http://maven.apache.org]  2.2.1 or later, if you want to build from source
or build examples. (Mac users note that even OS X 10.5 ships with Maven 2.0.6, which will
not work.)
* Mahout web applications require a [Servlet 2.3+|http://java.sun.com/products/servlet/index.jsp]
container, such as [Apache Tomcat|http://jakarta.apache.org/tomcat/]. It may in fact work
with older containers with slight modification.

h2. Demo

To build and run the demo, follow the instructions below, which are written for Unix-like
operating systems:

* Obtain a copy of the Mahout distribution, either from SVN or as a downloaded archive.
* Download the "1 Million MovieLens Dataset" from [Grouplens.org|http://www.grouplens.org/]
* Unpack the archive and copy movies.dat and ratings.dat to trunk/integration/src/main/resources/org/apache/mahout/cf/taste/example/grouplens
under the Mahout distribution directory.
* Navigate to the directory where you unpacked the Mahout distribution, and navigate to trunk.
* Run mvn -DskipTests install, which builds and installs Mahout core to your local repository
* cd integration
* You may need to give Maven more memory: in a bash shell, export MAVEN_OPTS=-Xmx1024M
* mvn jetty:run.
* Get recommendations by accessing the web application in your browser: http://localhost:8080/mahout-integration/RecommenderServlet?userID=1
This will produce a simple preference-item ID list which could be consumed by a client application.
Get more useful human-readable output with the debug parameter: http://localhost:8080/mahout-integration/RecommenderServlet?userID=1&debug=true


h2. Examples
h3. User-based Recommender
User-based recommenders are the "original", conventional style of recommender system. They
can produce good recommendations when tweaked properly; they are not necessarily the fastest
recommender systems and are thus best suited to small data sets (roughly, fewer than ten million
ratings). We'll start with an example of this.

First, create a DataModel of some kind. Here, we'll use a simple one based on data in a file.
The file should be in CSV format, with lines of the form "userID,itemID,prefValue" (e.g. "39505,290002,3.5"):

{code}
DataModel model = new FileDataModel(new File("data.txt"));
{code}
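To make the expected file format concrete, here is a minimal plain-Java sketch of how one such line breaks down into a numeric user ID, a numeric item ID and a preference value. It is not Mahout's FileDataModel parser; the class name ParsedPreference is hypothetical and exists only for illustration.

```java
// Illustrative sketch only; ParsedPreference is a hypothetical name, not a Mahout class.
final class ParsedPreference {
    final long userID;   // IDs are numeric (Java long), as in the Mahout APIs
    final long itemID;
    final float value;

    ParsedPreference(long userID, long itemID, float value) {
        this.userID = userID;
        this.itemID = itemID;
        this.value = value;
    }

    // Parses one line of the form "userID,itemID,prefValue".
    static ParsedPreference parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated fields: " + line);
        }
        return new ParsedPreference(
            Long.parseLong(fields[0].trim()),
            Long.parseLong(fields[1].trim()),
            Float.parseFloat(fields[2].trim()));
    }
}
```

A line with a non-numeric ID fails with a NumberFormatException in this sketch, which mirrors the requirement that IDs be numeric.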

We'll use the PearsonCorrelationSimilarity implementation of UserSimilarity as our user correlation
algorithm, and add an optional preference inference algorithm:

{code}
UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
// Optional:
userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
{code}
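For intuition about what a Pearson-based user similarity measures, the following plain-Java sketch computes the Pearson correlation of two users over the items both have rated. It is a simplified illustration under stated assumptions, not Mahout's PearsonCorrelationSimilarity: the class PearsonSketch and the map-based representation of a user's preferences are hypothetical, and no preference inference or weighting is applied.

```java
import java.util.Map;

// Illustrative sketch only; PearsonSketch is a hypothetical name, not a Mahout class.
final class PearsonSketch {

    // Pearson correlation over the items rated by both users.
    // Returns NaN when there is no overlap or when either user's ratings have zero variance.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n == 0) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }
}
```

Values near 1 indicate users whose ratings rise and fall together on shared items; values near -1 indicate opposite tastes.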

Now we create a UserNeighborhood algorithm. Here we use nearest-3:

{code}
UserNeighborhood neighborhood =
          new NearestNUserNeighborhood(3, userSimilarity, model);
{code}
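The idea behind a nearest-N neighborhood can be sketched in plain Java as follows: given each candidate user's similarity to the target user, keep the N most similar. This is only an illustration of the concept; NearestNSketch and its map input are hypothetical and do not reflect how NearestNUserNeighborhood is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; NearestNSketch is a hypothetical name, not a Mahout class.
final class NearestNSketch {

    // Returns the IDs of the n users most similar to the target user,
    // most similar first. Undefined (NaN) similarities are skipped.
    static List<Long> nearest(int n, Map<Long, Double> similarityToTarget) {
        List<Map.Entry<Long, Double>> candidates = new ArrayList<>();
        for (Map.Entry<Long, Double> e : similarityToTarget.entrySet()) {
            if (!Double.isNaN(e.getValue())) {
                candidates.add(e);
            }
        }
        // Sort by similarity, highest first.
        candidates.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, candidates.size()); i++) {
            result.add(candidates.get(i).getKey());
        }
        return result;
    }
}
```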

Now we can create our Recommender, and add a caching decorator:

{code}
Recommender recommender =
          new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);
{code}

Now we can get 10 recommendations for user ID "1234" — done!
{code}
List<RecommendedItem> recommendations =
          cachingRecommender.recommend(1234, 10);
{code}

h3. Item-based Recommender

We could have created an item-based recommender instead. Item-based recommenders base recommendations
not on user similarity, but on item similarity. In theory these are about the same approach
to the problem, just from different angles. However, the similarity of two items is relatively
fixed, more so than the similarity of two users. So, item-based recommenders can use pre-computed
similarity values in the computations, which makes them much faster. For large data sets, item-based
recommenders are more appropriate.

Let's start over, again with a FileDataModel to start:

{code}
DataModel model = new FileDataModel(new File("data.txt"));
{code}

We'll also need an ItemSimilarity. We could use PearsonCorrelationSimilarity, which computes
item similarity in real time, but this is generally too slow to be useful. Instead, in a real
application, you would feed a list of pre-computed correlations to a GenericItemSimilarity:


{code}
// Construct the list of pre-computed correlations
Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
          ...;
ItemSimilarity itemSimilarity =
          new GenericItemSimilarity(correlations);

{code}

Then we can finish as before to produce recommendations:

{code}
Recommender recommender =
          new GenericItemBasedRecommender(model, itemSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);
...
List<RecommendedItem> recommendations =
          cachingRecommender.recommend(1234, 10);
{code}
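To illustrate why pre-computed item similarities make estimation cheap, here is a plain-Java sketch that estimates a user's preference for an item as a similarity-weighted average of that user's existing preferences, using only table lookups. It is a simplified illustration, not the algorithm in GenericItemBasedRecommender: the class ItemBasedSketch is hypothetical, and ignoring non-positive similarities is an assumption made to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; ItemBasedSketch is a hypothetical name, not a Mahout class.
final class ItemBasedSketch {
    // Pre-computed similarities keyed by an unordered item pair.
    private final Map<String, Double> similarities = new HashMap<>();

    private static String key(long itemA, long itemB) {
        return Math.min(itemA, itemB) + ":" + Math.max(itemA, itemB);
    }

    void put(long itemA, long itemB, double similarity) {
        similarities.put(key(itemA, itemB), similarity);
    }

    // Similarity-weighted average of the user's known preferences.
    // Returns NaN when no rated item has a positive similarity to the target.
    double estimate(Map<Long, Double> userPrefs, long targetItem) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Long, Double> pref : userPrefs.entrySet()) {
            Double similarity = similarities.get(key(pref.getKey(), targetItem));
            if (similarity == null || similarity <= 0.0) {
                continue; // simplification: unknown or non-positive similarities are ignored
            }
            weightedSum += similarity * pref.getValue();
            totalWeight += similarity;
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }
}
```

No similarity is computed at request time here; each estimate costs one lookup per item the user has rated.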

h3. Slope-One Recommender
This is a simple yet effective Recommender and we present another example to round out the
list:

{code}
DataModel model = new FileDataModel(new File("data.txt"));
// Make a weighted slope one recommender
Recommender recommender = new SlopeOneRecommender(model);
Recommender cachingRecommender = new CachingRecommender(recommender);
{code}
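For readers unfamiliar with the technique, the following plain-Java sketch shows the weighted slope one estimate described by Lemire and Maclachlan: for each item the user has rated, take the average difference between the target item and that item across users who rated both, add the user's own rating, and weight by the number of co-rating users. It is an unoptimized illustration, not Mahout's SlopeOneRecommender; the class SlopeOneSketch and the nested-map data layout are hypothetical.

```java
import java.util.Map;

// Illustrative sketch only; SlopeOneSketch is a hypothetical name, not a Mahout class.
final class SlopeOneSketch {

    // allPrefs maps userID -> (itemID -> preference value).
    // Returns the weighted slope one estimate of the user's preference for targetItem,
    // or NaN when no rated item has been co-rated with the target by any user.
    static double estimate(Map<Long, Map<Long, Double>> allPrefs,
                           Map<Long, Double> userPrefs,
                           long targetItem) {
        double numerator = 0.0;
        long denominator = 0;
        for (Map.Entry<Long, Double> rated : userPrefs.entrySet()) {
            long item = rated.getKey();
            if (item == targetItem) {
                continue;
            }
            double diffSum = 0.0;
            int count = 0;
            // Accumulate (target - item) over every user who rated both items.
            for (Map<Long, Double> other : allPrefs.values()) {
                Double targetValue = other.get(targetItem);
                Double itemValue = other.get(item);
                if (targetValue != null && itemValue != null) {
                    diffSum += targetValue - itemValue;
                    count++;
                }
            }
            if (count == 0) {
                continue;
            }
            double averageDiff = diffSum / count;
            numerator += (averageDiff + rated.getValue()) * count;
            denominator += count;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }
}
```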


    
h2. Integration with your application
h3. Direct

You can create a Recommender, as shown above, wherever you like in your Java application,
and use it. This includes simple Java applications or GUI applications, server applications,
and J2EE web applications.

h3. Standalone server
A Mahout recommender can also be run as an external server, which may be the only option for
non-Java applications. It can be exposed as a web application via org.apache.mahout.cf.taste.web.RecommenderServlet,
and your application can then access recommendations via simple HTTP requests and responses.
See above, and see the javadoc for details.
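As a rough sketch of what a client of the standalone server might look like, the following Java code builds a request URL using the userID and debug parameters shown in the demo above and reads the response as raw text lines. The class RecommenderClientSketch is hypothetical, the base URL is the one from the demo and will differ in your deployment, and the response is deliberately not parsed because its exact format should be taken from the javadoc.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; RecommenderClientSketch is a hypothetical name.
final class RecommenderClientSketch {

    // Builds the request URL from the servlet's base URL and the userID/debug parameters.
    static String requestUrl(String baseUrl, long userID, boolean debug) {
        return baseUrl + "?userID=" + userID + (debug ? "&debug=true" : "");
    }

    // Issues a GET request and returns the response body as lines of text, unparsed.
    static List<String> fetchLines(String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            connection.disconnect();
        }
        return lines;
    }
}
```

With the demo running, fetchLines(requestUrl("http://localhost:8080/mahout-integration/RecommenderServlet", 1, false)) would return whatever the servlet emits for user 1.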

h2. Performance
h3. Runtime Performance
The more data you give, the better. Though Mahout is designed for performance, you will undoubtedly
run into performance issues at some point. For best results, consider using the following
command-line flags to your JVM:

* -server: Enables the server VM, which is generally appropriate for long-running, computation-intensive
applications.
* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte doesn't hurt when dealing
with tens of millions of preferences. Mahout recommenders will generally use as much memory as
you give them for caching, which helps performance. Set the initial and max size to the same
value to avoid wasting time growing the heap, and to avoid having the JVM run minor collections
to avoid growing the heap, which will clear cached values.
* -da -dsa: Disable all assertions.
* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most of them in this
framework
* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only): Use a GC algorithm
designed to take advantage of multiple processors, and designed for throughput. This is a
default in J2SE 5.0.
* -XX:+DisableExplicitGC: Disable calls to System.gc(). These calls can only hurt in the presence
of modern GC algorithms; they may force Mahout to remove cached data needlessly. This flag
isn't needed if you're sure your code and third-party code you use don't call this method.

Also consider the following tips:

* Use CachingRecommender on top of your custom Recommender implementation.
* When using JDBCDataModel, make sure you've taken basic steps to optimize the table storing
preference data. Create a primary key on the user ID and item ID columns, and an index on
them. Set them to be non-null. And so on. Tune your database for lots of concurrent reads!
When using JDBC, the database is almost always the bottleneck. Plenty of memory and caching
are even more important.
* Also, pooling database connections is essential to performance. If using a J2EE container,
it probably provides a way to configure connection pools. If you are creating your own DataSource
directly, try wrapping it in org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource
* See MySQL-specific notes on performance in the javadoc for MySQLJDBCDataModel.

h3. Algorithm Performance: Which One Is Best?
There is no right answer; it depends on your data, your application, environment, and performance
needs. Mahout provides the building blocks from which you can construct the best Recommender
for your application. The links below provide research on this topic. You will probably need
a bit of trial and error to find a setup that works best. The code samples above provide a
good starting point.

Fortunately, Mahout provides a way to evaluate the accuracy of your Recommender on your own
data, in org.apache.mahout.cf.taste.eval:

{code}
DataModel myModel = ...;
RecommenderBuilder builder = new RecommenderBuilder() {
  public Recommender buildRecommender(DataModel model) {
    // build and return the Recommender to evaluate here
  }
};
RecommenderEvaluator evaluator =
          new AverageAbsoluteDifferenceRecommenderEvaluator();
// null: no custom DataModelBuilder; 0.9 and 1.0: training and evaluation percentages
double evaluation = evaluator.evaluate(builder, null, myModel, 0.9, 1.0);
{code}
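The score such an evaluator reports can be understood with a small plain-Java sketch: it is the mean of the absolute differences between actual and estimated preference values, so lower is better and 0.0 is a perfect match. This is an illustration of the metric only, under the assumption that estimates which could not be computed are represented as NaN and skipped; AverageAbsoluteDifferenceSketch is a hypothetical class, not Mahout's evaluator.

```java
// Illustrative sketch only; AverageAbsoluteDifferenceSketch is a hypothetical name.
final class AverageAbsoluteDifferenceSketch {

    // Mean of |actual - estimated| over all pairs where an estimate exists.
    // Returns NaN when no estimate is available at all.
    static double score(double[] actual, double[] estimated) {
        if (actual.length != estimated.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        double total = 0.0;
        int count = 0;
        for (int i = 0; i < actual.length; i++) {
            if (Double.isNaN(estimated[i])) {
                continue; // assumption: NaN marks a preference that could not be estimated
            }
            total += Math.abs(actual[i] - estimated[i]);
            count++;
        }
        return count == 0 ? Double.NaN : total / count;
    }
}
```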

For "boolean" data model situations, where there are no notions of preference value, the above
evaluation based on estimated preference does not make sense. In this case, try this kind
of evaluation, which presents traditional information retrieval figures like precision and
recall, which are more meaningful:

{code}
...
RecommenderIRStatsEvaluator evaluator =
        new GenericRecommenderIRStatsEvaluator();
IRStatistics stats =
        evaluator.evaluate(builder, null, myModel, null, 3,
                           RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
                           1.0);
{code}
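As a reminder of what precision and recall mean in this setting, here is a plain-Java sketch: precision is the fraction of recommended items that are relevant, and recall is the fraction of relevant items that were recommended. It illustrates the definitions only; IRStatsSketch is a hypothetical class, and how Mahout decides which items count as relevant is not modeled here.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only; IRStatsSketch is a hypothetical name, not a Mahout class.
final class IRStatsSketch {

    // Number of recommended items that are also in the relevant set.
    private static int hits(List<Long> recommended, Set<Long> relevant) {
        int hits = 0;
        for (Long itemID : recommended) {
            if (relevant.contains(itemID)) {
                hits++;
            }
        }
        return hits;
    }

    // Fraction of recommended items that are relevant; NaN if nothing was recommended.
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty()
            ? Double.NaN
            : (double) hits(recommended, relevant) / recommended.size();
    }

    // Fraction of relevant items that were recommended; NaN if nothing is relevant.
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty()
            ? Double.NaN
            : (double) hits(recommended, relevant) / relevant.size();
    }
}
```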


h2. Useful Links
You'll want to look at these packages too, which offer more algorithms and approaches that
you may find useful:

* [Cofi|http://www.nongnu.org/cofi/]: A Java-Based Collaborative Filtering Library
* [CoFE|http://eecs.oregonstate.edu/iis/CoFE/]

Here's a handful of research papers that I've read and found particularly useful:

J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive Algorithms for
Collaborative Filtering|http://research.microsoft.com/research/pubs/view.aspx?tr_id=166],"
in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI
1998), 1998.

B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative filtering recommendation
algorithms|http://www10.org/cdrom/papers/519/]" in Proceedings of the Tenth International
Conference on the World Wide Web (WWW 10), pp. 285-295, 2001.

P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an open architecture
for collaborative filtering of netnews|http://doi.acm.org/10.1145/192844.192905]" in Proceedings
of the 1994 ACM conference on Computer Supported Cooperative Work (CSCW 1994), pp. 175-186,
1994.

J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic framework for performing
collaborative filtering|http://www.grouplens.org/papers/pdf/algs.pdf]" in Proceedings of the
22nd annual international ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR 99), pp. 230-237, 1999.

Clifford Lyon, "[Movie Recommender|http://materialobjects.com/cf/MovieRecommender.pdf]" CSCI
E-280 final project, Harvard University, 2004.

Daniel Lemire, Anna Maclachlan, "[Slope One Predictors for Online Rating-Based Collaborative
Filtering|http://www.daniel-lemire.com/fr/abstracts/SDM2005.html]," Proceedings of SIAM Data
Mining (SDM '05), 2005.

Michelle Anderson, Marcel Ball, Harold Boley, Stephen Greene, Nancy Howse, Daniel Lemire and
Sean McGrath, "[RACOFI: A Rule-Applying Collaborative Filtering System|http://www.daniel-lemire.com/fr/documents/publications/racofi_nrc.pdf],"
Proceedings of COLA '03, 2003.

These links will take you to all the collaborative filtering reading you could ever want!
* [Paul Perry's notes|http://www.paulperry.net/notes/cf.asp]
* [James Thornton's collaborative filtering resources|http://jamesthornton.com/cf/]
* [Daniel Lemire's blog|http://www.daniel-lemire.com/blog/] which frequently covers collaborative
filtering topics

