mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Recommender First-Timer FAQ
Date Fri, 31 Dec 2010 13:35:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Recommender First-Timer FAQ (https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+First-Timer+FAQ)

Added by Sean Owen:
---------------------------------------------------------------------
Many people with an interest in recommenders arrive at Mahout because they're building a first
recommender system. Some starting questions have been asked enough times to warrant a FAQ
collecting advice and rules of thumb for newcomers.

For the interested, these topics are treated in detail in the book [Mahout in Action|http://manning.com/owen/].

Don't start with a distributed, Hadoop-based recommender; take on that complexity only if
necessary. Start with non-distributed recommenders. They are simpler, have fewer requirements,
and are more flexible.

As a crude rule of thumb, a system with up to 100M user-item associations (ratings, preferences)
should "fit" onto one modern server machine with 4GB of heap available and run acceptably
as a real-time recommender. The system is invariably memory-bound since keeping data in memory
is essential to performance.
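
To make the rule of thumb concrete, the arithmetic below works out the heap budget it implies
per association. This is only a back-of-envelope sketch: the class and method names are made
up for illustration, and the result is a budget, not a measurement of what Mahout's data
structures actually consume.

```java
// Back-of-envelope check of the rule of thumb: how many bytes of heap each
// association gets if 100M of them must fit in a 4GB heap.
final class HeapBudget {

    /** Bytes of heap available per association, rounded down. */
    static long bytesPerAssociation(long heapBytes, long associations) {
        if (associations <= 0) {
            throw new IllegalArgumentException("associations must be positive");
        }
        return heapBytes / associations;
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024;  // 4GB heap, from the rule of thumb
        long associations = 100_000_000L;     // 100M user-item associations
        // prints "42 bytes per association"
        System.out.println(bytesPerAssociation(heap, associations) + " bytes per association");
    }
}
```

Roughly 40 bytes per association leaves little slack once IDs, values, and index overhead are
counted, which is consistent with the system being memory-bound.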

Beyond this point it gets expensive to deploy a machine with enough RAM, so designing for
a distributed system makes sense when nearing this scale. However, most applications don't "really"
have 100M associations to process. Data can be sampled; noisy and old data can often be aggressively
pruned without significant impact on the result.
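
As an illustration of pruning and sampling, here is a minimal sketch that drops old
associations and then keeps a random fraction of the rest before anything is loaded into a
recommender. The Pref class, its timestamp field, and both method names are assumptions made
for this example; they are not Mahout classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: trims a preference list before loading it.
final class PreferencePruner {

    /** One user-item association. The timestamp field is an assumption of this sketch. */
    static final class Pref {
        final long userID;
        final long itemID;
        final float value;
        final long timestampMillis;

        Pref(long userID, long itemID, float value, long timestampMillis) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Keeps only associations recorded at or after the cutoff time. */
    static List<Pref> dropOlderThan(List<Pref> prefs, long cutoffMillis) {
        List<Pref> kept = new ArrayList<>();
        for (Pref p : prefs) {
            if (p.timestampMillis >= cutoffMillis) {
                kept.add(p);
            }
        }
        return kept;
    }

    /** Keeps each association independently with probability keepRate (0.0 to 1.0). */
    static List<Pref> sample(List<Pref> prefs, double keepRate, Random random) {
        List<Pref> kept = new ArrayList<>();
        for (Pref p : prefs) {
            if (random.nextDouble() < keepRate) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

Whether a given cutoff or sampling rate is acceptable depends on the data; comparing
recommendation quality before and after pruning is the only real check.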

The next question is whether your system has preference values (ratings). Do users
and items merely have an association or not, such as the existence or lack of a click? Or
is behavior translated into some scalar value representing the user's degree of preference
for the item?

If you have ratings, a good place to start is GenericItemBasedRecommender with PearsonCorrelationSimilarity
as the similarity metric. If you don't have ratings, a good place to start is GenericBooleanPrefItemBasedRecommender
with LogLikelihoodSimilarity.
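
To show why the two cases call for different metrics, the sketch below computes the two
statistics those class names refer to. It is not Mahout's implementation; the class and
method names are made up for this example. Pearson correlation needs the rating values of
users who rated both items, while the log-likelihood ratio needs only counts: k11 users
associated with both items, k12 with the first only, k21 with the second only, and k22 with
neither.

```java
// Illustrative statistics only; Mahout's own similarity classes may differ in detail.
final class SimilaritySketch {

    /** Pearson correlation of two equal-length rating vectors; NaN when undefined. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        int n = x.length;
        if (n == 0) {
            return Double.NaN;
        }
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        double denominator = Math.sqrt(sxx * syy);
        // A constant vector has zero variance, so the correlation is undefined.
        return denominator == 0.0 ? Double.NaN : sxy / denominator;
    }

    /** Log-likelihood ratio (G^2) for a 2x2 table of co-occurrence counts. */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double n = (double) k11 + k12 + k21 + k22;
        double row1 = (double) k11 + k12;
        double row2 = (double) k21 + k22;
        double col1 = (double) k11 + k21;
        double col2 = (double) k12 + k22;
        return 2.0 * (term(k11, row1, col1, n) + term(k12, row1, col2, n)
                + term(k21, row2, col1, n) + term(k22, row2, col2, n));
    }

    // One cell's contribution: observed * ln(observed / expected); zero cells contribute 0.
    private static double term(long k, double row, double col, double n) {
        return k == 0 ? 0.0 : k * Math.log(k * n / (row * col));
    }

    public static void main(String[] args) {
        // Ratings given to two items by the same three users.
        System.out.println(pearson(new double[] {5, 3, 4}, new double[] {4, 2, 3}));
        // 10 users clicked both items, 10 clicked neither, none clicked just one.
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    }
}
```

A ratio near zero means the co-occurrence is about what chance would produce; larger values
indicate stronger evidence that the two items are associated.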

If you want to do content-based item-item similarity, you need to implement your own ItemSimilarity.
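
As a rough idea of what such an implementation would compute, the sketch below scores two
items by the overlap of their attributes (tags, categories, and so on). It shows only the
scoring function, under the assumption that each item has a set of string attributes; the
class name is made up, and wiring the function into an ItemSimilarity implementation is left
out because that interface's exact methods depend on the Mahout version in use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical content-based score: Jaccard overlap of item attribute sets.
final class ContentSimilaritySketch {

    /** Size of the intersection over size of the union, in [0, 1]; 0 when both sets are empty. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<String> film1 = new HashSet<>(Arrays.asList("comedy", "1990s", "road-trip"));
        Set<String> film2 = new HashSet<>(Arrays.asList("comedy", "1990s", "heist"));
        // Two shared attributes out of four distinct ones.
        System.out.println(jaccard(film1, film2));
    }
}
```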

If your data can be simply exported to a CSV file, use FileDataModel and push new files periodically.
If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref" counterpart if
appropriate, or its PostgreSQL counterpart, etc.) and wrap it in a ReloadFromJDBCDataModel.
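
For the CSV route, an export step might look like the sketch below. It assumes one line per
association in the form userID,itemID,preference, which is the layout FileDataModel is
understood to read; check the javadoc of the version in use. The class and method names are
made up for this example. The file is written under a temporary name first so that a reader
is less likely to pick up a half-written file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical export helper, not part of Mahout.
final class PreferenceCsvExport {

    /** Formats one association as "userID,itemID,preference". */
    static String toCsvLine(long userID, long itemID, float preference) {
        return userID + "," + itemID + "," + preference;
    }

    /** Writes the lines to a temporary file beside the target, then moves it into place. */
    static void publish(List<String> lines, Path target) throws IOException {
        Path directory = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(directory, "prefs", ".tmp");
        Files.write(temp, lines, StandardCharsets.UTF_8);
        // Replaces any previous export; whether this is atomic depends on the filesystem.
        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Running publish on a schedule (for example from a cron-driven job) gives the periodic push
described above.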

This should give a reasonable starter system that responds quickly. By design, new data
arrives from the file or database only periodically, perhaps on the order
of minutes. If that's not acceptable, you'll have to look into more specialized work: SlopeOneRecommender
deals with updates quickly, or it is possible to do some work to update the GenericDataModel
in real time.
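
To illustrate why a slope-one approach can absorb updates quickly, the sketch below keeps the
state the algorithm relies on: a running average of the rating difference for each pair of
items. This is not Mahout's SlopeOneRecommender; the class and method names are made up, and
the sketch assumes each user rates a given item at most once. Adding a rating touches only
the pairs formed with that user's other rated items.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of slope-one state, not Mahout's implementation.
final class SlopeOneDiffs {

    // Per ordered pair "a:b": [sum of (rating of a - rating of b), number of users counted].
    private final Map<String, double[]> diffs = new HashMap<>();
    // Ratings seen so far, per user: itemID -> rating.
    private final Map<Long, Map<Long, Double>> ratingsByUser = new HashMap<>();

    /** Folds one new rating into the pairwise averages; cost is linear in the user's rating count. */
    void addRating(long userID, long itemID, double rating) {
        Map<Long, Double> userRatings =
                ratingsByUser.computeIfAbsent(userID, key -> new HashMap<>());
        for (Map.Entry<Long, Double> other : userRatings.entrySet()) {
            long otherItem = other.getKey();
            if (otherItem == itemID) {
                continue;  // re-rating the same item is outside this sketch
            }
            accumulate(itemID, otherItem, rating - other.getValue());
            accumulate(otherItem, itemID, other.getValue() - rating);
        }
        userRatings.put(itemID, rating);
    }

    /** Average of (rating of a - rating of b) over users who rated both; NaN if there are none. */
    double averageDiff(long a, long b) {
        double[] entry = diffs.get(a + ":" + b);
        return entry == null ? Double.NaN : entry[0] / entry[1];
    }

    private void accumulate(long a, long b, double diff) {
        double[] entry = diffs.computeIfAbsent(a + ":" + b, key -> new double[2]);
        entry[0] += diff;
        entry[1] += 1.0;
    }
}
```

No full rebuild is needed when a rating arrives, which is the property that makes this family
of algorithms attractive when periodic reloads are too slow.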

