mahout-commits mailing list archives

From rawkintr...@apache.org
Subject [1/3] mahout git commit: WEBSITE Ported MR-Clustering Tutorials and Algos
Date Mon, 01 May 2017 21:29:51 GMT
Repository: mahout
Updated Branches:
  refs/heads/website b582dc529 -> 516e3fb9a


http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/intro-itembased-hadoop.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/recommender/intro-itembased-hadoop.md b/website/docs/tutorials/map-reduce/recommender/intro-itembased-hadoop.md
new file mode 100644
index 0000000..ee2c3e8
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/recommender/intro-itembased-hadoop.md
@@ -0,0 +1,54 @@
+---
+layout: default
+title: Introduction to Item-Based Recommendations with Hadoop
+theme:
+    name: retro-mahout
+---
+# Introduction to Item-Based Recommendations with Hadoop
+
+## Overview
+
+Mahout’s item based recommender is a flexible and easily implemented algorithm with a diverse
range of applications. The minimalism of the primary input file’s structure and availability
of ancillary filtering controls can make sourcing required data and shaping a desired output
both efficient and straightforward.
+
+Typical use cases include:
+
+* Recommend products to customers via an eCommerce platform (think: Amazon, Netflix, Overstock)
+* Identify organic sales opportunities
+* Segment users/customers based on similar item preferences
+
+Broadly speaking, Mahout's item-based recommendation algorithm takes as input customer preferences by item and generates an output recommending similar items, with a score indicating how strongly a customer is expected to "like" each recommended item.
+
+One of the strengths of the item-based recommender is its adaptability to your business conditions or research interests. For example, there are many available approaches for providing product preference. One such method is to calculate the total orders for a given product for each customer (e.g. Acme Corp has ordered Widget-A 5,678 times), while others rely on user preference captured via the web (e.g. Jane Doe rated a movie as five stars, or gave a product two thumbs up).
+
+Additionally, a variety of methodologies can be implemented to narrow the focus of Mahout's
recommendations, such as:
+
+* Exclude low volume or low profitability products from consideration
+* Group customers by segment or market rather than using user/customer level data
+* Exclude zero-dollar transactions, returns or other order types
+* Map product substitutions into the Mahout input (e.g. if Widget-A is a recommended item, replace it with Widget-X)
+
+The item-based recommender output can be easily consumed by downstream applications (e.g. websites, ERP systems or sales force automation tools) and is configurable, so users can determine the number of item recommendations generated by the algorithm.
+
+## Example
+
+Testing the item based recommender can be a simple and potentially quite rewarding endeavor.
Whereas the typical sample use case for collaborative filtering focuses on utilization of, and integration with, eCommerce platforms, we can instead look at a potential use case applicable
to most businesses (even those without a web presence). Let’s look at how a company might
use Mahout’s item based recommender to identify new sales opportunities for an existing
customer base. First, you’ll need to get Mahout up and running, the instructions for which
can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've
ensured Mahout is properly installed, we’re ready to run a quick example.
+
+**Step 1: Gather some test data**
+
+Mahout’s item based recommender relies on three key pieces of data: *userID*, *itemID*
and *preference*. The “users” could be website visitors or simply customers that purchase
products from your business. Similarly, items could be products, product groups or even pages
on your website – really anything you would want to recommend to a group of users or customers.
For our example let’s use customer orders as a proxy for preference. A simple count of distinct
orders by customer, by product will work for this example. You’ll find as you explore ways
to manipulate the item based recommender the preference value can be many things (page clicks,
explicit ratings, order counts, etc.). Once your test data is gathered put it in a *.txt*
file separated by commas with no column headers included.
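+
+For illustration, a few lines of such an input file might look like the following (the customer IDs, product IDs and order counts here are made up):
+
+    1001,200,12
+    1001,210,3
+    1002,200,5
+    1002,220,1
+    1003,210,7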
+
+**Step 2: Pick a similarity measure**
+
+Choosing a similarity measure for use in a production environment is something that requires
careful testing, evaluation and research. For our example purposes, we’ll just go with a
Mahout similarity classname called *SIMILARITY_LOGLIKELIHOOD*.
+
+**Step 3: Configure the Mahout command**
+
+Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly we’re
ready to configure our syntax. Enter the following command:
+
+    $ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output --numRecommendations 25
+
+Running the command will execute a series of jobs, the final product of which will be an output file deposited to the directory specified in the command syntax. The output file will contain two columns: the *userID* and an array of *itemIDs* and scores.
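+
+For example, a single line of output typically looks something like this (a *userID*, then a bracketed list of *itemID:score* pairs; the exact values will of course depend on your data):
+
+    1001	[220:4.7,230:4.3,240:3.9]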
+
+**Step 4: Making use of the output and doing more with Mahout**
+
+The output file generated in our simple example can be transformed using your tool of choice
and consumed by downstream applications. There are a variety of configuration options for Mahout's item-based recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions.
Our user community is accessible via our [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html)
and the book *Mahout In Action* is a fantastic (but slightly outdated) starting point. 

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/matrix-factorization.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/recommender/matrix-factorization.md b/website/docs/tutorials/map-reduce/recommender/matrix-factorization.md
new file mode 100644
index 0000000..63de4fd
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/recommender/matrix-factorization.md
@@ -0,0 +1,187 @@
+---
+layout: default
+title: Introduction to Matrix Factorization for Recommendation Mining
+theme:
+    name: retro-mahout
+---
+<a name="MatrixFactorization-Intro"></a>
+# Introduction to Matrix Factorization for Recommendation Mining
+
+In the mathematical discipline of linear algebra, a matrix decomposition
+or matrix factorization is a factorization of a matrix into a product of matrices, usually two.
+There are many different matrix decompositions; each finds use in a particular class of problems.
+
+In Mahout, the SVDRecommender provides an interface to build recommenders based on matrix factorization.
+The idea behind it is to project the users and items onto a feature space and try to optimize U and M so that U \* (M^t) is as close to R as possible:
+
+     U is the n * p user feature matrix,
+     M is the m * p item feature matrix, M^t is the transpose of M,
+     R is n * m rating matrix,
+     n is the number of users,
+     m is the number of items,
+     p is the number of features
+
+We usually use RMSE to represent the deviation between predictions and actual ratings.
+RMSE is defined as the square root of the mean of the squared errors over all known user-item ratings.
+So our matrix factorization target can be mathematically defined as:
+
+     find U and M, (U, M) = argmin(RMSE) = argmin(pow(SSE / K, 0.5))
+     
+     SSE = sum(e(u,i)^2)
+     e(u,i) = r(u,i) - U[u,] * (M[i,]^t) = r(u,i) - sum(U[u,f] * M[i,f]), f = 0, 1, ..., p - 1
+     K is the number of known user item ratings.
+
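+As a concrete illustration, here is a minimal Java sketch of this computation on plain dense arrays (the method and variable names are only for illustration and are not part of the Mahout API):
+
+    // ratings[k] = {u, i} and values[k] = r(u,i) enumerate the K known ratings;
+    // U is n x p, M is m x p
+    static double rmse(int[][] ratings, double[] values, double[][] U, double[][] M) {
+      double sse = 0.0;
+      for (int k = 0; k < values.length; k++) {
+        int u = ratings[k][0];
+        int i = ratings[k][1];
+        double predicted = 0.0;
+        for (int f = 0; f < U[0].length; f++) {
+          predicted += U[u][f] * M[i][f];     // U[u,] * (M[i,]^t)
+        }
+        double e = values[k] - predicted;     // e(u,i)
+        sse += e * e;                         // SSE
+      }
+      return Math.sqrt(sse / values.length);  // pow(SSE / K, 0.5)
+    }
+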
+<a name="MatrixFactorization-Factorizers"></a>
+
+Mahout has implemented matrix factorization based on 
+
+    (1) SGD (Stochastic Gradient Descent)
+    (2) ALSWR (Alternating-Least-Squares with Weighted-λ-Regularization).
+
+## SGD
+
+Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions:
+
+       Q(w) = sum(Q_i(w)), 
+
+where w is the parameter vector to be estimated,
+      Q(w) is the objective function that can be expressed as a sum of differentiable functions,
+      Q_i(w) is the term associated with the i-th observation in the data set
+
+In practice, w is estimated iteratively, updating it at each single sample until an approximate minimum is obtained,
+
+      w = w - alpha * (d(Q_i(w))/dw),
+where alpha is the learning rate,
+      (d(Q_i(w))/dw) is the first derivative of Q_i(w) with respect to w.
+
+In matrix factorization, the RatingSGDFactorizer class implements the SGD with w = (U, M)
and objective function Q(w) = sum(Q(u,i)),
+
+       Q(u,i) = e(u,i)^2 / 2 + lambda * [(U[u,] * (U[u,]^t)) + (M[i,] * (M[i,]^t))] / 2
+
+where Q(u, i) is the objective function for user u and item i,
+      e(u, i) is the error between predicted rating and actual rating,
+      U[u,] is the feature vector of user u,
+      M[i,] is the feature vector of item i,
+      lambda is the regularization parameter to prevent overfitting.
+
+The algorithm is sketched as follows:
+  
+      init U and M with small random values (e.g. drawn from a Gaussian distribution)
+      
+      for(iter = 0; iter < numIterations; iter++)
+      {
+          for(user u and item i with rating R[u,i])
+          {
+              predicted_rating = U[u,] * M[i,]^t // dot product of feature vectors between user u and item i
+              err = R[u, i] - predicted_rating
+              //adjust U[u,] and M[i,]
+              // p is the number of features
+              for(f = 0; f < p; f++) {
+                 NU[u,f] = U[u,f] - alpha * d(Q(u,i))/d(U[u,f]) //optimize U[u,f]
+                         = U[u, f] + alpha * (e(u,i) * M[i,f] - lambda * U[u,f]) 
+              }
+              for(f = 0; f < p; f++) {
+                 M[i,f] = M[i,f] - alpha * d(Q(u,i))/d(M[i,f])  //optimize M[i,f] 
+                        = M[i,f] + alpha * (e(u,i) * U[u,f] - lambda * M[i,f]) 
+              }
+              U[u,] = NU[u,]
+          }
+      }
+
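+The sketch above translates fairly directly into code. Below is a minimal, self-contained Java version of a single pass over the known ratings, using plain dense arrays and illustrative names (this is a sketch of the technique, not the RatingSGDFactorizer itself):
+
+    // ratings[k] = {u, i}, values[k] = R[u,i]; U is n x p, M is m x p
+    static void sgdEpoch(int[][] ratings, double[] values,
+                         double[][] U, double[][] M, double alpha, double lambda) {
+      int p = U[0].length;
+      for (int k = 0; k < values.length; k++) {
+        int u = ratings[k][0];
+        int i = ratings[k][1];
+        double predicted = 0.0;
+        for (int f = 0; f < p; f++) {
+          predicted += U[u][f] * M[i][f];            // dot product U[u,] * M[i,]^t
+        }
+        double err = values[k] - predicted;          // e(u,i)
+        for (int f = 0; f < p; f++) {
+          double uf = U[u][f];                       // keep the old values, like NU above
+          double mf = M[i][f];
+          U[u][f] = uf + alpha * (err * mf - lambda * uf);  // update user feature
+          M[i][f] = mf + alpha * (err * uf - lambda * mf);  // update item feature
+        }
+      }
+    }
+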
+## SVD++
+
+SVD++ is an enhancement of the SGD matrix factorization. 
+
+It can be considered an integration of the latent factor model and the neighborhood-based model, taking into account not only how users rate, but also who has rated what.
+
+The complete model is a sum of 3 sub-models, with the complete prediction formula as follows:
+
+    pr(u,i) = b[u,i] + fm + nm   //user u and item i
+    
+    pr(u,i) is the predicted rating of user u on item i,
+    b[u,i] = U + b(u) + b(i)
+    fm = (q[i,]) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])),  j is an item in N(u)
+    nm = pow(|R(i;u;k)|, -0.5) * sum((r[u,j0] - b[u,j0]) * w[i,j0]) + pow(|N(i;u;k)|, -0.5) * sum(c[i,j1]), j0 is an item in R(i;u;k), j1 is an item in N(i;u;k)
+
+The associated regularized squared error function to be minimized is:
+
+    sum((r[u,i] - pr[u,i]) * (r[u,i] - pr[u,i])) + lambda * (b(u) * b(u) + b(i) * b(i) + ||q[i,]||^2 + ||p[u,]||^2 + sum(||y[j,]||^2) + sum(w[i,j0] * w[i,j0]) + sum(c[i,j1] * c[i,j1]))
+
+b[u,i] is the baseline estimate of user u's predicted rating on item i. Here U denotes the users' overall average rating (not the user feature matrix), and b(u) and b(i) indicate the observed deviations of user u's and item i's ratings from the average.
+
+The baseline estimate adjusts for user and item effects, i.e. systematic tendencies for some users to give higher ratings than others, and for some items to receive higher ratings than other items.
+
+fm is the latent factor model that captures the interactions between user and item via a feature layer. q[i,] is the feature vector of item i, and the rest of the formula represents user u with a user feature vector plus a sum of the features of the items in N(u).
+N(u) is the set of items for which user u has expressed a preference, and y[j,] is the feature vector of an item in N(u).
+
+nm is an extension of the classic item-based neighborhood model. 
+It captures not only the user's explicit ratings but also the user's implicit preferences.
R(i;u;k) is the set of items that have received an explicit rating from user u, restricted to the k items most similar to item i. r[u,j0] is the actual rating of user u on item j0, and
+b[u,j0] is the corresponding baseline estimate.
+
+The difference between r[u,j0] and b[u,j0] is weighted by a parameter w[i,j0], which can be thought of as the similarity between items i and j0.
+
+N(i;u;k) is the set of the k items most similar to item i among those for which the user has expressed an implicit preference.
+c[i,j1] is the parameter to be estimated.
+
+The value of w[i,j0] and c[i,j1] could be treated as the significance of the 
+user's explicit rating and implicit preference respectively.
+
+The parameters b, y, q, w, c are to be determined by minimizing the associated regularized
squared error function through gradient descent. We loop over all known ratings and for a
given training case r[u,i], we apply gradient descent on the error function and modify the
parameters by moving in the opposite direction of the gradient.
+
+For a complete analysis of the SVD++ algorithm,
+please refer to the paper [Yehuda Koren: Factorization Meets the Neighborhood: a Multifaceted
Collaborative Filtering Model, KDD 2008](http://research.yahoo.com/files/kdd08koren.pdf).
+ 
+In Mahout, the SVDPlusPlusFactorizer class is a simplified implementation of the SVD++ algorithm. It mainly uses the latent factor model with the item feature vector, user feature vector and user's preferences, with pr(u,i) = fm = (q[i,]) \* (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])), and the parameters to be determined are q, p and y.
+
+The update to q, p, y in each gradient descent step is:
+
+      err(u,i) = r[u,i] - pr[u,i]
+      q[i,] = q[i,] + alpha * (err(u,i) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])) - lambda * q[i,])
+      p[u,] = p[u,] + alpha * (err(u,i) * q[i,] - lambda * p[u,])
+      for j that is an item in N(u):
+         y[j,] = y[j,] + alpha * (err(u,i) * pow(|N(u)|, -0.5) * q[i,] - lambda * y[j,])
+
+where alpha is the learning rate of gradient descent and N(u) is the set of items for which user u has expressed a preference.
+
+## Parallel SGD
+
+Mahout has a parallel SGD implementation in ParallelSGDFactorizer class. It shuffles the
user ratings in every iteration and 
+generates splits on the shuffled ratings. Each split is handled by a thread to update the
user features and item features using 
+vanilla SGD. 
+
+The implementation follows a lock-free version of SGD described in the paper
+[Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf).
+
+## ALSWR
+
+ALSWR is an iterative algorithm to solve the low rank factorization of user feature matrix
U and item feature matrix M.  
+The loss function to be minimized is formulated as the sum of squared errors plus [Tikhonov
regularization](http://en.wikipedia.org/wiki/Tikhonov_regularization):
+
+     L(R, U, M) = sum(pow((R[u,i] - U[u,] * (M[i,]^t)), 2)) + lambda * (sum(n(u) * ||U[u,]||^2) + sum(n(i) * ||M[i,]||^2))
+ 
+At the beginning of the algorithm, M is initialized with the average item ratings as its first row and small random numbers for the remaining entries.
+
+In every iteration, we fix M and solve for U by minimizing the cost function L(R, U, M), then we fix U and solve for M by minimizing the cost function in the same way. The iterations continue until a stopping criterion is met.
+
+To solve for the matrix U when M is given, each user's feature vector is calculated by solving a regularized linear least squares problem,
+using the items the user has rated and their feature vectors:
+
+      1/2 * d(L(R,U,M)) / d(U[u,f]) = 0 
+
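+Setting these derivatives to zero gives, for each user u, a small p * p linear system (the normal equation of a regularized least squares problem). In the row-vector notation used above, this works out to roughly:
+
+     (M_Iu^t * M_Iu + lambda * n(u) * E) * U[u,]^t = M_Iu^t * (R[u,Iu]^t)
+
+     where Iu is the set of items rated by user u,
+           M_Iu is the sub-matrix of M containing only the rows for items in Iu,
+           R[u,Iu] is the row vector of user u's known ratings,
+           n(u) is the number of items user u has rated,
+           E is the p * p identity matrix.
+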
+Similarly, when M is updated, each item's feature vector is calculated by solving a regularized linear least squares problem,
+using the ratings the item has received and the feature vectors of the users that have rated it:
+
+      1/2 * d(L(R,U,M)) / d(M[i,f]) = 0
+
+The ALSWRFactorizer class is a non-distributed implementation of ALSWR using multi-threading
to dispatch the computation among several threads.
+Mahout also offers a [parallel map-reduce implementation](https://mahout.apache.org/users/recommender/intro-als-hadoop.html).
+
+<a name="MatrixFactorization-Reference"></a>
+# References
+
+[Stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
+    
+[ALSWR](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/quickstart.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/recommender/quickstart.md b/website/docs/tutorials/map-reduce/recommender/quickstart.md
new file mode 100644
index 0000000..4627444
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/recommender/quickstart.md
@@ -0,0 +1,32 @@
+---
+layout: default
+title: Recommender Quickstart
+theme:
+    name: retro-mahout
+---
+
+# Recommender Overview
+
+Recommenders have changed over the years. Mahout contains a long list of them, which you
can still use. But to get the best out of our more modern approach we'll need to think of
the Recommender as a "model creation" component&mdash;supplied by Mahout's new spark-itemsimilarity
job, and a "serving" component&mdash;supplied by a modern scalable search engine, like
Solr.
+
+![image](http://i.imgur.com/fliHMBo.png)
+
+To integrate with your application, you will collect user interactions, storing them in a DB and also in a form usable by Mahout. The simplest way to do this is to log user interactions to CSV files (user-id, item-id). The DB should be set up to contain the last n user interactions,
which will form part of the query for recommendations.
+
+Mahout's spark-itemsimilarity will create a table of (item-id, list-of-similar-items) in
csv form. Think of this as an item collection with one field containing the item-ids of similar
items. Index this with your search engine. 
+
+When your application needs recommendations for a specific person, get the latest user history
of interactions from the DB and query the indicator collection with this history. You will
get back an ordered list of item-ids. These are your recommendations. You may wish to filter
out any that the user has already seen but that will depend on your use case.
+
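+As a rough sketch of that query step: if the similar-items output was indexed in Solr in a field called *indicators*, and a user's recent history contains the items *iphone* and *ipad*, the request might look like the following (the collection and field names here are purely illustrative):
+
+    http://localhost:8983/solr/items/select?q=indicators:(iphone ipad)&fl=id&rows=10
+
+The ranked document ids returned by the search engine are the recommended item-ids.
+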
+All ids for users and items are preserved as string tokens and so work as external keys in DBs or as doc ids for search engines; they also work as tokens for search queries.
+
+## References
+
+1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning)
+2. A slide deck, which talks about mixing actions or other indicators: [Creating a Multimodal
Recommender with Mahout and a Search Engine](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+3. Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
+and  [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
+4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). LLR is used to reduce noise in the data while keeping the calculations O(n) complexity.
+
+## Mahout Model Creation
+
+See the page describing [*spark-itemsimilarity*](http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html)
for more details.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/recommender-documentation.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/recommender/recommender-documentation.md b/website/docs/tutorials/map-reduce/recommender/recommender-documentation.md
new file mode 100644
index 0000000..8ba5b28
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/recommender/recommender-documentation.md
@@ -0,0 +1,277 @@
+---
+layout: default
+title: Recommender Documentation
+theme:
+    name: retro-mahout
+---
+
+<a name="RecommenderDocumentation-Overview"></a>
+## Overview
+
+_This documentation concerns the non-distributed, non-Hadoop-based
+recommender engine / collaborative filtering code inside Mahout. It was
+formerly a separate project called "Taste" and has continued development
+inside Mahout alongside other Hadoop-based code. It may be viewed as a
+somewhat separate, more comprehensive and more mature aspect of this
+code, compared to current development efforts focusing on Hadoop-based
+distributed recommenders. This remains the best entry point into Mahout
+recommender engines of all kinds._
+
+A Mahout-based collaborative filtering engine takes users' preferences for
+items ("tastes") and returns estimated preferences for other items. For
+example, a site that sells books or CDs could easily use Mahout to figure
+out, from past purchase data, which CDs a customer might be interested in
+listening to.
+
+Mahout provides a rich set of components from which you can construct a
+customized recommender system, choosing from a selection of algorithms. Mahout is
+designed to be enterprise-ready; it's designed for performance, scalability
+and flexibility.
+
+Top-level packages define the Mahout interfaces to these key abstractions:
+
+* **DataModel**
+* **UserSimilarity**
+* **ItemSimilarity**
+* **UserNeighborhood**
+* **Recommender**
+
+Subpackages of *org.apache.mahout.cf.taste.impl* hold implementations of
+these interfaces. These are the pieces from which you will build your own
+recommendation engine. That's it! 
+
+<a name="RecommenderDocumentation-Architecture"></a>
+## Architecture
+
+![doc](../../images/taste-architecture.png)
+
+This diagram shows the relationship between various Mahout components in a
+user-based recommender. An item-based recommender system is similar except
+that there are no Neighborhood algorithms involved.
+
+<a name="RecommenderDocumentation-Recommender"></a>
+### Recommender
+A Recommender is the core abstraction in Mahout. Given a DataModel, it can
+produce recommendations. Applications will most likely use the
+**GenericUserBasedRecommender** or **GenericItemBasedRecommender**,
+possibly decorated by **CachingRecommender**.
+
+<a name="RecommenderDocumentation-DataModel"></a>
+### DataModel
+A **DataModel** is the interface to information about user preferences. An
+implementation might draw this data from any source, but a database is the
+most likely source. Be sure to wrap this with a **ReloadFromJDBCDataModel** to get good performance!
Mahout provides **MySQLJDBCDataModel**, for example, to access preference data from a database
via JDBC and MySQL. Another exists for PostgreSQL. Mahout also provides a **FileDataModel**,
which is fine for small applications.
+
+Users and items are identified solely by an ID value in the
+framework. Further, this ID value must be numeric; it is a Java long type
+through the APIs. A **Preference** object or **PreferenceArray** object
+encapsulates the relation between user and preferred items (or items and
+users preferring them).
+
+Finally, Mahout supports, in various ways, a so-called "boolean" data model
+in which users do not express preferences of varying strengths for items,
+but simply express an association or none at all. For example, while users
+might express a preference from 1 to 5 in the context of a movie
+recommender site, there may be no notion of a preference value between
+users and pages in the context of recommending pages on a web site: there
+is only a notion of an association, or none, between a user and pages that
+have been visited.
+
+<a name="RecommenderDocumentation-UserSimilarity"></a>
+### UserSimilarity
+A **UserSimilarity** defines a notion of similarity between two users. This is
+a crucial part of a recommendation engine. These are attached to a
+**Neighborhood** implementation. **ItemSimilarity** is analogous, but finds
+similarity between items.
+
+<a name="RecommenderDocumentation-UserNeighborhood"></a>
+### UserNeighborhood
+In a user-based recommender, recommendations are produced by finding a
+"neighborhood" of similar users near a given user. A **UserNeighborhood**
+defines a means of determining that neighborhood &mdash; for example,
+nearest 10 users. Implementations typically need a **UserSimilarity** to
+operate.
+
+<a name="RecommenderDocumentation-Examples"></a>
+## Examples
+<a name="RecommenderDocumentation-User-basedRecommender"></a>
+### User-based Recommender
+User-based recommenders are the "original", conventional style of
+recommender systems. They can produce good recommendations when tweaked
+properly; they are not necessarily the fastest recommender systems and are
+thus best suited to smaller data sets (roughly, fewer than ten million ratings).
+We'll start with an example of this.
+
+First, create a **DataModel** of some kind. Here, we'll use a simple one based
+on data in a file. The file should be in CSV format, with lines of the form
+"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
+
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+
+We'll use the **PearsonCorrelationSimilarity** implementation of **UserSimilarity**
+as our user correlation algorithm:
+
+
+    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
+
+
+Now we create a **UserNeighborhood** algorithm. Here we use nearest-3:
+
+
+    UserNeighborhood neighborhood =
+    	  new NearestNUserNeighborhood(3, userSimilarity, model);
+    
+Now we can create our **Recommender**, and add a caching decorator:
+    
+
+    Recommender recommender =
+	  new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+
+    
+Now we can get 10 recommendations for user ID "1234" &mdash; done!
+
+    List<RecommendedItem> recommendations =
+	  cachingRecommender.recommend(1234, 10);
+
+    
+## Item-based Recommender
+    
+We could have created an item-based recommender instead. Item-based
+recommenders base recommendation not on user similarity, but on item
+similarity. In theory these are about the same approach to the problem,
+just from different angles. However the similarity of two items is
+relatively fixed, more so than the similarity of two users. So, item-based
+recommenders can use pre-computed similarity values in the computations,
+which make them much faster. For large data sets, item-based recommenders
+are more appropriate.
+    
+Let's start over, again with a **FileDataModel** to start:
+    
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+    
+We'll also need an **ItemSimilarity**. We could use
+**PearsonCorrelationSimilarity**, which computes item similarity in real time,
+but this is generally too slow to be useful. Instead, in a real
+application, you would feed a list of pre-computed correlations to a
+**GenericItemSimilarity**: 
+    
+
+    // Construct the list of pre-computed correlations
+    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
+	  ...;
+    ItemSimilarity itemSimilarity =
+	  new GenericItemSimilarity(correlations);
+
+
+    
+Then we can finish as before to produce recommendations:
+    
+
+    Recommender recommender =
+	  new GenericItemBasedRecommender(model, itemSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+    ...
+    List<RecommendedItem> recommendations =
+	  cachingRecommender.recommend(1234, 10);
+
+
+<a name="RecommenderDocumentation-Integrationwithyourapplication"></a>
+## Integration with your application
+
+You can create a Recommender, as shown above, wherever you like in your
+Java application, and use it. This includes simple Java applications or GUI
+applications, server applications, and J2EE web applications.
+
+<a name="RecommenderDocumentation-Performance"></a>
+## Performance
+<a name="RecommenderDocumentation-RuntimePerformance"></a>
+### Runtime Performance
+The more data you give, the better. Though Mahout is designed for
+performance, you will undoubtedly run into performance issues at some
+point. For best results, consider using the following command-line flags to
+your JVM:
+
+* -server: Enables the server VM, which is generally appropriate for
+long-running, computation-intensive applications.
+* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte
+doesn't hurt when dealing with tens of millions of preferences. Mahout
+recommenders will generally use as much memory as you give them for caching,
+which helps performance. Set the initial and max size to the same value to
+avoid wasting time growing the heap, and to avoid having the JVM run minor
+collections to avoid growing the heap, which will clear cached values.
+* -da -dsa: Disable all assertions.
+* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most
+of them in this framework
+* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only):
+Use a GC algorithm designed to take advantage of multiple processors, and
+designed for throughput. This is a default in J2SE 5.0.
+* -XX:-DisableExplicitGC: Disable calls to System.gc(). These calls can
+only hurt in the presence of modern GC algorithms; they may force Mahout to
+remove cached data needlessly. This flag isn't needed if you're sure your
+code and third-party code you use doesn't call this method.
+
+Also consider the following tips:
+
+* Use **CachingRecommender** on top of your custom **Recommender** implementation.
+* When using **JDBCDataModel**, make sure you wrap it with a **ReloadFromJDBCDataModel** to load data into memory! A minimal sketch follows this list.
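+
+Here is a minimal sketch of that combination, assuming a MySQL preference table; the DataSource setup and the table/column names are placeholders you would adapt to your schema:
+
+    DataSource dataSource = ...; // e.g. a MysqlDataSource pointing at your database
+    JDBCDataModel jdbcModel = new MySQLJDBCDataModel(
+        dataSource, "taste_preferences", "user_id", "item_id", "preference", "timestamp");
+    DataModel model = new ReloadFromJDBCDataModel(jdbcModel); // caches preferences in memory
+    Recommender recommender = new CachingRecommender(
+        new GenericItemBasedRecommender(model, itemSimilarity)); // itemSimilarity as built earlier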
+
+<a name="RecommenderDocumentation-AlgorithmPerformance:WhichOneIsBest?"></a>
+### Algorithm Performance: Which One Is Best?
+There is no right answer; it depends on your data, your application,
+environment, and performance needs. Mahout provides the building blocks
+from which you can construct the best Recommender for your application. The
+links below provide research on this topic. You will probably need a bit of
+trial-and-error to find a setup that works best. The code sample above
+provides a good starting point.
+
+Fortunately, Mahout provides a way to evaluate the accuracy of your
+Recommender on your own data, in *org.apache.mahout.cf.taste.eval*:
+
+
+    DataModel myModel = ...;
+    RecommenderBuilder builder = new RecommenderBuilder() {
+      public Recommender buildRecommender(DataModel model) {
+        // build and return the Recommender to evaluate here
+      }
+    };
+    RecommenderEvaluator evaluator =
+    	  new AverageAbsoluteDifferenceRecommenderEvaluator();
+    double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
+
+
+For "boolean" data model situations, where there are no notions of
+preference value, the above evaluation based on estimated preference does
+not make sense. In this case, try a *RecommenderIRStatsEvaluator*, which presents
+traditional information retrieval figures like precision and recall, which
+are more meaningful.
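+
+A minimal sketch of such an evaluation (the *at* value of 10 and the threshold choice here are only illustrative):
+
+    DataModel myModel = ...;
+    RecommenderBuilder builder = ...; // as above
+    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
+    IRStatistics stats = evaluator.evaluate(builder, null, myModel, null, 10,
+        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
+    System.out.println(stats.getPrecision() + " " + stats.getRecall());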
+
+
+<a name="RecommenderDocumentation-UsefulLinks"></a>
+## Useful Links
+
+
+Here's a handful of research papers that I've read and found particularly
+useful:
+
+J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive Algorithms for
Collaborative Filtering](http://research.microsoft.com/research/pubs/view.aspx?tr_id=166)
+," in Proceedings of the Fourteenth Conference on Uncertainity in
+Artificial Intelligence (UAI 1998), 1998.
+
+B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative filtering recommendation
algorithms](http://www10.org/cdrom/papers/519/)
+" in Proceedings of the Tenth International Conference on the World Wide
+Web (WWW 10), pp. 285-295, 2001.
+
+P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an open architecture
for collaborative filtering of netnews](http://doi.acm.org/10.1145/192844.192905)
+" in Proceedings of the 1994 ACM conference on Computer Supported
+Cooperative Work (CSCW 1994), pp. 175-186, 1994.
+
+J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic framework for performing
collaborative filtering](http://www.grouplens.org/papers/pdf/algs.pdf)
+" in Proceedings of the 22nd annual international ACM SIGIR Conference on
+Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
+1999.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/recommender-first-timer-faq.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/recommender/recommender-first-timer-faq.md
b/website/docs/tutorials/map-reduce/recommender/recommender-first-timer-faq.md
new file mode 100644
index 0000000..2b090e6
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/recommender/recommender-first-timer-faq.md
@@ -0,0 +1,54 @@
+---
+layout: default
+title: Recommender First-Timer FAQ
+theme:
+    name: retro-mahout
+---
+
+# Recommender First Timer Dos and Don'ts
+
+Many people with an interest in recommenders arrive at Mahout since they're
+building a first recommender system. Some starting questions have been
+asked enough times to warrant a FAQ collecting advice and rules-of-thumb to
+newcomers.
+
+For the interested, these topics are treated in detail in the book [Mahout in Action](http://manning.com/owen/).
+
+Don't start with a distributed, Hadoop-based recommender; take on that
+complexity only if necessary. Start with non-distributed recommenders. They
+are simpler, have fewer requirements, and are more flexible.
+
+As a crude rule of thumb, a system with up to 100M user-item associations
+(ratings, preferences) should "fit" onto one modern server machine with 4GB
+of heap available and run acceptably as a real-time recommender. The system
+is invariably memory-bound since keeping data in memory is essential to
+performance.
+
+Beyond this point it gets expensive to deploy a machine with enough RAM,
+so designing for a distributed solution makes sense when nearing this scale.
+However, most applications don't "really" have 100M associations to process.
+Data can be sampled; noisy and old data can often be aggressively pruned
+without significant impact on the result.
+
+The next question is whether or not your system has preference values, or
+ratings. Do users and items merely have an association or not, such as the
+existence or lack of a click? Or is behavior translated into some scalar
+value representing the user's degree of preference for the item?
+
+If you have ratings, then a good place to start is a
+GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
+metric. If you don't have ratings, then a good place to start is
+GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
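+
+A minimal sketch of the no-ratings case (the file name is a placeholder; with boolean data the CSV lines can simply omit the preference value):
+
+    DataModel model = new FileDataModel(new File("interactions.csv"));
+    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
+    Recommender recommender =
+        new GenericBooleanPrefItemBasedRecommender(model, similarity);
+    List<RecommendedItem> recommendations = recommender.recommend(1234, 10);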
+
+If you want to do content-based item-item similarity, you need to implement
+your own ItemSimilarity.
+
+If your data can be simply exported to a CSV file, use FileDataModel and
+push new files periodically.
+If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
+counterpart if appropriate, or its PostgreSQL counterpart, etc.) and put a
+ReloadFromJDBCDataModel on top.
+
+This should give a reasonable starter system which responds fast. The
+nature of the system is that new data comes in from the file or database
+only periodically -- perhaps on the order of minutes. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/userbased-5-minutes.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/recommender/userbased-5-minutes.md b/website/docs/tutorials/map-reduce/recommender/userbased-5-minutes.md
new file mode 100644
index 0000000..da17b38
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/recommender/userbased-5-minutes.md
@@ -0,0 +1,133 @@
+---
+layout: default
+title: User Based Recommender in 5 Minutes
+theme:
+    name: retro-mahout
+---
+
+# Creating a User-Based Recommender in 5 minutes
+
+## Prerequisites
+
+Create a Java project in your favorite IDE and make sure Mahout is on the classpath. The
easiest way to accomplish this is by importing it via maven as described on the [Quickstart](/users/basics/quickstart.html)
page.
+
+
+## Dataset
+
+Mahout's recommenders expect interactions between users and items as input. The easiest way
to supply such data to Mahout is in the form of a textfile, where every line has the format
*userID,itemID,value*. Here *userID* and *itemID* refer to a particular user and a particular
item, and *value* denotes the strength of the interaction (e.g. the rating given to a movie).
+
+In this example, we'll use some made up data for simplicity. Create a file called "dataset.csv"
and copy the following example interactions into the file. 
+
+<pre>
+1,10,1.0
+1,11,2.0
+1,12,5.0
+1,13,5.0
+1,14,5.0
+1,15,4.0
+1,16,5.0
+1,17,1.0
+1,18,5.0
+2,10,1.0
+2,11,2.0
+2,15,5.0
+2,16,4.5
+2,17,1.0
+2,18,5.0
+3,11,2.5
+3,12,4.5
+3,13,4.0
+3,14,3.0
+3,15,3.5
+3,16,4.5
+3,17,4.0
+3,18,5.0
+4,10,5.0
+4,11,5.0
+4,12,5.0
+4,13,0.0
+4,14,2.0
+4,15,3.0
+4,16,1.0
+4,17,4.0
+4,18,1.0
+</pre>
+
+## Creating a user-based recommender
+
+Create a class called *SampleRecommender* with a main method.
+
+The first thing we have to do is load the data from the file. Mahout's recommenders use an
interface called *DataModel* to handle interaction data. You can load our made up interactions
like this:
+
+<pre>
+DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
+</pre>
+
+In this example, we want to create a user-based recommender. The idea behind this approach
is that when we want to compute recommendations for a particular user, we look for other
users with a similar taste and pick the recommendations from their items. For finding similar
users, we have to compare their interactions. There are several methods for doing this. One
popular method is to compute the [correlation coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)
between their interactions. In Mahout, you use this method as follows:
+
+<pre>
+UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
+</pre>
+
+The next thing we have to do is to define which similar users we want to leverage for the
recommender. For the sake of simplicity, we'll use all that have a similarity greater than
*0.1*. This is implemented via a *ThresholdUserNeighborhood*:
+
+<pre>
+UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
+</pre>
+
+Now we have all the pieces to create our recommender:
+
+<pre>
+UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
+</pre>
+        
+We can easily ask the recommender for recommendations now. If we wanted to get three items
recommended for the user with *userID* 2, we would do it like this:
+	
+
+<pre>
+List<RecommendedItem> recommendations = recommender.recommend(2, 3);
+for (RecommendedItem recommendation : recommendations) {
+  System.out.println(recommendation);
+}
+</pre>
+
+
+Congratulations, you have built your first recommender!
+
+
+## Evaluation
+
+You might ask yourself how to make sure that your recommender returns good results. Unfortunately,
the only way to be really sure about the quality is by doing an A/B test with real users in
a live system.
+
+We can, however, try to get a feel for the quality by statistical offline evaluation. Just
keep in mind that this does not replace a test with real users!
+
+One way to check whether the recommender returns good results is by doing a **hold-out**
test. We partition our dataset into two sets: a training set consisting of 90% of the data
and a test set consisting of 10%. Then we train our recommender using the training set and
see how well it predicts the unknown interactions in the test set.
+
+To test our recommender, we create a class called *EvaluateRecommender* with a main method
and add an inner class called *MyRecommenderBuilder* that implements the *RecommenderBuilder*
interface. We implement the *buildRecommender* method and make it setup our user-based recommender:
+
+<pre>
+UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
+UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, dataModel);
+return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
+</pre>
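+
+Putting it together, the builder class might look roughly like this (a minimal sketch; only the Mahout class names are given, everything else is up to you):
+
+<pre>
+class MyRecommenderBuilder implements RecommenderBuilder {
+  public Recommender buildRecommender(DataModel dataModel) throws TasteException {
+    UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
+    UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, dataModel);
+    return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
+  }
+}
+</pre>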
+
+Now we have to create the code for the test. We'll check how much the recommender misses
the real interaction strength on average. We employ an *AverageAbsoluteDifferenceRecommenderEvaluator*
for this. The following code shows how to put the pieces together and run a hold-out test:

+
+<pre>
+DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
+RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
+RecommenderBuilder builder = new MyRecommenderBuilder();
+double result = evaluator.evaluate(builder, null, model, 0.9, 1.0);
+System.out.println(result);
+</pre>
+
+Note: if you run this test multiple times, you will get different results, because the splitting
into training set and test set is done randomly.
+
+
+
+
+
+
+
+
+
+
+

