mahout-commits mailing list archives

Subject svn commit: r1360593 [15/17] - in /mahout/site/trunk: ./ cgi-bin/ content/ content/attachments/ content/attachments/101992/ content/attachments/116559/ content/attachments/22872433/ content/attachments/22872443/ content/attachments/23335706/ content/at...
Date Thu, 12 Jul 2012 09:26:03 GMT
Added: mahout/site/trunk/content/quickstart.mdtext
--- mahout/site/trunk/content/quickstart.mdtext (added)
+++ mahout/site/trunk/content/quickstart.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,63 @@
+Title: Quickstart
+<a name="Quickstart-DownloadandInstallation"></a>
+# Download and Installation
+* [Download and installation](buildingmahout.html)
+<a name="Quickstart-RunningtheExamples"></a>
+# Running the Examples
+Mahout runs on [Apache Hadoop](). To run these examples, install the
+latest compatible [^1] [^2] [Hadoop Common release]().
+[^1]: When using a release, see the release notes
+[^2]: When using trunk, see <i>parent/pom.xml</i>
+<a name="Quickstart-Clustering"></a>
+## Clustering
+* [Clustering of synthetic control data](clustering-of-synthetic-control-data.html)
+* [Visualizing Sample Clusters](visualizing-sample-clusters.html)
+* [Clustering Seinfeld Episodes](clustering-seinfeld-episodes.html)
+<a name="Quickstart-Classification"></a>
+## Classification
+* [Twenty Newsgroups](twenty-newsgroups.html)
+* [Wikipedia Bayes Example](wikipedia-bayes-example.html)
+<a name="Quickstart-GeneticProgramming"></a>
+## Genetic Programming
+* [Watchmaker](
+<a name="Quickstart-DecisionForest"></a>
+## Decision Forest
+* [Breiman Example](breiman-example.html)
+* [Partial Implementation](partial-implementation.html)
+<a name="Quickstart-Recommendationmining"></a>
+## Recommendation mining
+This package comes with four examples, based on the Netflix,
+BookCrossing, GroupLens and Jester data sets.
+* [RecommendationExamples](recommendationexamples.html)
+<a name="Quickstart-WheretoGoNext"></a>
+# Where to Go Next
+* If you are working with text, read more on [Creating Vectors from Text](creating-vectors-from-text.html)
+* To learn about grouping items by similarity and identifying clusters,
+read more on [Clustering your data](clusteringyourdata.html)
+* If you want to classify incoming items into predefined categories, read
+more on [Classifying your data](classifyingyourdata.html)
+* To learn how to mine frequent patterns from your data, read more on [Parallel Frequent Pattern Mining](parallel-frequent-pattern-mining.html)
+* To read more on building recommendation engines, have a look at the
+[Taste documentation]() and [TasteCommandLine]().
+Go back to the [Main Page](mahout-wiki.html) for more information.

Added: mahout/site/trunk/content/random-forests.mdtext
--- mahout/site/trunk/content/random-forests.mdtext (added)
+++ mahout/site/trunk/content/random-forests.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,228 @@
+Title: Random Forests
+<a name="RandomForests-HowtogrowaDecisionTree"></a>
+### How to grow a Decision Tree
+source: [3](3.html)
+
+Input: *X*, a matrix of *R* rows and *M* columns, where *X_ij* = the value
+of the *j*'th attribute in the *i*'th input datapoint. Each column consists
+of either all real values or all categorical values.
+Input: *Y*, a vector of *R* elements, where *Y_i* = the output class of the
+*i*'th datapoint. The *Y_i* values are categorical.
+Output: an unpruned decision tree
+
+If all records in *X* have identical values in all their attributes (this
+includes the case where *R < 2*), return a Leaf Node predicting the
+majority output, breaking ties randomly.
+If all values in *Y* are the same, return a Leaf Node predicting this
+value as the output.
+Otherwise:
+
+    Select m variables at random out of the M variables
+    For j = 1 .. m
+        If the j'th attribute is categorical
+            IG_j = IG(Y|X_j) (see Information Gain)
+        Else (the j'th attribute is real-valued)
+            IG_j = IG*(Y|X_j) (see Information Gain)
+    Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
+    If j* is categorical then
+        For each value v of the j*'th attribute
+            Let X^v = the subset of rows of X in which X_ij* = v
+            Let Y^v = the corresponding subset of Y
+            Let Child^v = the tree grown recursively from (X^v, Y^v)
+        Return a decision tree node, splitting on the j*'th attribute.
+        The number of children equals the number of values of the j*'th
+        attribute, and the v'th child is Child^v
+    Else (j* is real-valued); let t be the best split threshold
+        Let X^LO = the subset of rows of X in which X_ij* <= t
+        Let Y^LO = the corresponding subset of Y
+        Let Child^LO = the tree grown recursively from (X^LO, Y^LO)
+        Let X^HI = the subset of rows of X in which X_ij* > t
+        Let Y^HI = the corresponding subset of Y
+        Let Child^HI = the tree grown recursively from (X^HI, Y^HI)
+        Return a decision tree node, splitting on the j*'th attribute.
+        It has two children, corresponding to whether the j*'th attribute
+        is above or below the threshold t
+*Note*: There are alternatives to Information Gain for splitting nodes
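To make the per-node selection step concrete, here is a small illustrative Java sketch (not Mahout's implementation; the class and method names here are invented for this example). It samples *m* of the *M* attribute indices and returns the index with the highest score according to a caller-supplied information-gain function:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.IntToDoubleFunction;

public class SplitChooser {
    // Per-node step from the pseudocode above: sample m distinct attribute
    // indices out of {0, ..., M-1}, score each with a caller-supplied
    // information-gain function, and return the best-scoring index j*.
    static int chooseSplitAttribute(int M, int m, IntToDoubleFunction ig, Random rng) {
        List<Integer> all = new ArrayList<>();
        for (int j = 0; j < M; j++) all.add(j);
        Collections.shuffle(all, rng);       // the first m entries are the random sample
        int best = all.get(0);
        double bestGain = ig.applyAsDouble(best);
        for (int k = 1; k < m; k++) {
            int j = all.get(k);
            double g = ig.applyAsDouble(j);
            if (g > bestGain) { bestGain = g; best = j; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy scores: attribute 2 has the highest gain, so with m = M it wins.
        double[] gains = {0.1, 0.3, 0.9, 0.2};
        int jStar = chooseSplitAttribute(4, 4, j -> gains[j], new Random(42));
        System.out.println(jStar); // 2
    }
}
```

In a real random forest the scoring function would compute IG over the rows that reach the node, as defined in the Information Gain section.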
+<a name="RandomForests-Informationgain"></a>
+### Information gain
+source: [3](3.html)
+
+#### Nominal attributes
+Suppose *X* can take one of *m* values *V_1, V_2, ..., V_m*, with
+P(X=V_1) = p_1, P(X=V_2) = p_2, ..., P(X=V_m) = p_m.
+
+    H(X) = -sum_{j=1..m} p_j log2 p_j    (the entropy of X)
+    H(Y|X=v) = the entropy of Y among only those records in which X has value v
+    H(Y|X) = sum_j p_j H(Y|X=v_j)
+    IG(Y|X) = H(Y) - H(Y|X)
+
+#### Real-valued attributes
+Suppose *X* is real-valued.
+
+    Define IG(Y|X:t) as H(Y) - H(Y|X:t)
+    Define H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
+    Define IG*(Y|X) = max_t IG(Y|X:t)
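The entropy and information-gain formulas above can be sketched in plain Java for nominal attributes (an illustrative, self-contained example; `InfoGain` is not a Mahout class):

```java
import java.util.HashMap;
import java.util.Map;

public class InfoGain {
    // Entropy H (in bits) of a discrete distribution given by symbol counts.
    static double entropy(Map<String, Integer> counts) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return h;
    }

    // IG(Y|X) = H(Y) - sum_j P(X=v_j) H(Y|X=v_j), for a categorical X.
    static double informationGain(String[] x, String[] y) {
        Map<String, Integer> yCounts = new HashMap<>();
        Map<String, Map<String, Integer>> byValue = new HashMap<>();
        for (int i = 0; i < x.length; i++) {
            yCounts.merge(y[i], 1, Integer::sum);
            byValue.computeIfAbsent(x[i], k -> new HashMap<>())
                   .merge(y[i], 1, Integer::sum);
        }
        double hY = entropy(yCounts);
        double hYGivenX = 0.0;
        for (Map<String, Integer> sub : byValue.values()) {
            double pv = sub.values().stream().mapToInt(Integer::intValue).sum()
                        / (double) x.length;
            hYGivenX += pv * entropy(sub); // weighted conditional entropy
        }
        return hY - hYGivenX;
    }

    public static void main(String[] args) {
        // X perfectly predicts Y here, so IG equals H(Y) = 1 bit.
        String[] x = {"a", "a", "b", "b"};
        String[] y = {"yes", "yes", "no", "no"};
        System.out.println(informationGain(x, y)); // 1.0
    }
}
```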
+<a name="RandomForests-HowtogrowaRandomForest"></a>
+### How to grow a Random Forest
+source: [1](1.html)
+Each tree is grown as follows:
+1. If the number of cases in the training set is *N*, sample *N* cases at
+random, but *with replacement*, from the original data. This sample will be
+the training set for growing the tree.
+1. If there are *M* input variables, a number *m << M* is specified such
+that at each node, *m* variables are selected at random out of the *M* and
+the best split on these *m* is used to split the node. The value of *m* is
+held constant while the forest is grown.
+1. Each tree is grown to the largest extent possible. There is no pruning.
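Step 1 is ordinary bootstrap sampling. A minimal sketch (illustrative only; the `Bootstrap` class is invented for this example) also shows why roughly a third of the cases end up "out-of-bag" for any given tree:

```java
import java.util.Random;

public class Bootstrap {
    // Draw N case indices at random *with replacement* from the original N
    // training cases; indices never drawn are "out-of-bag" for this tree.
    static int[] bootstrapSample(int N, Random rng) {
        int[] sample = new int[N];
        for (int i = 0; i < N; i++) sample[i] = rng.nextInt(N);
        return sample;
    }

    public static void main(String[] args) {
        int N = 10000;
        int[] sample = bootstrapSample(N, new Random(0));
        boolean[] inBag = new boolean[N];
        for (int i : sample) inBag[i] = true;
        int oob = 0;
        for (boolean b : inBag) if (!b) oob++;
        // Roughly (1 - 1/N)^N ≈ e^-1 ≈ 36.8% of cases are out-of-bag.
        System.out.printf("out-of-bag fraction: %.3f%n", oob / (double) N);
    }
}
```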
+<a name="RandomForests-RandomForestparameters"></a>
+### Random Forest parameters
+source: [2](2.html)
+Random Forests are easy to use; the only two parameters a user of the
+technique has to determine are the number of trees to be used and the
+number of variables (*m*) to be randomly selected from the available set of
+variables.
+Breiman's recommendation is to pick a large number of trees, as well as
+the square root of the number of variables for *m*.
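The square-root rule of thumb is trivial to compute; a hypothetical helper (the `ForestParams` name is invented for this sketch):

```java
public class ForestParams {
    // Breiman's rule of thumb for classification: try
    // m = floor(sqrt(M)) randomly selected variables at each node.
    static int defaultM(int numVariables) {
        return Math.max(1, (int) Math.floor(Math.sqrt(numVariables)));
    }

    public static void main(String[] args) {
        System.out.println(defaultM(100)); // 10
    }
}
```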
+<a name="RandomForests-Howtopredictthelabelofacase"></a>
+### How to predict the label of a case
+    Input: node, a node of the decision tree; if node.attribute = j then
+           the split is done on the j'th attribute
+    Input: V, a vector of M columns, where V_j = the value of the j'th attribute
+    Output: the label of V
+
+    If node is a Leaf then
+        Return the value predicted by node
+    Else
+        Let j = node.attribute
+        If j is categorical then
+            Let v = V_j
+            Let child^v = the child node corresponding to the attribute's value v
+            Return Classify(child^v, V)
+        Else (j is real-valued)
+            Let t = node.threshold (the split threshold)
+            If V_j < t then
+                Let child^LO = the child node corresponding to (< t)
+                Return Classify(child^LO, V)
+            Else
+                Let child^HI = the child node corresponding to (>= t)
+                Return Classify(child^HI, V)
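The traversal above can be sketched directly in Java. This is a minimal, hypothetical node structure for illustration only; Mahout's own decision-forest classes differ:

```java
import java.util.Map;

public class TreeClassifier {
    // Minimal node mirroring the pseudocode above (invented for this sketch).
    static class Node {
        String label;              // non-null for leaves
        int attribute = -1;        // index j of the split attribute
        boolean categorical;
        double threshold;          // used when the split is numeric
        Map<String, Node> byValue; // children for a categorical split
        Node lo, hi;               // children for a numeric split
    }

    // Recursive descent: follow the branch matching V's attribute value
    // until a leaf is reached, then return that leaf's prediction.
    static String classify(Node node, Object[] v) {
        if (node.label != null) return node.label; // leaf
        if (node.categorical) {
            Node child = node.byValue.get((String) v[node.attribute]);
            return classify(child, v);
        }
        double x = ((Number) v[node.attribute]).doubleValue();
        return classify(x < node.threshold ? node.lo : node.hi, v);
    }

    public static void main(String[] args) {
        // One numeric split on attribute 0 at t = 5.0.
        Node lo = new Node(); lo.label = "low";
        Node hi = new Node(); hi.label = "high";
        Node root = new Node();
        root.attribute = 0; root.threshold = 5.0; root.lo = lo; root.hi = hi;
        System.out.println(classify(root, new Object[]{3.0})); // low
        System.out.println(classify(root, new Object[]{7.0})); // high
    }
}
```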
+<a name="RandomForests-Theoutofbag(oob)errorestimation"></a>
+### The out of bag (oob) error estimation
+source: [1](1.html)
+In random forests, there is no need for cross-validation or a separate test
+set to get an unbiased estimate of the test set error. It is estimated
+internally, during the run, as follows:
+* Each tree is constructed using a different bootstrap sample from the
+original data. About one-third of the cases are left out of the bootstrap
+sample and not used in the construction of the _kth_ tree.
+* Put each case left out of the construction of the _kth_ tree down the
+_kth_ tree to get a classification. In this way, a test set classification
+is obtained for each case in about one-third of the trees. At the end of
+the run, take *j* to be the class that got most of the votes every time
+case *n* was _oob_. The proportion of times that *j* is not equal to the
+true class of *n*, averaged over all cases, is the _oob error estimate_.
+This has proven to be unbiased in many tests.
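The oob bookkeeping can be sketched as follows (illustrative only, with invented names; classes are encoded as small ints). For each case, only trees whose bootstrap sample left that case out are allowed to vote:

```java
public class OobError {
    // predictions[k][n] = tree k's predicted class for case n;
    // inBag[k][n]       = whether case n was in tree k's bootstrap sample;
    // truth[n]          = the true class of case n.
    static double oobErrorEstimate(int[][] predictions, boolean[][] inBag,
                                   int[] truth, int numClasses) {
        int wrong = 0, counted = 0;
        for (int n = 0; n < truth.length; n++) {
            int[] votes = new int[numClasses];
            int totalVotes = 0;
            for (int k = 0; k < predictions.length; k++) {
                if (!inBag[k][n]) {        // only oob trees may vote on case n
                    votes[predictions[k][n]]++;
                    totalVotes++;
                }
            }
            if (totalVotes == 0) continue; // case was never out-of-bag
            int best = 0;
            for (int c = 1; c < numClasses; c++) {
                if (votes[c] > votes[best]) best = c;
            }
            counted++;
            if (best != truth[n]) wrong++;
        }
        return counted == 0 ? 0.0 : wrong / (double) counted;
    }

    public static void main(String[] args) {
        // 3 trees, 2 cases: each case is classified correctly by its oob trees.
        int[][] preds = {{0, 1}, {0, 1}, {0, 0}};
        boolean[][] inBag = {{true, false}, {true, true}, {false, true}};
        System.out.println(oobErrorEstimate(preds, inBag, new int[]{0, 1}, 2)); // 0.0
    }
}
```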
+<a name="RandomForests-OtherRFuses"></a>
+### Other RF uses
+source: [1](1.html)
+* variable importance
+* gini importance
+* proximities
+* scaling
+* prototypes
+* missing values replacement for the training set
+* missing values replacement for the test set
+* detecting mislabeled cases
+* detecting outliers
+* detecting novelties
+* unsupervised learning
+* balancing prediction error
+Please refer to [1](1.html) for a detailed description.
+<a name="RandomForests-References"></a>
+### References
+1. Random Forests - Classification Description
+   [](
+2. B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention
+and Profitability by Using Random Forests and Regression Forests
+Techniques," Working Papers of Faculty of Economics and Business
+Administration, Ghent University, Belgium 04/282, Ghent University,
+Faculty of Economics and Business Administration. Available online: [](
+3. Decision Trees - Andrew W. Moore
+4. Information Gain - Andrew W. Moore
+   [](

Added: mahout/site/trunk/content/recommendationexamples.mdtext
--- mahout/site/trunk/content/recommendationexamples.mdtext (added)
+++ mahout/site/trunk/content/recommendationexamples.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,47 @@
+Title: RecommendationExamples
+<a name="RecommendationExamples-Introduction"></a>
+# Introduction 
+This quick start page describes how to run the recommendation examples
+provided by Mahout. Mahout comes with four recommendation mining examples,
+based on the Netflix, Jester, GroupLens and BookCrossing data sets.
+<a name="RecommendationExamples-Steps"></a>
+# Steps 
+<a name="RecommendationExamples-Testingitononesinglemachine"></a>
+## Testing it on one single machine 
+In the examples directory type: 
+    mvn -q exec:java
+    mvn -q exec:java
+    mvn -q exec:java
+    mvn -q exec:java
+    mvn -q exec:java
+Here, the command line options need only be:
+    -i [input file]
+Note that the GroupLens example is designed for the "1 million" data set,
+available at . The "input file" above is the ratings.dat contained in the
+zip file from the data set. This file has an unusual format and so has a
+special parser. The example code here can easily be modified to use a
+regular FileDataModel and thus work on more standard input, including the
+other data sets available at this site.

Added: mahout/site/trunk/content/recommender-documentation.mdtext
--- mahout/site/trunk/content/recommender-documentation.mdtext (added)
+++ mahout/site/trunk/content/recommender-documentation.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,394 @@
+Title: Recommender Documentation
+<a name="RecommenderDocumentation-Overview"></a>
+## Overview
+_This documentation concerns the non-distributed, non-Hadoop-based
+recommender engine / collaborative filtering code inside Mahout. It was
+formerly a separate project called "Taste" and has continued development
+inside Mahout alongside other Hadoop-based code. It may be viewed as a
+somewhat separate, older, more comprehensive and more mature aspect of this
+code, compared to current development efforts focusing on Hadoop-based
+distributed recommenders. This remains the best entry point into Mahout
+recommender engines of all kinds._
+A Mahout-based collaborative filtering engine takes users' preferences for
+items ("tastes") and returns estimated preferences for other items. For
+example, a site that sells books or CDs could easily use Mahout to figure
+out, from past purchase data, which CDs a customer might be interested in
+listening to.
+Mahout provides a rich set of components from which you can construct a
+customized recommender system from a selection of algorithms. Mahout is
+designed to be enterprise-ready; it's designed for performance, scalability
+and flexibility.
+Mahout recommenders are not just for Java; they can be run as an external
+server which exposes recommendation logic to your application via web
+services and HTTP.
+Top-level packages define the Mahout interfaces to these key abstractions:
+* DataModel
+* UserSimilarity
+* ItemSimilarity
+* UserNeighborhood
+* Recommender
+Subpackages hold implementations of
+these interfaces. These are the pieces from which you will build your own
+recommendation engine. That's it! For the academically inclined, Mahout
+supports *memory-based* and *item-based* recommender systems, *slope one*
+recommenders, and a couple of other experimental implementations. It does
+not currently support *model-based* recommenders.
+<a name="RecommenderDocumentation-Architecture"></a>
+## Architecture
+This diagram shows the relationship between various Mahout components in a
+user-based recommender. An item-based recommender system is similar except
+that there are no PreferenceInferrers or Neighborhood algorithms involved.
+<a name="RecommenderDocumentation-Recommender"></a>
+### Recommender
+A Recommender is the core abstraction in Mahout. Given a DataModel, it can
+produce recommendations. Applications will most likely use the
+GenericUserBasedRecommender or GenericItemBasedRecommender implementations,
+possibly decorated by CachingRecommender.
+<a name="RecommenderDocumentation-DataModel"></a>
+### DataModel
+A DataModel is the interface to information about user preferences. An
+implementation might draw this data from any source, but a database is the
+most likely source. Mahout provides MySQLJDBCDataModel, for example, to
+access preference data from a database via JDBC and MySQL. Another exists
+for PostgreSQL. Mahout also provides a FileDataModel.
+There are no abstractions for a user or item in the object model (not
+anymore). Users and items are identified solely by an ID value in the
+framework. Further, this ID value must be numeric; it is a Java long type
+through the APIs. A Preference object or PreferenceArray object
+encapsulates the relation between user and preferred items (or items and
+users preferring them).
+Finally, Mahout supports, in various ways, a so-called "boolean" data model
+in which users do not express preferences of varying strengths for items,
+but simply express an association or none at all. For example, while users
+might express a preference from 1 to 5 in the context of a movie
+recommender site, there may be no notion of a preference value between
+users and pages in the context of recommending pages on a web site: there
+is only a notion of an association, or none, between a user and pages that
+have been visited.
+<a name="RecommenderDocumentation-UserSimilarity"></a>
+### UserSimilarity
+A UserSimilarity defines a notion of similarity between two Users. This is
+a crucial part of a recommendation engine. These are attached to a
+Neighborhood implementation. ItemSimilarities are analogous, but find
+similarity between Items.
+<a name="RecommenderDocumentation-UserNeighborhood"></a>
+### UserNeighborhood
+In a user-based recommender, recommendations are produced by finding a
+"neighborhood" of similar users near a given user. A UserNeighborhood
+defines a means of determining that neighborhood &mdash; for example,
+nearest 10 users. Implementations typically need a UserSimilarity to
+operate.
+<a name="RecommenderDocumentation-Requirements"></a>
+## Requirements
+<a name="RecommenderDocumentation-Required"></a>
+### Required
+* [Java/ J2SE 6.0](
+<a name="RecommenderDocumentation-Optional"></a>
+### Optional
+* [Apache Maven](
+  2.2.1 or later, if you want to build from source or build examples. (Mac
+users note that even OS X 10.5 ships with Maven 2.0.6, which will not
+work.)
+* Mahout web applications require a [Servlet 2.3+](
+ container, such as [Apache Tomcat]()
+. It may in fact work with older containers with slight modification.
+<a name="RecommenderDocumentation-Demo"></a>
+## Demo
+To build and run the demo, follow the instructions below, which are written
+for Unix-like operating systems:
+* Obtain a copy of the Mahout distribution, either from SVN or as a
+downloaded archive.
+* Download the "1 Million MovieLens Dataset" from [](
+* Unpack the archive and copy movies.dat and ratings.dat to
+under the Mahout distribution directory.
+* Navigate to the directory where you unpacked the Mahout distribution, and
+navigate to trunk.
+* Run mvn -DskipTests install, which builds and installs Mahout core to
+your local repository
+* cd integration
+* You may need to give Maven more memory: in a bash shell, export
+* mvn jetty:run.
+* Get recommendations by accessing the web application in your browser:
+http://localhost:8080/mahout-integration/RecommenderServlet?userID=1 This
+will produce a simple preference-item ID list which could be consumed by a
+client application. Get more useful human-readable output with the debug
+parameter.
+<a name="RecommenderDocumentation-Examples"></a>
+## Examples
+<a name="RecommenderDocumentation-User-basedRecommender"></a>
+### User-based Recommender
+User-based recommenders are the "original", conventional style of
+recommender system. They can produce good recommendations when tweaked
+properly; they are not necessarily the fastest recommender systems and are
+thus suitable for small data sets (roughly, less than ten million ratings).
+We'll start with an example of this.
+First, create a DataModel of some kind. Here, we'll use a simple one based
+on data in a file. The file should be in CSV format, with lines of the form
+"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
+    DataModel model = new FileDataModel(new File("data.txt"));
+We'll use the PearsonCorrelationSimilarity implementation of UserSimilarity
+as our user correlation algorithm, and add an optional preference inferrer:
+    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
+    // Optional:
+    userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer());
+Now we create a UserNeighborhood algorithm. Here we use nearest-3:
+    UserNeighborhood neighborhood =
+    	  new NearestNUserNeighborhood(3, userSimilarity, model);
+Now we can create our Recommender, and add a caching decorator:
+    Recommender recommender =
+    	  new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+Now we can get 10 recommendations for user ID "1234" &mdash; done!
+    List<RecommendedItem> recommendations =
+    	  cachingRecommender.recommend(1234, 10);
+### Item-based Recommender
+We could have created an item-based recommender instead. Item-based
+recommenders base recommendations not on user similarity, but on item
+similarity. In theory these are about the same approach to the problem,
+just from different angles. However, the similarity of two items is
+relatively fixed, more so than the similarity of two users. So, item-based
+recommenders can use pre-computed similarity values in the computations,
+which makes them much faster. For large data sets, item-based recommenders
+are more appropriate.
+Let's start over, again with a FileDataModel:
+    DataModel model = new FileDataModel(new File("data.txt"));
+We'll also need an ItemSimilarity. We could use
+PearsonCorrelationSimilarity, which computes item similarity in real time,
+but this is generally too slow to be useful. Instead, in a real
+application, you would feed a list of pre-computed correlations to a
+GenericItemSimilarity:
+    // Construct the list of pre-computed correlations
+    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
+    	  ...;
+    ItemSimilarity itemSimilarity =
+    	  new GenericItemSimilarity(correlations);
+Then we can finish as before to produce recommendations:
+    Recommender recommender =
+    	  new GenericItemBasedRecommender(model, itemSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+    List<RecommendedItem> recommendations =
+    	  cachingRecommender.recommend(1234, 10);
+### Slope-One Recommender
+This is a simple yet effective Recommender, and we present another example
+to round out the list:
+    DataModel model = new FileDataModel(new File("data.txt"));
+    // Make a weighted slope one recommender
+    Recommender recommender = new SlopeOneRecommender(model);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+<a name="RecommenderDocumentation-Integrationwithyourapplication"></a>
+## Integration with your application
+<a name="RecommenderDocumentation-Direct"></a>
+### Direct
+You can create a Recommender, as shown above, wherever you like in your
+Java application, and use it. This includes simple Java applications or GUI
+applications, server applications, and J2EE web applications.
+<a name="RecommenderDocumentation-Standaloneserver"></a>
+### Standalone server
+A Mahout recommender can also be run as an external server, which may be
+the only option for non-Java applications. It can be exposed as a web
+application, and your application can then access recommendations via
+simple HTTP requests and responses. See above, and see the javadoc for
+details.
+<a name="RecommenderDocumentation-Performance"></a>
+## Performance
+<a name="RecommenderDocumentation-RuntimePerformance"></a>
+### Runtime Performance
+The more data you give, the better. Though Mahout is designed for
+performance, you will undoubtedly run into performance issues at some
+point. For best results, consider using the following command-line flags to
+your JVM:
+* -server: Enables the server VM, which is generally appropriate for
+long-running, computation-intensive applications.
+* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte
+doesn't hurt when dealing with tens of millions of preferences. Mahout
+recommenders will generally use as much memory as you give it for caching,
+which helps performance. Set the initial and max size to the same value to
+avoid wasting time growing the heap, and to avoid having the JVM run minor
+collections to avoid growing the heap, which will clear cached values.
+* -da -dsa: Disable all assertions.
+* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most
+of them in this framework
+* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only):
+Use a GC algorithm designed to take advantage of multiple processors, and
+designed for throughput. This is a default in J2SE 5.0.
+* -XX:-DisableExplicitGC: Disable calls to System.gc(). These calls can
+only hurt in the presence of modern GC algorithms; they may force Mahout to
+remove cached data needlessly. This flag isn't needed if you're sure your
+code and third-party code you use doesn't call this method.
+Also consider the following tips:
+* Use CachingRecommender on top of your custom Recommender implementation.
+* When using JDBCDataModel, make sure you've taken basic steps to optimize
+the table storing preference data. Create a primary key on the user ID and
+item ID columns, and an index on them. Set them to be non-null. And so on.
+Tune your database for lots of concurrent reads! When using JDBC, the
+database is almost always the bottleneck. Plenty of memory and caching are
+even more important.
+* Also, pooling database connections is essential to performance. If using
+a J2EE container, it probably provides a way to configure connection pools.
+If you are creating your own DataSource directly, try wrapping it in
+* See MySQL-specific notes on performance in the javadoc for
+MySQLJDBCDataModel.
+<a name="RecommenderDocumentation-AlgorithmPerformance:WhichOneIsBest?"></a>
+### Algorithm Performance: Which One Is Best?
+There is no right answer; it depends on your data, your application,
+environment, and performance needs. Mahout provides the building blocks
+from which you can construct the best Recommender for your application. The
+links below provide research on this topic. You will probably need a bit of
+trial-and-error to find a setup that works best. The code sample above
+provides a good starting point.
+Fortunately, Mahout provides a way to evaluate the accuracy of your
+Recommender on your own data:
+    DataModel myModel = ...;
+    RecommenderBuilder builder = new RecommenderBuilder() {
+      public Recommender buildRecommender(DataModel model) {
+        // build and return the Recommender to evaluate here
+      }
+    };
+    RecommenderEvaluator evaluator =
+    	  new AverageAbsoluteDifferenceRecommenderEvaluator();
+    double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
+For "boolean" data model situations, where there are no notions of
+preference value, the above evaluation based on estimated preference does
+not make sense. In this case, try this kind of evaluation, which presents
+traditional information retrieval figures like precision and recall, which
+are more meaningful:
+    ...
+    RecommenderIRStatsEvaluator evaluator =
+    	new GenericRecommenderIRStatsEvaluator();
+    IRStatistics stats =
+    	evaluator.evaluate(builder, null, myModel, null, 3,
+    RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
+    	1.0);
+<a name="RecommenderDocumentation-UsefulLinks"></a>
+## Useful Links
+You'll want to look at these packages too, which offer more algorithms and
+approaches that you may find useful:
+* [Cofi](
+: A Java-Based Collaborative Filtering Library
+* [CoFE](
+Here's a handful of research papers that I've read and found particularly
+relevant:
+J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive Algorithms for Collaborative Filtering](
+," in Proceedings of the Fourteenth Conference on Uncertainty in
+Artificial Intelligence (UAI 1998), 1998.
+B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative filtering recommendation algorithms](
+" in Proceedings of the Tenth International Conference on the World Wide
+Web (WWW 10), pp. 285-295, 2001.
+P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an open architecture for collaborative filtering of netnews](
+" in Proceedings of the 1994 ACM conference on Computer Supported
+Cooperative Work (CSCW 1994), pp. 175-186, 1994.
+J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic framework for performing collaborative filtering](
+" in Proceedings of the 22nd annual international ACM SIGIR Conference on
+Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
+1999.
+Clifford Lyon, "[Movie Recommender](
+" CSCI E-280 final project, Harvard University, 2004.
+Daniel Lemire, Anna Maclachlan, "[Slope One Predictors for Online Rating-Based Collaborative Filtering](
+," Proceedings of SIAM Data Mining (SDM '05), 2005.
+Michelle Anderson, Marcel Ball, Harold Boley, Stephen Greene, Nancy Howse, Daniel Lemire and Sean McGrath, "[RACOFI: A Rule-Applying Collaborative Filtering System](
+," Proceedings of COLA '03, 2003.
+These links will take you to all the collaborative filtering reading you
+could ever want!
+* [Paul Perry's notes](
+* [James Thornton's collaborative filtering resources](
+* [Daniel Lemire's blog](
+ which frequently covers collaborative filtering topics

Added: mahout/site/trunk/content/recommender-first-timer-faq.mdtext
--- mahout/site/trunk/content/recommender-first-timer-faq.mdtext (added)
+++ mahout/site/trunk/content/recommender-first-timer-faq.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,50 @@
+Title: Recommender First-Timer FAQ
+Many people with an interest in recommenders arrive at Mahout since they're
+building a first recommender system. Some starting questions have been
+asked often enough to warrant a FAQ collecting advice and rules of thumb to
+answer them.
+Don't start with a distributed, Hadoop-based recommender; take on that
+complexity only if necessary. Start with non-distributed recommenders. They
+are simpler, have fewer requirements, and are more flexible.
+As a crude rule of thumb, a system with up to 100M user-item associations
+(ratings, preferences) should "fit" onto one modern server machine with 4GB
+of heap available and run acceptably as a real-time recommender. The system
+is invariably memory-bound, since keeping data in memory is essential to
+performance. Beyond this point it gets expensive to deploy a machine with
+enough RAM, so designing for a distributed solution makes sense when
+nearing this scale.
+However most applications don't "really" have 100M associations to process.
+Data can be sampled; noisy and old data can often be aggressively pruned
+without significant impact on the result.
+The next question is whether or not your system has preference values, or
+ratings. Do users and items merely have an association or not, such as the
+existence or lack of a click? Or is behavior translated into some scalar
+value representing the user's degree of preference for the item?
+If you have ratings, then a good place to start is a
+GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
+metric. If you don't have ratings, then a good place to start is
+GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
+If you want to do content-based item-item similarity, you need to implement
+your own ItemSimilarity.
+If your data can be simply exported to a CSV file, use FileDataModel and
+push new files periodically.
+If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
+counterpart if appropriate, or its PostgreSQL counterpart, etc.) and put a
+ReloadFromJDBCDataModel on top.
+This should give a reasonable starter system which responds fast. The
+nature of the system is that new data comes in from the file or database
+only periodically, perhaps on the order of minutes. If that's not OK,
+you'll have to look into some more specialized work: SlopeOneRecommender
+deals with updates quickly, or it is possible to do some work to update
+the GenericDataModel in real time.

Added: mahout/site/trunk/content/reference-reading.mdtext
--- mahout/site/trunk/content/reference-reading.mdtext (added)
+++ mahout/site/trunk/content/reference-reading.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,216 @@
+Title: Reference Reading
+<a name="ReferenceReading-GeneralClustering"></a>
+# General Clustering
+<a name="ReferenceReading-Discussions"></a>
+## Discussions
+* [clustering tips and tricks ](
+<a name="ReferenceReading-TextClustering"></a>
+# Text Clustering
+<a name="ReferenceReading-ClusteringaspartofSearch"></a>
+## Clustering as part of Search
+* See Chapters on Hierarchical and Flat Clustering as part of search in the [stanford ir book](
+<a name="ReferenceReading-GeneralBackgroundMaterials"></a>
+# General Background Materials
+Can someone recommend good books on statistics and also on linear
+and analytic geometry that will provide enough background for
+machine learning algorithms?
+<a name="ReferenceReading-"></a>
+The answers below focus on general background knowledge, rather than
+specifics of Mahout and associated Apache tooling. Feel free to add useful
+resources (books, but also videos, online courseware, tools), particularly
+those that are available free online.
+This page originated in an email thread, and its different contributors
+might not all agree on the best approach (and they might not know what's
+best for any given learner), but the resources here should give some idea
+of suitable background reading. Check the mailing list [archives](
+ if you care to figure out who-said-what, or find other suggestions.
+Don't be overwhelmed by all the maths, you can do a lot in Mahout with some
+basic knowledge. The resources given here will help you understand your
+data better, and ask better questions both of Mahout's APIs, and also of
+the Mahout community. And unlike learning some particular software tool,
+these are skills that will remain useful decades later.
+### Books and supporting materials on statistics, machine learning, matrices etc.
+[Gilbert Strang](
+'s [Introduction to Linear Algebra](
+ (*full text* online, highly recommended by several on the mahout list).
+His lectures are also [available online](
+ and are strongly recommended.
+"Mathematical Tools for Applied Multivariate Analysis" by J. Douglass
+[Stanford Machine Learning online courseware](
+"It's a very nicely taught course with super helpful lecture notes - and
+you can get all the videos in youtube or [iTunesU](
+"The [section notes](
+ for this course will give you enough review material on linear algebra and
+probability theory to get you going."
+[MIT Machine Learning online courseware](
+ (6.867) has [Lecture notes in PDF](
+ online.
+As a pre-requisite to probability and statistics, you'll need [basic calculus](
+.  A maths for scientists text might be useful here such as 'Mathematics
+for Engineers and Scientists', Alan Jeffrey, Chapman & Hall/CRC.
+One of the best writers in the probability/statistics world is Sheldon
+Ross. Try
+*A First Course in Probability (8th Edition), Pearson* ([amazon](
+) and then move on to his *Introduction to Probability Models (9th
+Edition), Academic Press*.
+Some good introductory alternatives here are:
+[Kahn Academy](
+ -- videos on stats, probability, linear algebra
+Probability and Statistics (7th Edition), Jay L. Devore, Chapman.
+Probability and Statistical Inference (7th Edition), Hogg and Tanis.
+Once you have a grasp of the basics then there are a slew of great texts
+that you might consult:  for example,
+Statistical Inference, Casella and Berger, Duxbury/Thomson Learning.
+Most statistics books will have some sort of introduction to Bayesian
+methods, consider a specialist text, e.g.:
+Introduction to Bayesian Statistics (2nd Edition), William H. Bolstad,
+Wiley.
+Then for the computational side of Bayesian (predominantly Markov chain
+Monte Carlo), e.g.
+Bolstad's Understanding Computational Bayesian Statistics, Wiley.
+Then you might try [Bayesian Data Analysis, Gelman et al., Chapman &Hall/CRC](
+On top of the books, [R](
+ is an indispensable software tool for visualizing distributions and
+doing calculations.
+(another viewpoint)
+For statistics related to machine learning, I would avoid normal
+statistical texts and go with these instead
+[Pattern Recognition and Machine Learning by Chris Bishop](
+[Elements of Statistical Learning](
+ by Trevor Hastie, Robert Tibshirani, Jerome Friedman 
+Also [](
+What about matrix computations/decomposition/factorization etc.?
+[How's this one?](
+Any idea? Any other suggestions?
+I found the one by Peter V. O'Neil, "Introduction to Linear Algebra", to be
+a great book for beginners
+(with some knowledge of calculus). It is not comprehensive, but, I believe,
+it will be a good place to start, and the author begins by explaining the
+concepts with regard to vector spaces, which I found to be a more natural
+way of explaining.
+David S. Watkins "Fundamentals of Matrix Computations (Pure and Applied
+Mathematics: A Wiley Series of Texts, Monographs and Tracts)"
+The Golub / Van Loan text you mention is the classic text for numerical
+linear algebra. Can't go wrong with it. However, I'd also suggest you look
+at Nick Trefethen's "Numerical Linear Algebra". It's a bit more
+approachable for practitioners; GVL is better suited for researchers.
+ (with some online lecture notes)
+I think this is the most relevant book for matrix math on distributed
+systems. Many chapters on SVD; there are even chapters on Lanczos.
+BTW what about R? There are literally tons of books in the R series devoted
+to rather isolated problems, but what would be a good crash course?
+Ted Dunning:
+"I have found that learning about R is a difficult thing.  The best
+introduction I have seen is, paradoxically, not really a book about R and
+assumes a statistical mind-set that I disagree with.  That introduction is
+in MASS [](
+. Other references can also help.
+In addition, you should see how to plot data well.
+Generally, I learn more about R by watching people and reading code than by
+reading books. There are many small tricks, like how to format data
+optimally, how to restructure data.frames, common ways to plot data, which
+libraries do what and so on, that an introductory book cannot convey as
+general principles that will see you through to success."
+For Javascript/Web plotting: [](

Added: mahout/site/trunk/content/restricted-boltzmann-machines.mdtext
--- mahout/site/trunk/content/restricted-boltzmann-machines.mdtext (added)
+++ mahout/site/trunk/content/restricted-boltzmann-machines.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,43 @@
+Title: Restricted Boltzmann Machines
+NOTE: This implementation is a Work-In-Progress, at least till September
+2010. The JIRA issue is [here](
+<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a>
+### Boltzmann Machines
+Boltzmann Machines are a type of stochastic neural network that closely
+resembles physical processes. They define a network of units with an overall
+energy that is evolved over a period of time, until it reaches thermal
+equilibrium. However, the convergence speed of Boltzmann machines that have
+unconstrained connectivity is low.
+<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a>
+### Restricted Boltzmann Machines
+Restricted Boltzmann Machines are a variant that is 'restricted' in the
+sense that connections between hidden units of a single layer are _not_
+allowed. In addition, stacking multiple RBMs is also feasible, with the
+activities of the hidden units forming the base for a higher-level RBM. The
+combination of these two features renders RBMs highly usable in practice.
+In the Netflix Prize, RBMs offered distinctly orthogonal predictions to
+SVD and k-NN approaches, and contributed immensely to the final solution.
+<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a>
+### RBM's in Apache Mahout
+An implementation of Restricted Boltzmann Machines is being developed for
+Apache Mahout as a Google Summer of Code 2010 project. A recommender
+interface will also be provided. The key aims of the implementation are:
+1. Accurate - should replicate known results, including those of the
+Netflix Prize
+1. Fast - the implementation uses Map-Reduce, hence it should be fast
+1. Scalable - should scale to large datasets, with a design whose critical
+parts don't need a dependency between the amount of memory on your cluster
+systems and the size of your dataset
+You can view the patch as it develops [here](

Added: mahout/site/trunk/content/rowsimilarityjob.mdtext
--- mahout/site/trunk/content/rowsimilarityjob.mdtext (added)
+++ mahout/site/trunk/content/rowsimilarityjob.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,35 @@
+Title: RowSimilarityJob
+A brief description of RowSimilarityJob:
+(originally from [mailing list](
+The goal is to compute all pairwise similarities between the rows of a
+sparse matrix A.
+The computation should be executed in a way that only rows that have at
+least one non-zero value in the same dimension (column) are compared. We
+need this to avoid a quadratic number of pairwise comparisons.
+Furthermore we should be able to 'embed' arbitrary similarity measures
+and we should always be able to use a combiner in all MapReduce steps.
+The computation is executed using three MapReduce passes:
+In the first step, the rows of A are preprocessed via
+VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
+scaled to unit-length), a single number for each row of A is computed
+via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
+The second step operates on the rows of A' (the columns of A). The
+mapper sums up all pairwise cooccurrences using
+VectorSimilarityMeasure.aggregate() (as vectors, thereby using the
+so-called 'stripes' pattern). The reducer sums up all cooccurrence vectors
+for one row and uses the similarity measure and the precomputed numbers
+from step one to compute all similarities via
+VectorSimilarityMeasure.similarity().
+The third step ensures that only the top k similar rows per row are kept.
+It's hard to see from the code but actually the job performs the matrix
+multiplication AA' via outer products with a modified (similarity
+measure specific) dot product.
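The three passes above can be sketched in miniature, single-process Python. Cosine is used here as the embedded measure: normalize() is the identity, norm() is the L2 norm, and the column-wise loop mimics the 'stripes' cooccurrence aggregation (names and structure are illustrative, not Mahout's API):

```python
import math
from collections import defaultdict

def row_similarities(rows):
    """rows: row_id -> {column: value}. Returns cosine similarity
    for every pair of rows sharing at least one non-zero column."""
    # pass 1: per-row "norm" (here: L2 norm)
    norms = {r: math.sqrt(sum(v * v for v in vec.values()))
             for r, vec in rows.items()}
    # transpose: group (row, value) entries by column, as in A'
    cols = defaultdict(list)
    for r, vec in rows.items():
        for c, v in vec.items():
            cols[c].append((r, v))
    # pass 2: sum pairwise cooccurrence products per column ("stripes")
    dots = defaultdict(float)
    for entries in cols.values():
        for i in range(len(entries)):
            for j in range(i + 1, len(entries)):
                (r1, v1), (r2, v2) = entries[i], entries[j]
                key = (min(r1, r2), max(r1, r2))
                dots[key] += v1 * v2
    # combine dot products with the precomputed norms
    return {(a, b): d / (norms[a] * norms[b])
            for (a, b), d in dots.items()}
```

Note that only pairs which cooccur in some column ever appear in the output, which is exactly how the quadratic number of comparisons is avoided.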

Added: mahout/site/trunk/content/sample-clusters-animation.mdtext
--- mahout/site/trunk/content/sample-clusters-animation.mdtext (added)
+++ mahout/site/trunk/content/sample-clusters-animation.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,53 @@
+Title: Sample Clusters Animation
+<a name="SampleClustersAnimation-DemoAnimation"></a>
+# Demo Animation
+This is an animation made from screen caps of all the 
+o.a.m.clustering.display.Display*.java demo apps.
+All of the programs used the same set of random samples. The ellipses show
+the different clustering algorithms. For more details, see the programs.
+_This animation was made with [giftedmotion.jar](
+. What a border!_
+<a name="SampleClustersAnimation-ScreenCaptures"></a>
+# Screen Captures
+<a name="SampleClustersAnimation-DisplayClustering"></a>
+## DisplayClustering
+The original random dataset with semi-illustrative ellipses.
+<a name=""></a>
+<a name=""></a>
+KMeans algorithm with _significance_ set to 5% or better.
+<a name=""></a>
+Dirichlet Process algorithm, based on a normal distribution, with
+_significance_ set to 5% or better.
+<a name=""></a>
+<a name=""></a>
+MeanShift variant of Canopy algorithm, with significance set to 2%. (In the
+code it seems like this should be 5%?)
+<a name=""></a>
+When this works. And when someone wants to redo the animation.

Added: mahout/site/trunk/content/spectral-clustering.mdtext
--- mahout/site/trunk/content/spectral-clustering.mdtext (added)
+++ mahout/site/trunk/content/spectral-clustering.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,262 @@
+Title: Spectral Clustering
+Spectral clustering, a more powerful and specialized algorithm (compared to
+K-means), derives its name from spectral analysis of a graph, which is how
+the data are represented. Each object to be clustered can initially be
+represented as an _n_\-dimensional numeric vector, but the difference with
+this algorithm is that there must also be some method for performing a
+comparison between each object and expressing this comparison as a scalar.
+This _n_ by _n_ comparison of all objects with all others forms the
+_affinity_ matrix, which can be intuitively thought of as a rough
+representation of an underlying undirected, weighted, and fully-connected
+graph whose edges express the relative relationships, or affinities,
+between each pair of objects in the original data. This affinity matrix
+forms the basis from which the two spectral clustering algorithms operate.
+The equation by which the affinities are calculated can vary depending on
+the user's circumstances; typically, the equation takes the form of:
+exp( _-d^2_ / _c_ )
+where _d_ is the Euclidean distance between a pair of points, and _c_ is a
+scaling factor. _c_ is often calculated relative to a _k_\-neighborhood of
+closest points to the current point; all other affinities are set to 0
+outside of the neighborhood. Again, this formula can vary depending on the
+situation (e.g. a fully-connected graph would ignore the _k_\-neighborhood
+and calculate affinities for all pairs of points).
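The fully-connected variant of that formula can be sketched in Python (the function name and the choice of a single global scaling factor c are illustrative assumptions):

```python
import math

def affinity_matrix(points, c=2.0):
    """Pairwise affinities exp(-d^2 / c) over n-dimensional points;
    self-affinities are left at 0 by convention."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2
                         for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-d2 / c)
    return A
```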
+[Full overview on spectral clustering](
+<a name="SpectralClustering-K-MeansSpectralClustering"></a>
+## K-Means Spectral Clustering
+<a name="SpectralClustering-Overview"></a>
+### Overview
+This consists of a few basic steps of generalized spectral clustering,
+followed by standard k-means clustering over the intermediate results.
+Again, this process begins with an affinity matrix *A*; whether or not it
+is fully-connected depends on the user's need.
+*A* is then transformed into a pseudo-Laplacian matrix via a multiplication
+with a diagonal matrix whose entries consist of the sums of the rows of
+*A*. The sums are modified to be the inverse square root of their original
+values. The final operation looks something like:
+L = D^{-1/2} A D^{-1/2}
+*L* has some properties that are of interest to us; most importantly, while
+it is symmetric like *A*, it has a more stable eigen-decomposition. *L* is
+decomposed into its constituent eigenvectors and corresponding eigenvalues
+(though the latter will not be needed for future calculations); the matrix
+of eigenvectors, *U*, is what we are now interested in.
+Assuming *U* is a column matrix (the eigenvectors comprise the columns),
+then we will now use the _rows_ of *U* as proxy data for the original data
+points. We will run each row through standard K-means clustering, and the
+label that each proxy point receives will be transparently assigned to the
+corresponding original data point, resulting in the final clustering
+assignment.
+[Full overview on k-means spectral clustering](
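The normalization step can be sketched as follows in pure Python; a real implementation would then eigen-decompose L and run k-means on the rows of U, which is omitted here:

```python
import math

def pseudo_laplacian(A):
    """L = D^{-1/2} A D^{-1/2}, where D holds the row sums of A."""
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in A]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]
```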
+<a name="SpectralClustering-Implementation"></a>
+### Implementation
+The Mahout implementation consists of a single driver -
+SpectralKMeansDriver - calling upon several common utilities. The driver
+performs the operations in sequential fashion: reading in and constructing
+the affinity matrix, building the diagonal matrix, building the
+pseudo-Laplacian and decomposing it, and clustering the components.
+The affinity matrix input is the most important part of this algorithm. It
+consists of text files which follow a specific format: that of a weighted,
+undirected graph. In order to represent a graph in text files, each line of
+a text file represents a single directional edge between two nodes. There
+are three comma-separated values on the line. The first number indicates
+the source node, the second is the destination node, and the third is the
+weight. For example:
+0, 1, 2.5
+would indicate the directional edge from node 0 to node 1 has a weight of
+2.5. *Please note: as of 8/16/2010, Eigencuts assumes the affinity matrix
+is symmetric, hence there should be a corresponding line in the text file
+of: 1, 0, 2.5.* Also, each node should be an integer value.
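A reader for this edge-list format might look like the following sketch (a hypothetical helper, not Mahout's AffinityMatrixInputJob), including a check of the symmetry assumption noted above:

```python
def read_affinity(lines, n):
    """Parse 'source, dest, weight' lines into an n x n matrix."""
    A = [[0.0] * n for _ in range(n)]
    for line in lines:
        i, j, w = (s.strip() for s in line.split(","))
        A[int(i)][int(j)] = float(w)
    # Eigencuts assumes the affinity matrix is symmetric
    for i in range(n):
        for j in range(n):
            if A[i][j] != A[j][i]:
                raise ValueError("affinity matrix must be symmetric")
    return A
```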
+*note*: per MAHOUT-978, don't name your input file with a leading '_' (or
+'.'); Hadoop treats such files as hidden and silently ignores them.
+M/R jobs written for SpectralKMeans:
+* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
+* MatrixDiagonalizeJob (constructs the diagonal matrix)
+* UnitVectorizerJob (converts the eigenvector matrix *U* to unit rows)
+* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
+M/R jobs already in Mahout that were used:
+* DistributedRowMatrix.transpose()
+* DistributedLanczosSolver
+* EigenVerifierJob
+* KMeansDriver
+<a name="SpectralClustering-EigencutsSpectralClustering"></a>
+## Eigencuts Spectral Clustering
+<a name="SpectralClustering-Overview"></a>
+### Overview
+Intuitively, Eigencuts can be thought of as part 2 of the K-means
+algorithm, in that it performs the same initial steps up until the k-means
+clustering. The algorithm uses the same affinity matrix *A*, constructs the
+same diagonal matrix *D*, performs the same multiplication to create the
+pseudo-Laplacian *L*, and conducts an eigen-decomposition on *L* to obtain
+the eigenvectors and eigenvalues. But here is where the two algorithms
+diverge.
+For each eigenvector, we wish to determine how stable its flow of
+probability is within the underlying graph of the original data.
+Intuitively, this is intrinsically related to the min-cut, max-flow problem
+of finding bottlenecks: if we perturb the flow rate on a specific edge, and
+overall the flow is stable, then we can conclude that this edge was not a
+bottleneck. If, however, perturbing an edge significantly alters the
+overall flow, we know this edge's eigenflow is very unstable and the edge
+is a bottleneck.
+We have an [explicit form](
+ of this "sensitivity" calculation ([full post here, under "computing
+The next step is called "non-maximal suppression", which effectively means
+we will ignore any of the calculated sensitivities for which there exists
+another more negative sensitivity in the same neighborhood as the current
+one, effectively "suppressing" it.
+This non-maximal suppression then plays a role in the final
+affinity-cutting step, where we "cut" the affinity (set to 0) between any
+two nodes (effectively destroying the edge between them) for which the
+sensitivity calculated at that edge passes some threshold, _and_ for which
+the sensitivity was _not_ suppressed in the previous step.
+Once the cutting has been completed, the process loops upon itself
+(starting with the recalculation of *D* using the modified *A*) until no
+new cuts in *A* are made in the final step.
+[Full overview on Eigencuts spectral clustering](
+<a name="SpectralClustering-Implementation"></a>
+### Implementation
+Since the first half of Eigencuts uses the same calculations as Spectral
+K-means, it uses the same common M/R tasks, both those specific to spectral
+clustering, as well as those general to Mahout. Unlike SpectralKMeans,
+however, there are no DistributedRowMatrix-specific operations performed,
+and hence there is no need for the data type at all; Mahout Vectors are
+used heavily instead.
+Once the initial affinity matrix is constructed, there is a loop within the
+EigencutsDriver over the calculation of *D*, the creation of *L* and its
+eigen-decomposition, the calculation of the sensitivities, and the actual
+affinity cuts, such that the loop terminates only when no cuts are made to
+the affinity matrix *A*. The final matrix *A* will then be representative
+of a graph structure whose data points exist in intra-connected clusters.
+M/R tasks specific to Eigencuts:
+* EigencutsSensitivityJob (calculates the perturbation effects on edge
+weights)
+* EigencutsAffinityCutsJob (sets edge weights to 0)
+M/R tasks within spectral clustering:
+* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
+* MatrixDiagonalizeJob (constructs the diagonal matrix)
+* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
+M/R tasks general to Mahout:
+* DistributedLanczosSolver
+* EigenVerifierJob
+<a name="SpectralClustering-Quickstart"></a>
+## Quickstart
+As noted before, the data for both these algorithms - the affinity matrix -
+is required to be symmetric. As of the first release, there is no built-in
+mechanism in Mahout for generating the affinities from raw data, as the
+formula this follows varies depending on the user's need, so it is left as
+an exercise to the user to generate the affinities prior to using these
+algorithms.
+The affinity input should follow the standard format for textual
+representation of a graph, namely:
+node_i, node_j, value_ij
+For example, the following 3x3 affinity matrix:
+<table>
+<tr><td> 0.0 </td><td> 0.8 </td><td> 0.5 </td></tr>
+<tr><td> 0.8 </td><td> 0.0 </td><td> 0.9 </td></tr>
+<tr><td> 0.5 </td><td> 0.9 </td><td> 0.0 </td></tr>
+</table>
+would have the following textual input format:
+0, 0, 0
+0, 1, 0.8
+0, 2, 0.5
+1, 0, 0.8
+1, 1, 0
+1, 2, 0.9
+2, 0, 0.5
+2, 1, 0.9
+2, 2, 0
+Then simply invoke Spectral K-means or Eigencuts, using the input directory
+and setting the other parameters as necessary (e.g. "k" for K-means, "beta"
+for Eigencuts, etc).
+Format *rules*: node numbering begins at zero and is continuous; avoid
+whitespace, comments, etc.
+<a name="SpectralClustering-Examples"></a>
+## Examples
+*NOTE: Am still waiting for Carnegie Mellon/Univ. of Pittsburgh approval
+for official data set*
+For these algorithms, it is useful to have a viable example, so I have
+created a small but effective synthetic data set to show how these
+algorithms operate. The raw data was generated synthetically and [can be viewed here](
+. It consists of 450 two-dimensional points drawn from 3 separate gaussian
+distributions, each with its own mean and standard deviation ([here is an
+image of the plotted
+This same data in affinity matrix form looks like this: [view the affinity data set](
+In order to run the program, then, for spectral k-means:
+
+    bin/mahout spectralkmeans -i /path/to/directory/with/affinity/matrix -o /output/path -k 3 -d 450
+
+and for eigencuts:
+
+    bin/mahout eigencuts -i /path/to/directory/with/affinity/matrix -o /output/path -b 2 -d 450
+In both cases, the "-d" flag refers to the dimensionality of the affinity
+matrix; since there are 450 points, this value will be 450. In spectral
+k-means, the "-k" is the number of clusters to estimate. In Eigencuts, the
+"-b" refers to an initial estimate on the half-life of the flow of
+probability in the graph; higher estimates correspond to fewer clusters.
+Spectral k-means will eventually yield the clustering results, and there
+should only be a single mistake: point 54. This is because that particular
+point is in between the two left-hand clusters in the image, and so has
+high affinities with both clusters.
+[Full overview of example here](
+<a name="SpectralClustering-Resources"></a>
+### Resources
+* [](
+* [](
+* [](
+* [](

Added: mahout/site/trunk/content/stochastic-singular-value-decomposition.mdtext
--- mahout/site/trunk/content/stochastic-singular-value-decomposition.mdtext (added)
+++ mahout/site/trunk/content/stochastic-singular-value-decomposition.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,73 @@
+Title: Stochastic Singular Value Decomposition
+The Stochastic SVD method in Mahout produces a reduced-rank Singular Value
+Decomposition output in its strict mathematical definition: A=USV'.
+<a name="StochasticSingularValueDecomposition-Thebenefitsoverothermethodsare:"></a>
+##### The benefits over other methods are: 
+* reduced flops required compared to Krylov subspace methods
+* in the map-reduce world, a fixed number of MR iterations is required
+regardless of the rank requested
+* Tweak precision/speed balance with options.
+* A is a Distributed Row Matrix where rows may be identified by any
+Writable (such as a document path). As such, it would work directly on the
+output of seq2sparse. 
+* As of 0.7 trunk, includes PCA and dimensionality reduction workflow
+(EXPERIMENTAL! Feedback on performance/other PCA related issues/ blogs is
+greatly appreciated.)
+##### Map-reduce characteristics
+SSVD uses at most 3 MR _sequential_ steps (map-only + map-reduce + 2
+optional parallel map-reduce jobs) to produce reduced rank approximation of
+U, V and S matrices. Additionally, two more map-reduce steps are added for
+each power iteration step if requested.
+<a name="StochasticSingularValueDecomposition-Potentialdrawbacks:"></a>
+##### Potential drawbacks: 
+* potentially less precise (but adding even one power iteration seems to
+fix that quite a bit).
+<a name="StochasticSingularValueDecomposition-Documentation"></a>
+##### Documentation
+[Overview and Usage](^ssvd-cli.pdf.html)
+Note: please use 0.6 or later! For the PCA workflow, please use 0.7 or
+later.
+<a name="StochasticSingularValueDecomposition-Publications"></a>
+##### Publications
+[Nathan Halko's dissertation](
+ "Randomized methods for computing low-rank
+approximations of matrices" contains comprehensive definition of
+parallelization strategy taken in Mahout SSVD implementation and also some
+precision/scalability benchmarks, esp. w.r.t. Mahout Lanczos implementation
+on a typical corpus data set.
+<a name="StochasticSingularValueDecomposition-Rsimulation"></a>
+##### R simulation
+[Non-parallel SSVD simulation in R with power iterations and PCA options](^ssvd.r.html)
+. Note that this implementation is not the most optimal for a sequential
+solver, but it is for demonstration purposes only.
+However, try this R code to simulate a meaningful input:
+**tests.R**
+    n<-1000
+    m<-2000
+    k<-10
+    qi<-1
+    #simulated input
+    svalsim<-diag(k:1)
+    usim<- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k)))
+    vsim<- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k)))
+    x<- usim %*% svalsim %*% t(vsim) 
+and try to compare ssvd.svd(x) and stock svd(x) performance for the
+same rank k; notice the difference in the running time. Also play with
+power iterations (qIter) and compare accuracies of standard svd and SSVD.

View raw message