Added: mahout/site/trunk/content/quickstart.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/quickstart.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/quickstart.mdtext (added)
+++ mahout/site/trunk/content/quickstart.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,63 @@
+Title: Quickstart
+<a name="QuickstartDownloadandInstallation"></a>
+# Download and Installation
+
+* [Download and installation](buildingmahout.html)
+
+<a name="QuickstartRunningtheExamples"></a>
+# Running the Examples
+
+Mahout runs on [Apache Hadoop](http://hadoop.apache.org). To run these
+examples, install the latest compatible [^1] [^2] [Hadoop Common release](http://hadoop.apache.org/common/).
+
+[^1]: When using a release, see the release notes.
+[^2]: When using trunk, see <i>parent/pom.xml</i>.
+
+<a name="QuickstartClustering"></a>
+## Clustering
+
+* [Clustering of synthetic control data](clusteringofsyntheticcontroldata.html)
+* [Visualizing Sample Clusters](visualizingsampleclusters.html)
+* [Clustering Seinfeld Episodes](clusteringseinfeldepisodes.html)
+
+
+<a name="QuickstartClassification"></a>
+## Classification
+
+* [Twenty Newsgroups](twentynewsgroups.html)
+* [Wikipedia Bayes Example](wikipediabayesexample.html)
+
+<a name="QuickstartGeneticProgramming"></a>
+## Genetic Programming
+
+* [Watchmaker](mahout.ga.tutorial.html)
+
+<a name="QuickstartDecisionForest"></a>
+## Decision Forest
+
+* [Breiman Example](breimanexample.html)
+* [Partial Implementation](partialimplementation.html)
+
+<a name="QuickstartRecommendationmining"></a>
+## Recommendation mining
+
+This package comes with four examples, based on the Netflix,
+Book-Crossing, GroupLens, and Jester data sets.
+
+* [RecommendationExamples](recommendationexamples.html)
+
+<a name="QuickstartWheretoGoNext"></a>
+# Where to Go Next
+
+* If you are working with text, read more on [Creating Vectors from Text](creatingvectorsfromtext.html)
+* To learn more about grouping items by similarity and identifying
+clusters, read more on [Clustering your data](clusteringyourdata.html)
+* If you want to classify incoming items into predefined categories, read
+more on [Classifying your data](classifyingyourdata.html)
+* To learn how to mine frequent patterns from your data, read more on [Parallel Frequent Pattern Mining](parallelfrequentpatternmining.html)
+* To read more on building recommendation engines, have a look at the [Taste documentation](http://lucene.apache.org/mahout/taste.html)
+and the Taste Hadoop command line (TasteCommandLine)
+
+Go back to the [Main Page](mahoutwiki.html) for more information.
+
Added: mahout/site/trunk/content/randomforests.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/randomforests.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/randomforests.mdtext (added)
+++ mahout/site/trunk/content/randomforests.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,228 @@
+Title: Random Forests
+<a name="RandomForestsHowtogrowaDecisionTree"></a>
+### How to grow a Decision Tree
+
+source: [3](3.html)
+
+    LearnUnprunedTree(X, Y)
+
+    Input: X, a matrix of R rows and M columns, where X_ij = the value of
+        the j'th attribute in the i'th input datapoint. Each column
+        consists of either all real values or all categorical values.
+    Input: Y, a vector of R elements, where Y_i = the output class of the
+        i'th datapoint. The Y_i values are categorical.
+    Output: an unpruned decision tree
+
+    If all records in X have identical values in all their attributes
+    (this includes the case where R < 2), return a Leaf Node predicting
+    the majority output, breaking ties randomly. This case also includes
+    the case where all values in Y are the same: return a Leaf Node
+    predicting this value as the output.
+    Else
+        Select m variables at random out of the M variables.
+        For j = 1 .. m
+            If the j'th attribute is categorical
+                IG_j = IG(Y|X_j)    (see Information Gain)
+            Else (the j'th attribute is real-valued)
+                IG_j = IG*(Y|X_j)   (see Information Gain)
+        Let j* = argmax_j IG_j    (this is the splitting attribute we'll use)
+        If j* is categorical then
+            For each value v of the j*'th attribute
+                Let X^v = the subset of rows of X in which X_ij* = v.
+                Let Y^v = the corresponding subset of Y.
+                Let Child^v = LearnUnprunedTree(X^v, Y^v)
+            Return a decision tree node splitting on the j*'th attribute.
+            The number of children equals the number of values of the
+            j*'th attribute, and the v'th child is Child^v.
+        Else (j* is real-valued); let t be the best split threshold
+            Let X^LO = the subset of rows of X in which X_ij* <= t.
+            Let Y^LO = the corresponding subset of Y.
+            Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
+            Let X^HI = the subset of rows of X in which X_ij* > t.
+            Let Y^HI = the corresponding subset of Y.
+            Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
+            Return a decision tree node splitting on the j*'th attribute.
+            It has two children corresponding to whether the j*'th
+            attribute is above or below the given threshold.
+
+*Note*: there are alternatives to Information Gain for splitting nodes.
+
+
+<a name="RandomForestsInformationgain"></a>
+### Information gain
+
+source: [3](3.html)
+
+#### Nominal attributes
+
+Suppose X can take one of m values V_1, V_2, ..., V_m, with
+P(X = V_1) = p_1, P(X = V_2) = p_2, ..., P(X = V_m) = p_m.
+
+    H(X) = -sum_{j=1..m} p_j log2 p_j    (the entropy of X)
+    H(Y|X = v) = the entropy of Y among only those records in which X has
+                 value v
+    H(Y|X) = sum_j p_j H(Y|X = v_j)
+    IG(Y|X) = H(Y) - H(Y|X)
+
+#### Real-valued attributes
+
+Suppose X is real-valued. Define:
+
+    IG(Y|X:t) = H(Y) - H(Y|X:t)
+    H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
+    IG*(Y|X) = max_t IG(Y|X:t)
+
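The entropy and information-gain formulas above can be sketched in plain Java. The class and method names are illustrative only, not part of the Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of entropy and information gain for a nominal attribute X and a
// nominal class Y, following the formulas in the text.
public class InfoGain {

    // H(Y) = -sum_j p_j log2 p_j
    static double entropy(String[] y) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : y) counts.merge(v, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / y.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v)
    static double informationGain(String[] x, String[] y) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : x) counts.merge(v, 1, Integer::sum);
        double conditional = 0.0;
        for (String v : counts.keySet()) {
            // collect the subset of Y where X = v
            int n = counts.get(v);
            String[] subset = new String[n];
            int k = 0;
            for (int i = 0; i < x.length; i++) {
                if (x[i].equals(v)) subset[k++] = y[i];
            }
            conditional += ((double) n / x.length) * entropy(subset);
        }
        return entropy(y) - conditional;
    }

    public static void main(String[] args) {
        String[] x = {"sunny", "sunny", "rain", "rain"};
        String[] y = {"yes", "yes", "no", "no"};
        System.out.println(entropy(y));            // 1.0 for a 50/50 split
        System.out.println(informationGain(x, y)); // 1.0: X determines Y here
    }
}
```

An attribute that tells you nothing about Y (e.g. one whose values are split evenly across both classes) yields an information gain of zero, which is why the tree grower picks the argmax.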
+<a name="RandomForestsHowtogrowaRandomForest"></a>
+### How to grow a Random Forest
+
+source: [1](1.html)
+
+Each tree is grown as follows:
+1. If the number of cases in the training set is *N*, sample *N* cases at
+random, but with replacement, from the original data. This sample will be
+the training set for growing the tree.
+1. If there are *M* input variables, a number *m << M* is specified such
+that at each node, *m* variables are selected at random out of the *M*, and
+the best split on these *m* is used to split the node. The value of *m* is
+held constant while the forest is grown.
+1. Each tree is grown to the largest extent possible. There is no pruning.
+
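Steps 1 and 2 above amount to two kinds of sampling, sketched below in plain Java. This is illustrative code, not Mahout's implementation:

```java
import java.util.Random;

// Sketch of random-forest sampling: a bootstrap sample of N row indices
// drawn with replacement, and a random subset of m of the M attributes.
public class ForestSampling {

    static int[] bootstrap(int n, Random rng) {
        int[] sample = new int[n];   // N cases, drawn with replacement
        for (int i = 0; i < n; i++) {
            sample[i] = rng.nextInt(n);
        }
        return sample;
    }

    static int[] randomAttributes(int m, int bigM, Random rng) {
        // partial Fisher-Yates shuffle: the first m slots end up holding a
        // uniformly random m-subset of {0, ..., M-1}
        int[] idx = new int[bigM];
        for (int i = 0; i < bigM; i++) idx[i] = i;
        for (int i = 0; i < m; i++) {
            int j = i + rng.nextInt(bigM - i);
            int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
        }
        int[] chosen = new int[m];
        System.arraycopy(idx, 0, chosen, 0, m);
        return chosen;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int[] rows = bootstrap(10, rng);           // training set for one tree
        int[] attrs = randomAttributes(3, 9, rng); // m = sqrt(M) is typical
        System.out.println(rows.length + " rows, " + attrs.length + " attributes");
    }
}
```

Because the bootstrap draws with replacement, some rows appear several times and roughly one-third appear not at all; those left-out rows are what the out-of-bag error estimate below is built from.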
+<a name="RandomForestsRandomForestparameters"></a>
+### Random Forest parameters
+
+source: [2](2.html)
+
+Random Forests are easy to use; the only two parameters a user of the
+technique has to determine are the number of trees to be used and the
+number of variables (*m*) to be randomly selected from the available set of
+variables.
+Breiman's recommendation is to pick a large number of trees, and to use
+the square root of the number of variables for *m*.
+
+
+<a name="RandomForestsHowtopredictthelabelofacase"></a>
+### How to predict the label of a case
+
+    Classify(node, V)
+
+    Input: node, a node from the decision tree; if node.attribute = j
+        then the split is done on the j'th attribute.
+    Input: V, a vector of M columns, where V_j = the value of the j'th
+        attribute.
+    Output: the label of V
+
+    If node is a Leaf then
+        Return the value predicted by node
+    Else
+        Let j = node.attribute
+        If the j'th attribute is categorical then
+            Let v = V_j
+            Let child^v = the child node corresponding to the attribute's
+                value v
+            Return Classify(child^v, V)
+        Else (the j'th attribute is real-valued)
+            Let t = node.threshold    (split threshold)
+            If V_j < t then
+                Let child^LO = the child node corresponding to (< t)
+                Return Classify(child^LO, V)
+            Else
+                Let child^HI = the child node corresponding to (>= t)
+                Return Classify(child^HI, V)
+
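A runnable sketch of this procedure for a real-valued split, using a hand-built two-leaf tree. The node layout is illustrative; Mahout's internal tree representation differs:

```java
// Sketch of the Classify procedure over a minimal decision-tree node type.
public class TreeClassify {

    static class Node {
        boolean leaf;
        String label;      // set when leaf
        int attribute;     // split attribute j
        double threshold;  // split threshold t (real-valued attributes)
        Node lo, hi;       // children for V_j < t and V_j >= t
    }

    static String classify(Node node, double[] v) {
        if (node.leaf) {
            return node.label;               // value predicted by node
        }
        if (v[node.attribute] < node.threshold) {
            return classify(node.lo, v);     // child corresponding to (< t)
        } else {
            return classify(node.hi, v);     // child corresponding to (>= t)
        }
    }

    public static void main(String[] args) {
        // a one-split tree on attribute 0 with threshold 1.8
        Node lo = new Node();  lo.leaf = true;  lo.label = "short";
        Node hi = new Node();  hi.leaf = true;  hi.label = "tall";
        Node root = new Node();
        root.attribute = 0; root.threshold = 1.8; root.lo = lo; root.hi = hi;

        System.out.println(classify(root, new double[]{1.6})); // short
        System.out.println(classify(root, new double[]{1.9})); // tall
    }
}
```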
+
+<a name="RandomForestsTheoutofbag(oob)errorestimation"></a>
+### The out of bag (oob) error estimation
+
+source: [1](1.html)
+
+In random forests, there is no need for cross-validation or a separate test
+set to get an unbiased estimate of the test set error. It is estimated
+internally, during the run, as follows:
+* Each tree is constructed using a different bootstrap sample from the
+original data. About one-third of the cases are left out of the bootstrap
+sample and not used in the construction of the _k_th tree.
+* Put each case left out of the construction of the _k_th tree down the
+_k_th tree to get a classification. In this way, a test set classification
+is obtained for each case in about one-third of the trees. At the end of
+the run, take *j* to be the class that got most of the votes every time
+case *n* was _oob_. The proportion of times that *j* is not equal to the
+true class of *n*, averaged over all cases, is the _oob error estimate_.
+This has proven to be unbiased in many tests.
+
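The oob bookkeeping described above can be sketched as follows, assuming the per-case oob votes have already been collected. The helper and its signature are hypothetical, not Mahout code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the out-of-bag error estimate: for each case n, take the
// majority class j among votes from trees where n was out of bag, and
// count how often j differs from the true class.
public class OobEstimate {

    // oobVotes[n] holds the predicted labels for case n from the trees in
    // which case n was out of bag; truth[n] is its true class.
    static double oobError(String[][] oobVotes, String[] truth) {
        int wrong = 0;
        for (int n = 0; n < truth.length; n++) {
            Map<String, Integer> tally = new HashMap<>();
            for (String vote : oobVotes[n]) tally.merge(vote, 1, Integer::sum);
            String j = null;   // majority class j for case n
            for (Map.Entry<String, Integer> e : tally.entrySet()) {
                if (j == null || e.getValue() > tally.get(j)) j = e.getKey();
            }
            if (!truth[n].equals(j)) wrong++;
        }
        return (double) wrong / truth.length;   // proportion misclassified
    }

    public static void main(String[] args) {
        String[][] votes = {
            {"a", "a", "b"},   // case 0: majority "a"
            {"b", "b"},        // case 1: majority "b"
            {"a"},             // case 2: majority "a", but true class is "b"
            {"b", "a", "b"}    // case 3: majority "b"
        };
        String[] truth = {"a", "b", "b", "b"};
        System.out.println(oobError(votes, truth));  // 1 of 4 wrong -> 0.25
    }
}
```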
+<a name="RandomForestsOtherRFuses"></a>
+### Other RF uses
+
+source: [1](1.html)
+* variable importance
+* gini importance
+* proximities
+* scaling
+* prototypes
+* missing values replacement for the training set
+* missing values replacement for the test set
+* detecting mislabeled cases
+* detecting outliers
+* detecting novelties
+* unsupervised learning
+* balancing prediction error
+
+Please refer to [1](1.html) for a detailed description.
+
+<a name="RandomForestsReferences"></a>
+### References
+
+[1](1.html)
+ Random Forests - Classification/Description
+ [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
+[2](2.html)
+ B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention
+and Profitability by Using Random Forests and Regression Forests
+Techniques," Working Papers of Faculty of Economics and Business
+Administration, Ghent University, Belgium 04/282.
+ Available online: [http://ideas.repec.org/p/rug/rugwps/04282.html](http://ideas.repec.org/p/rug/rugwps/04282.html)
+[3](3.html)
+ Decision Trees - Andrew W. Moore
+ [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
+[4](4.html)
+ Information Gain - Andrew W. Moore
+ [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
Added: mahout/site/trunk/content/recommendationexamples.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/recommendationexamples.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/recommendationexamples.mdtext (added)
+++ mahout/site/trunk/content/recommendationexamples.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,47 @@
+Title: RecommendationExamples
+<a name="RecommendationExamplesIntroduction"></a>
+# Introduction
+
+This quick-start page describes how to run the recommendation examples
+provided by Mahout. Mahout comes with four recommendation mining examples,
+based on the Netflix, Jester, GroupLens, and Book-Crossing data sets
+respectively.
+
+<a name="RecommendationExamplesSteps"></a>
+# Steps
+
+<a name="RecommendationExamplesTestingitononesinglemachine"></a>
+## Testing it on a single machine
+
+In the examples directory type:
+
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.bookcrossing.BookCrossingRecommenderEvaluatorRunner" -Dexec.args="<OPTIONS>"
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.netflix.NetflixRecommenderEvaluatorRunner" -Dexec.args="<OPTIONS>"
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.netflix.TransposeToByUser" -Dexec.args="<OPTIONS>"
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.jester.JesterRecommenderEvaluatorRunner" -Dexec.args="<OPTIONS>"
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="<OPTIONS>"
+
+
+Here, the only command line option needed is:
+
+
+    -i [input file]
+
+
+
+Note that the GroupLens example is designed for the "1 million" data set,
+available at http://www.grouplens.org/node/73. The "input file" above is
+the ratings.dat file contained in the data set's zip file. This file has an
+unusual format and so has a special parser. The example code here can
+easily be modified to use a regular FileDataModel and thus work on more
+standard input, including the other data sets available at this site.
Added: mahout/site/trunk/content/recommenderdocumentation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/recommenderdocumentation.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/recommenderdocumentation.mdtext (added)
+++ mahout/site/trunk/content/recommenderdocumentation.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,394 @@
+Title: Recommender Documentation
+<a name="RecommenderDocumentationOverview"></a>
+## Overview
+
+_This documentation concerns the non-distributed, non-Hadoop-based
+recommender engine / collaborative filtering code inside Mahout. It was
+formerly a separate project called "Taste" and has continued development
+inside Mahout alongside other Hadoop-based code. It may be viewed as a
+somewhat separate, older, more comprehensive and more mature aspect of this
+code, compared to current development efforts focusing on Hadoop-based
+distributed recommenders. This remains the best entry point into Mahout
+recommender engines of all kinds._
+
+A Mahout-based collaborative filtering engine takes users' preferences for
+items ("tastes") and returns estimated preferences for other items. For
+example, a site that sells books or CDs could easily use Mahout to figure
+out, from past purchase data, which CDs a customer might be interested in
+listening to.
+
+Mahout provides a rich set of components from which you can construct a
+customized recommender system from a selection of algorithms. Mahout is
+designed to be enterprise-ready; it's designed for performance, scalability
+and flexibility.
+
+Mahout recommenders are not just for Java; they can be run as an external
+server which exposes recommendation logic to your application via web
+services and HTTP.
+
+Top-level packages define the Mahout interfaces to these key abstractions:
+* DataModel
+* UserSimilarity
+* ItemSimilarity
+* UserNeighborhood
+* Recommender
+
+Subpackages of org.apache.mahout.cf.taste.impl hold implementations of
+these interfaces. These are the pieces from which you will build your own
+recommendation engine. That's it! For the academically inclined, Mahout
+supports *memory-based* and *item-based* recommender systems, *slope one*
+recommenders, and a couple of other experimental implementations. It does
+not currently support *model-based* recommenders.
+
+<a name="RecommenderDocumentationArchitecture"></a>
+## Architecture
+
+![tastearchitecture](attachments/22872433/23003157.png)
+
+This diagram shows the relationship between various Mahout components in a
+user-based recommender. An item-based recommender system is similar except
+that there are no PreferenceInferrers or Neighborhood algorithms involved.
+
+<a name="RecommenderDocumentationRecommender"></a>
+### Recommender
+A Recommender is the core abstraction in Mahout. Given a DataModel, it can
+produce recommendations. Applications will most likely use the
+GenericUserBasedRecommender or GenericItemBasedRecommender implementation,
+possibly decorated by CachingRecommender.
+
+<a name="RecommenderDocumentationDataModel"></a>
+### DataModel
+A DataModel is the interface to information about user preferences. An
+implementation might draw this data from any source, but a database is the
+most likely source. Mahout provides MySQLJDBCDataModel, for example, to
+access preference data from a database via JDBC and MySQL. Another exists
+for PostgreSQL. Mahout also provides a FileDataModel.
+
+There are no abstractions for a user or item in the object model (not
+anymore). Users and items are identified solely by an ID value in the
+framework. Further, this ID value must be numeric; it is a Java long type
+through the APIs. A Preference object or PreferenceArray object
+encapsulates the relation between user and preferred items (or items and
+users preferring them).
+
+Finally, Mahout supports, in various ways, a so-called "boolean" data model
+in which users do not express preferences of varying strengths for items,
+but simply express an association or none at all. For example, while users
+might express a preference from 1 to 5 in the context of a movie
+recommender site, there may be no notion of a preference value between
+users and pages in the context of recommending pages on a web site: there
+is only a notion of an association, or none, between a user and pages that
+have been visited.
+
+<a name="RecommenderDocumentationUserSimilarity"></a>
+### UserSimilarity
+A UserSimilarity defines a notion of similarity between two Users. This is
+a crucial part of a recommendation engine. These are attached to a
+Neighborhood implementation. ItemSimilarities are analogous, but find
+similarity between Items.
+
+<a name="RecommenderDocumentationUserNeighborhood"></a>
+### UserNeighborhood
+In a userbased recommender, recommendations are produced by finding a
+"neighborhood" of similar users near a given user. A UserNeighborhood
+defines a means of determining that neighborhood — for example,
+nearest 10 users. Implementations typically need a UserSimilarity to
+operate.
+
+<a name="RecommenderDocumentationRequirements"></a>
+## Requirements
+<a name="RecommenderDocumentationRequired"></a>
+### Required
+
+* [Java/ J2SE 6.0](http://www.java.com/getjava/index.jsp)
+
+<a name="RecommenderDocumentationOptional"></a>
+### Optional
+* [Apache Maven](http://maven.apache.org)
+ 2.2.1 or later, if you want to build from source or build examples. (Mac
+users note that even OS X 10.5 ships with Maven 2.0.6, which will not
+work.)
+* Mahout web applications require a [Servlet 2.3+](http://java.sun.com/products/servlet/index.jsp)
+ container, such as [Apache Tomcat](http://jakarta.apache.org/tomcat/)
+. It may in fact work with older containers with slight modification.
+
+<a name="RecommenderDocumentationDemo"></a>
+## Demo
+
+To build and run the demo, follow the instructions below, which are written
+for Unix-like operating systems:
+
+* Obtain a copy of the Mahout distribution, either from SVN or as a
+downloaded archive.
+* Download the "1 Million MovieLens Dataset" from [Grouplens.org](http://www.grouplens.org/)
+* Unpack the archive and copy movies.dat and ratings.dat to
+trunk/integration/src/main/resources/org/apache/mahout/cf/taste/example/grouplens
+under the Mahout distribution directory.
+* Navigate to the directory where you unpacked the Mahout distribution, and
+navigate to trunk.
+* Run mvn -DskipTests install, which builds and installs Mahout core to
+your local repository.
+* cd integration
+* You may need to give Maven more memory: in a bash shell, export
+MAVEN_OPTS=-Xmx1024M
+* mvn jetty:run
+* Get recommendations by accessing the web application in your browser:
+http://localhost:8080/mahout-integration/RecommenderServlet?userID=1 This
+will produce a simple preference-item ID list which could be consumed by a
+client application. Get more useful human-readable output with the debug
+parameter:
+http://localhost:8080/mahout-integration/RecommenderServlet?userID=1&debug=true
+
+
+<a name="RecommenderDocumentationExamples"></a>
+## Examples
+<a name="RecommenderDocumentationUserbasedRecommender"></a>
+### Userbased Recommender
+User-based recommenders are the "original", conventional style of
+recommender system. They can produce good recommendations when tweaked
+properly; they are not necessarily the fastest recommender systems and are
+thus suitable for small data sets (roughly, less than ten million ratings).
+We'll start with an example of this.
+
+First, create a DataModel of some kind. Here, we'll use a simple one based
+on data in a file. The file should be in CSV format, with lines of the form
+"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
+
+
+ DataModel model = new FileDataModel(new File("data.txt"));
+
+
+We'll use the PearsonCorrelationSimilarity implementation of UserSimilarity
+as our user correlation algorithm, and add an optional preference inference
+algorithm:
+
+
+ UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
+ // Optional:
+ userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer());
+
+
+Now we create a UserNeighborhood algorithm. Here we use the nearest 3
+users:
+
+
+    UserNeighborhood neighborhood =
+        new NearestNUserNeighborhood(3, userSimilarity, model);
+
+Now we can create our Recommender, and add a caching decorator:
+
+    Recommender recommender =
+        new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+
+
+Now we can get 10 recommendations for user ID "1234", and we're done!
+
+    List<RecommendedItem> recommendations =
+        cachingRecommender.recommend(1234, 10);
+
+
+### Item-based Recommender
+
+We could have created an item-based recommender instead. Item-based
+recommenders base recommendations not on user similarity, but on item
+similarity. In theory these are about the same approach to the problem,
+just from different angles. However, the similarity of two items is
+relatively fixed, more so than the similarity of two users. So, item-based
+recommenders can use precomputed similarity values in the computations,
+which makes them much faster. For large data sets, item-based recommenders
+are more appropriate.
+
+Let's start over, again with a FileDataModel:
+
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+
+We'll also need an ItemSimilarity. We could use
+PearsonCorrelationSimilarity, which computes item similarity in real time,
+but this is generally too slow to be useful. Instead, in a real
+application, you would feed a list of precomputed correlations to a
+GenericItemSimilarity:
+
+
+    // Construct the list of precomputed correlations
+    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
+        ...;
+    ItemSimilarity itemSimilarity =
+        new GenericItemSimilarity(correlations);
+
+
+
+Then we can finish as before to produce recommendations:
+
+<pre><code>
+Recommender recommender =
+    new GenericItemBasedRecommender(model, itemSimilarity);
+Recommender cachingRecommender = new CachingRecommender(recommender);
+...
+List<RecommendedItem> recommendations =
+    cachingRecommender.recommend(1234, 10);
+</code></pre>
+
+
+### Slope-One Recommender
+
+This is a simple yet effective Recommender, and we present another example
+to round out the list:
+
+<pre><code>
+DataModel model = new FileDataModel(new File("data.txt"));
+// Make a weighted slope one recommender
+Recommender recommender = new SlopeOneRecommender(model);
+Recommender cachingRecommender = new CachingRecommender(recommender);
+</code></pre>
+
+
+
+
+<a name="RecommenderDocumentationIntegrationwithyourapplication"></a>
+## Integration with your application
+<a name="RecommenderDocumentationDirect"></a>
+### Direct
+
+You can create a Recommender, as shown above, wherever you like in your
+Java application, and use it. This includes simple Java applications or GUI
+applications, server applications, and J2EE web applications.
+
+<a name="RecommenderDocumentationStandaloneserver"></a>
+### Standalone server
+A Mahout recommender can also be run as an external server, which may be
+the only option for non-Java applications. It can be exposed as a web
+application via org.apache.mahout.cf.taste.web.RecommenderServlet, and your
+application can then access recommendations via simple HTTP requests and
+responses. See above, and see the javadoc for details.
+
+<a name="RecommenderDocumentationPerformance"></a>
+## Performance
+<a name="RecommenderDocumentationRuntimePerformance"></a>
+### Runtime Performance
+The more data you give, the better. Though Mahout is designed for
+performance, you will undoubtedly run into performance issues at some
+point. For best results, consider passing the following command-line flags
+to your JVM:
+
+* -server: Enables the server VM, which is generally appropriate for
+long-running, computation-intensive applications.
+* -Xms1024m -Xmx1024m: Make the heap as big as possible; a gigabyte
+doesn't hurt when dealing with tens of millions of preferences. Mahout
+recommenders will generally use as much memory as you give them for
+caching, which helps performance. Set the initial and max size to the same
+value to avoid wasting time growing the heap, and to avoid having the JVM
+run minor collections to avoid growing the heap, which will clear cached
+values.
+* -da -dsa: Disable all assertions.
+* -XX:NewRatio=9: Increase the heap allocated to 'old' objects, which is
+most of them in this framework.
+* -XX:+UseParallelGC -XX:+UseParallelOldGC (multiprocessor machines only):
+Use a GC algorithm designed to take advantage of multiple processors, and
+designed for throughput. This is a default in J2SE 5.0.
+* -XX:+DisableExplicitGC: Disable calls to System.gc(). These calls can
+only hurt in the presence of modern GC algorithms; they may force Mahout to
+remove cached data needlessly. This flag isn't needed if you're sure your
+code and third-party code you use doesn't call this method.
+
+Also consider the following tips:
+
+* Use CachingRecommender on top of your custom Recommender implementation.
+* When using JDBCDataModel, make sure you've taken basic steps to optimize
+the table storing preference data. Create a primary key on the user ID and
+item ID columns, and an index on them. Set them to be non-null. And so on.
+Tune your database for lots of concurrent reads! When using JDBC, the
+database is almost always the bottleneck. Plenty of memory and caching are
+even more important.
+* Also, pooling database connections is essential to performance. If using
+a J2EE container, it probably provides a way to configure connection pools.
+If you are creating your own DataSource directly, try wrapping it in
+org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource
+* See MySQLspecific notes on performance in the javadoc for
+MySQLJDBCDataModel.
+
+<a name="RecommenderDocumentationAlgorithmPerformance:WhichOneIsBest?"></a>
+### Algorithm Performance: Which One Is Best?
+There is no right answer; it depends on your data, your application,
+environment, and performance needs. Mahout provides the building blocks
+from which you can construct the best Recommender for your application. The
+links below provide research on this topic. You will probably need a bit of
+trialanderror to find a setup that works best. The code sample above
+provides a good starting point.
+
+Fortunately, Mahout provides a way to evaluate the accuracy of your
+Recommender on your own data, in org.apache.mahout.cf.taste.eval:
+
+
+ DataModel myModel = ...;
+ RecommenderBuilder builder = new RecommenderBuilder() {
+ public Recommender buildRecommender(DataModel model) {
+ // build and return the Recommender to evaluate here
+ }
+ };
+ RecommenderEvaluator evaluator =
+ new AverageAbsoluteDifferenceRecommenderEvaluator();
+ double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
+
+
+For "boolean" data model situations, where there are no notions of
+preference value, the above evaluation based on estimated preference does
+not make sense. In this case, try this kind of evaluation, which presents
+traditional information retrieval figures like precision and recall, which
+are more meaningful:
+
+
+    ...
+    RecommenderIRStatsEvaluator evaluator =
+        new GenericRecommenderIRStatsEvaluator();
+    IRStatistics stats =
+        evaluator.evaluate(builder, null, myModel, null, 3,
+            RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
+            1.0);
+
+
+
+<a name="RecommenderDocumentationUsefulLinks"></a>
+## Useful Links
+You'll want to look at these packages too, which offer more algorithms and
+approaches that you may find useful:
+
+* [Cofi](http://www.nongnu.org/cofi/)
+: A Java-Based Collaborative Filtering Library
+* [CoFE](http://eecs.oregonstate.edu/iis/CoFE/)
+
+Here's a handful of research papers that I've read and found particularly
+useful:
+
+J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive Algorithms for Collaborative Filtering](http://research.microsoft.com/research/pubs/view.aspx?tr_id=166)
+," in Proceedings of the Fourteenth Conference on Uncertainity in
+Artificial Intelligence (UAI 1998), 1998.
+
+B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Itembased collaborative filtering recommendation algorithms](http://www10.org/cdrom/papers/519/)
+" in Proceedings of the Tenth International Conference on the World Wide
+Web (WWW 10), pp. 285-295, 2001.
+
+P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an open architecture for collaborative filtering of netnews](http://doi.acm.org/10.1145/192844.192905)
+" in Proceedings of the 1994 ACM conference on Computer Supported
+Cooperative Work (CSCW 1994), pp. 175-186, 1994.
+
+J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic framework for performing collaborative filtering](http://www.grouplens.org/papers/pdf/algs.pdf)
+" in Proceedings of the 22nd annual international ACM SIGIR Conference on
+Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
+1999.
+
+Clifford Lyon, "[Movie Recommender](http://materialobjects.com/cf/MovieRecommender.pdf)
+" CSCI E280 final project, Harvard University, 2004.
+
+Daniel Lemire, Anna Maclachlan, "[Slope One Predictors for Online RatingBased Collaborative Filtering](http://www.daniellemire.com/fr/abstracts/SDM2005.html)
+," Proceedings of SIAM Data Mining (SDM '05), 2005.
+
+Michelle Anderson, Marcel Ball, Harold Boley, Stephen Greene, Nancy Howse, Daniel Lemire and Sean McGrath, "[RACOFI: A RuleApplying Collaborative Filtering System](http://www.daniellemire.com/fr/documents/publications/racofi_nrc.pdf)
+"," Proceedings of COLA '03, 2003.
+
+These links will take you to all the collaborative filtering reading you
+could ever want!
+* [Paul Perry's notes](http://www.paulperry.net/notes/cf.asp)
+* [James Thornton's collaborative filtering resources](http://jamesthornton.com/cf/)
+* [Daniel Lemire's blog](http://www.daniellemire.com/blog/)
+ which frequently covers collaborative filtering topics
Added: mahout/site/trunk/content/recommenderfirsttimerfaq.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/recommenderfirsttimerfaq.mdtext?rev=1360593&view=auto
==============================================================================
 mahout/site/trunk/content/recommenderfirsttimerfaq.mdtext (added)
+++ mahout/site/trunk/content/recommenderfirsttimerfaq.mdtext Thu Jul 12 09:25:54 2012
@@ 0,0 +1,50 @@
+Title: Recommender FirstTimer FAQ
+Many people with an interest in recommenders arrive at Mahout because they
+are building a first recommender system. Some starting questions have been
+asked often enough to warrant a FAQ collecting advice and rules of thumb
+for newcomers.
+
+For the interested, these topics are treated in detail in the book [Mahout in Action](http://manning.com/owen/)
+.
+
+Don't start with a distributed, Hadoop-based recommender; take on that
+complexity only if necessary. Start with non-distributed recommenders. They
+are simpler, have fewer requirements, and are more flexible.
+
+As a crude rule of thumb, a system with up to 100M user-item associations
+(ratings, preferences) should "fit" onto one modern server machine with 4GB
+of heap available and run acceptably as a real-time recommender. The system
+is invariably memory-bound, since keeping data in memory is essential to
+performance.
+
+Beyond this point it gets expensive to deploy a machine with enough RAM,
+so designing for a distributed system makes sense when nearing this scale.
+However, most applications don't "really" have 100M associations to process.
+Data can be sampled; noisy and old data can often be aggressively pruned
+without significant impact on the result.
+
+The next question is whether your system has preference values, or
+ratings. Do users and items merely have an association or not, such as the
+existence or lack of a click? Or is behavior translated into some scalar
+value representing the user's degree of preference for the item?
+
+If you have ratings, then a good place to start is
+GenericItemBasedRecommender with PearsonCorrelationSimilarity as the
+similarity metric. If you don't have ratings, then a good place to start is
+GenericBooleanPrefItemBasedRecommender with LogLikelihoodSimilarity.
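To make the distinction concrete, here is a toy, pure-Python sketch of the idea behind item-based recommendation with Pearson correlation. This is not the Mahout API; the `ratings` data and function names are made up for illustration:

```python
import math

# Toy ratings: user -> {item: rating}. Hypothetical data for illustration.
ratings = {
    "u1": {"A": 5.0, "B": 3.0, "C": 4.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"A": 1.0, "B": 5.0, "C": 2.0},
}

def pearson(item1, item2):
    """Pearson correlation over users who rated both items."""
    common = [u for u in ratings if item1 in ratings[u] and item2 in ratings[u]]
    if len(common) < 2:
        return 0.0
    x = [ratings[u][item1] for u in common]
    y = [ratings[u][item2] for u in common]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def estimate(user, item):
    """Similarity-weighted average of the user's ratings for similar items."""
    num = den = 0.0
    for other, r in ratings[user].items():
        if other == item:
            continue
        s = pearson(item, other)
        if s > 0:
            num += s * r
            den += s
    return num / den if den else None
```

Mahout's GenericItemBasedRecommender applies the same idea at scale, with pluggable similarity metrics.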
+
+If you want to do content-based item-item similarity, you need to implement
+your own ItemSimilarity.
+
+If your data can simply be exported to a CSV file, use FileDataModel and
+push new files periodically.
+If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
+counterpart if appropriate, or its PostgreSQL counterpart, etc.) and wrap it
+in a ReloadFromJDBCDataModel.
+
+This should give you a reasonable starter system which responds quickly. By
+its nature, new data comes in from the file or database only
+periodically, perhaps on the order of minutes. If that's not OK,
+you'll have to look into some more specialized work: SlopeOneRecommender
+deals with updates quickly, or it is possible to do some work to update
+the GenericDataModel in real time.
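For reference, the Slope One idea behind SlopeOneRecommender (Lemire and Maclachlan, 2005) can be sketched in a few lines of plain Python. This is a toy illustration, not the Mahout implementation, and the data is made up; updates are cheap because the difference table can be adjusted incrementally as ratings arrive:

```python
# Slope One: precompute the average rating difference between every pair of
# items, then predict rating(u, j) as the count-weighted average over items i
# rated by u of rating(u, i) + diff(j, i).
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 2.0},
    "bob":   {"A": 3.0, "B": 4.0},
    "carol": {"B": 2.0, "C": 5.0},
}

def average_diffs(ratings):
    sums, counts = {}, {}
    for prefs in ratings.values():
        for i, ri in prefs.items():
            for j, rj in prefs.items():
                if i == j:
                    continue
                sums[(i, j)] = sums.get((i, j), 0.0) + (ri - rj)
                counts[(i, j)] = counts.get((i, j), 0) + 1
    return {k: sums[k] / counts[k] for k in sums}, counts

def predict(user, item, diffs, counts, ratings):
    num = den = 0.0
    for i, r in ratings[user].items():
        if (item, i) in diffs:
            c = counts[(item, i)]
            num += (r + diffs[(item, i)]) * c
            den += c
    return num / den if den else None
```
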
Added: mahout/site/trunk/content/referencereading.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/referencereading.mdtext?rev=1360593&view=auto
==============================================================================
 mahout/site/trunk/content/referencereading.mdtext (added)
+++ mahout/site/trunk/content/referencereading.mdtext Thu Jul 12 09:25:54 2012
@@ 0,0 +1,216 @@
+Title: Reference Reading
+<a name="ReferenceReadingGeneralClustering"></a>
+# General Clustering
+
+<a name="ReferenceReadingDiscussions"></a>
+## Discussions
+
+* [clustering tips and tricks ](http://www.lucidimagination.com/search/document/1c3561d17fc1b81c/clustering_techniques_tips_and_tricks)
+
+<a name="ReferenceReadingTextClustering"></a>
+# Text Clustering
+
+<a name="ReferenceReadingClusteringaspartofSearch"></a>
+## Clustering as part of Search
+
+* See the chapters on hierarchical and flat clustering as part of search in the [Stanford IR book](http://nlp.stanford.edu/IRbook/html/htmledition/irbook.html)
+
+<a name="ReferenceReadingGeneralBackgroundMaterials"></a>
+# General Background Materials
+
+[Q:](http://mailarchives.apache.org/mod_mbox/mahoutuser/201103.mbox/%3CAANLkTi=c8kGHjvbfTCt0GyjKvvMCq=EaFsCr=_s=ykDE@mail.gmail.com%3E)
+Can someone recommend me good books on statistics, and also on linear
+algebra and analytic geometry, which will provide enough background for
+understanding machine learning algorithms?
+
+<a name="ReferenceReading"></a>
+## Answers
+
+The answers below focus on general background knowledge, rather than
+specifics of Mahout and associated Apache tooling. Feel free to add useful
+resources (books, but also videos, online courseware, tools), particularly
+those that are available free online.
+
+This page originated in an email thread, and its different contributors
+might not all agree on the best approach (and they might not know what's
+best for any given learner), but the resources here should give some idea
+of suitable background reading. Check the mailing list [archives](http://mailarchives.apache.org/mod_mbox/mahoutuser/) if you care
+to figure out who said what, or to find other suggestions.
+
+Don't be overwhelmed by all the maths; you can do a lot in Mahout with some
+basic knowledge. The resources given here will help you understand your
+data better, and ask better questions both of Mahout's APIs and of
+the Mahout community. And unlike learning some particular software tool,
+these are skills that will remain useful decades later.
+
+### Books and supporting materials on statistics, machine learning, matrices etc.
+
+[Gilbert Strang](http://wwwmath.mit.edu/~gs)'s [Introduction to Linear Algebra](http://math.mit.edu/linearalgebra/) (*full text* online, highly recommended by several on the Mahout list) ([openlibrary](http://openlibrary.org/works/OL3285486W/Introduction_to_linear_algebra)).
+His lectures are also [available online](http://web.mit.edu/18.06/www/) and are strongly recommended. See [http://ocw.mit.edu/courses/mathematics/1806linearalgebraspring2010/](http://ocw.mit.edu/courses/mathematics/1806linearalgebraspring2010/).
+
+
+"Mathematical Tools for Applied Mulitvariate Analysis" by J.Douglass
+Carroll.
+([amazon](http://www.amazon.com/MathematicalToolsAppliedMultivariateAnalysis/dp/0121609553/ref=sr_1_1?ie=UTF8&qid=1299602805&sr=81)
+)
+
+
+
+[Stanford Machine Learning online courseware](http://www.stanford.edu/class/cs229/) (cs229.stanford.edu):
+
+"It's a very nicely taught course with super helpful lecture notes, and
+you can get all the videos on YouTube or [iTunesU](http://itunes.apple.com/itunesu/machinelearning/id384233048)."
+
+"The [section notes](http://www.stanford.edu/class/cs229/materials.html) for this course will give you enough review material on linear algebra and
+probability theory to get you going."
+
+[MIT Machine Learning online courseware](http://ocw.mit.edu/courses/electricalengineeringandcomputerscience/6867machinelearningfall2006/) (6.867) has [lecture notes in PDF](http://ocw.mit.edu/courses/electricalengineeringandcomputerscience/6867machinelearningfall2006/lecturenotes/) online.
+
+As a prerequisite to probability and statistics, you'll need [basic calculus](http://en.wikipedia.org/wiki/Calculus).
+A maths-for-scientists text might be useful here, such as "Mathematics
+for Engineers and Scientists", Alan Jeffrey, Chapman & Hall/CRC
+([openlibrary](http://openlibrary.org/books/OL3305993M/Mathematics_for_engineers_and_scientists)).
+
+
+One of the best writers in the probability/statistics world is Sheldon
+Ross. Try "A First Course in Probability (8th Edition)", Pearson
+([amazon](http://www.pearsonhighered.com/educator/product/FirstCourseinProbabilityA/9780136033134.page))
+and then move on to his "Introduction to Probability Models (9th Edition)",
+Academic Press ([amazon](http://www.amazon.com/IntroductionProbabilityModelsSixthSheldon/dp/0125984707)).
+
+
+Some good introductory alternatives here are:
+
+[Khan Academy](http://www.khanacademy.org/): videos on stats, probability, and linear algebra
+
+Probability and Statistics (7th Edition), Jay L. Devore, Chapman
+([amazon](http://www.amazon.com/ProbabilityStatisticsEngineeringSciencesInfoTrac/dp/0534399339)).
+
+Probability and Statistical Inference (7th Edition), Hogg and Tanis, Pearson
+([amazon](http://www.amazon.com/ProbabilityStatisticalInferenceRobertHogg/dp/0132546086)).
+
+Once you have a grasp of the basics, there is a slew of great texts
+that you might consult, for example:
+
+Statistical Inference, Casella and Berger, Duxbury/Thomson Learning
+([amazon](http://www.amazon.com/StatisticalInferenceGeorgeCasella/dp/0534243126)).
+
+Most statistics books will have some sort of introduction to Bayesian
+methods; for more depth, consider a specialist text, e.g.:
+
+Introduction to Bayesian Statistics (2nd Edition), William H. Bolstad, Wiley
+([amazon](http://www.amazon.com/IntroductionBayesianStatisticsWilliamBolstad/dp/0471270202)).
+
+Then, for the computational side of Bayesian statistics (predominantly
+Markov chain Monte Carlo), e.g.:
+Bolstad's Understanding Computational Bayesian Statistics, Wiley
+([amazon](http://www.amazon.com/UnderstandingComputationalBayesianStatisticsWiley/dp/0470046090)).
+
+Then you might try [Bayesian Data Analysis, Gelman et al., Chapman & Hall/CRC](http://www.stat.columbia.edu/~gelman/book/).
+
+On top of the books, [R](http://en.wikipedia.org/wiki/R_(programming_language)) is an indispensable software tool for
+visualizing distributions and doing calculations.
+
+
+
+(another viewpoint)
+
+For statistics related to machine learning, I would avoid normal
+statistical texts and go with these instead:
+
+[Pattern Recognition and Machine Learning by Chris Bishop](http://research.microsoft.com/enus/um/people/cmbishop/PRML/index.htm)
+
+[Elements of Statistical Learning](http://wwwstat.stanford.edu/~tibs/ElemStatLearn/) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
+
+What about matrix computations/decompositions/factorizations, etc.?
+[How's this one?](http://www.amazon.com/gp/product/0801854148/ref=s9_simh_gw_p14_d0_i1?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center3&pf_rd_r=0ESQ3KDY8MJ1AWWG8PFR&pf_rd_t=101&pf_rd_p=470938811&pf_rd_i=507846)
+Any idea? Any other suggestions?
+
+I found the one by Peter V. O'Neil, "Introduction to Linear Algebra", to be
+a great book for beginners (with some knowledge of calculus). It is not
+comprehensive, but I believe it is a good place to start; the author
+explains the concepts with regard to vector spaces, which I found to be a
+more natural way of explaining. [http://www.amazon.com/IntroductionLinearAlgebraTheoryApplications/dp/053400606X](http://www.amazon.com/IntroductionLinearAlgebraTheoryApplications/dp/053400606X)
+
+David S. Watkins "Fundamentals of Matrix Computations (Pure and Applied
+Mathematics: A Wiley Series of Texts, Monographs and Tracts)"
+[http://www.amazon.com/FundamentalsMatrixComputationsAppliedMathematics/dp/0470528338/](http://www.amazon.com/FundamentalsMatrixComputationsAppliedMathematics/dp/0470528338/)
+
+
+
+The Golub / Van Loan text you mention is the classic text for numerical
+linear algebra. Can't go wrong with it. However, I'd also suggest you look
+at Nick Trefethen's "Numerical Linear Algebra". It's a bit more
+approachable for practitioners; GVL is better suited for researchers.
+[http://people.maths.ox.ac.uk/trefethen/books.html](http://people.maths.ox.ac.uk/trefethen/books.html)
+[http://people.maths.ox.ac.uk/trefethen/text.html](http://people.maths.ox.ac.uk/trefethen/text.html) (with some online lecture notes)
+
+
+I think this is the most relevant book for matrix math on distributed
+systems:
+
+[http://www.amazon.com/NumericalLinearAlgebraLloydTrefethen/dp/0898713617](http://www.amazon.com/NumericalLinearAlgebraLloydTrefethen/dp/0898713617)
+It has many chapters on SVD, and there are even chapters on Lanczos.
+
+
+By the way, what about R? There are literally tons of books in the R series
+devoted to rather isolated problems, but what would be a good crash-course
+book?
+
+
+Ted Dunning:
+
+"I have found that learning about R is a difficult thing. The best
+introduction I have seen is, paradoxically, not really a book about R and
+assumes a statistical mindset that I disagree with. That introduction is
+in MASS [http://www.stats.ox.ac.uk/pub/MASS4/](http://www.stats.ox.ac.uk/pub/MASS4/)
+. Other references also
+exist:
+
+[http://www.rtutor.com/rintroduction](http://www.rtutor.com/rintroduction)
+[http://cran.rproject.org/doc/manuals/Rintro.pdf](http://cran.rproject.org/doc/manuals/Rintro.pdf)
+[http://faculty.washington.edu/tlumley/Rcourse/](http://faculty.washington.edu/tlumley/Rcourse/)
+
+In addition, you should see how to plot data well:
+
+[http://www.statmethods.net/advgraphs/trellis.html](http://www.statmethods.net/advgraphs/trellis.html)
+[http://had.co.nz/ggplot2/](http://had.co.nz/ggplot2/)
+
+Generally, I learn more about R by watching people and reading code than by
+reading books. There are many small tricks, like how to format data
+optimally, how to restructure data.frames, common ways to plot data, which
+libraries do what, and so on; an introductory book cannot convey the general
+principles that will see you through to success."
+
+For Javascript/Web plotting: [http://www.1stwebdesigner.com/css/topjquerychartlibrariesinteractivecharts/](http://www.1stwebdesigner.com/css/topjquerychartlibrariesinteractivecharts/)
Added: mahout/site/trunk/content/restrictedboltzmannmachines.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/restrictedboltzmannmachines.mdtext?rev=1360593&view=auto
==============================================================================
 mahout/site/trunk/content/restrictedboltzmannmachines.mdtext (added)
+++ mahout/site/trunk/content/restrictedboltzmannmachines.mdtext Thu Jul 12 09:25:54 2012
@@ 0,0 +1,43 @@
+Title: Restricted Boltzmann Machines
+NOTE: This implementation is a work in progress, at least until September
+2010.
+
+The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT375).
+
+<a name="RestrictedBoltzmannMachinesBoltzmannMachines"></a>
+### Boltzmann Machines
+Boltzmann Machines are a type of stochastic neural network that closely
+resembles physical processes. They define a network of units with an overall
+energy that is evolved over a period of time until it reaches thermal
+equilibrium.
+
+However, the convergence speed of Boltzmann machines that have
+unconstrained connectivity is low.
+
+<a name="RestrictedBoltzmannMachinesRestrictedBoltzmannMachines"></a>
+### Restricted Boltzmann Machines
+Restricted Boltzmann Machines are a variant that is 'restricted' in the
+sense that connections between hidden units of a single layer are _not_
+allowed. In addition, stacking multiple RBMs is feasible, with the
+activities of the hidden units forming the base for a higher-level RBM. The
+combination of these two features makes RBMs highly amenable to
+parallelization.
+
+In the Netflix Prize, RBM's offered distinctly orthogonal predictions to
+SVD and kNN approaches, and contributed immensely to the final solution.
+
+<a name="RestrictedBoltzmannMachinesRBM'sinApacheMahout"></a>
+### RBM's in Apache Mahout
+An implementation of Restricted Boltzmann Machines is being developed for
+Apache Mahout as a Google Summer of Code 2010 project. A recommender
+interface will also be provided. The key aims of the implementation are:
+1. Accuracy: it should replicate known results, including those of the
+Netflix Prize
+1. Speed: the implementation uses MapReduce, so it should be fast
+1. Scale: it should scale to large datasets, with a design whose critical
+parts don't require a dependency between the amount of memory on your
+cluster machines and the size of your dataset
+
+You can view the patch as it develops [here](http://github.com/sisirkoppaka/mahoutrbm/compare/trunk...rbm).
Added: mahout/site/trunk/content/rowsimilarityjob.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/rowsimilarityjob.mdtext?rev=1360593&view=auto
==============================================================================
 mahout/site/trunk/content/rowsimilarityjob.mdtext (added)
+++ mahout/site/trunk/content/rowsimilarityjob.mdtext Thu Jul 12 09:25:54 2012
@@ 0,0 +1,35 @@
+Title: RowSimilarityJob
+A brief description of RowSimilarityJob:
+
+(originally from the [mailing list](http://mailarchives.apache.org/mod_mbox/mahoutuser/201202.mbox/browser))
+
+The goal is to compute all pairwise similarities between the rows of a
+sparse matrix A.
+
+The computation should be executed in a way that only rows that have at
+least one nonzero value in the same dimension (column) are compared. We
+need this to avoid a quadratic number of pairwise comparisons.
+Furthermore, we should be able to 'embed' arbitrary similarity measures,
+and we should always be able to use a combiner in all MapReduce steps.
+
+The computation is executed using three MapReduce passes:
+
+In the first step, the rows of A are preprocessed via
+VectorSimilarityMeasure.normalize() (they could, e.g., be binarized or
+scaled to unit length), a single number for each row of A is computed
+via VectorSimilarityMeasure.norm() (e.g. the L1 norm), and A' is formed.
+
+The second step operates on the rows of A' (the columns of A). The
+mapper sums up all pairwise cooccurrences using
+VectorSimilarityMeasure.aggregate() (as vectors, thereby using the
+so-called 'stripes' pattern). The reducer sums up all cooccurrence vectors
+for one row and uses the similarity measure and the precomputed numbers
+from step one to compute all similarities via
+VectorSimilarityMeasure.similarity().
+
+The third step ensures that only the top k similar rows per row are kept.
+
+It's hard to see from the code, but the job actually performs the matrix
+multiplication AA' via outer products with a modified
+(similarity-measure-specific) dot product.
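
The overall effect of the three passes can be sketched in-memory. This toy Python version (not the Mahout code) compares only rows that co-occur in at least one column, iterating column by column in the spirit of the 'stripes' pattern, and uses cosine as a stand-in similarity measure:

```python
import math
from collections import defaultdict

def row_similarities(rows):
    """rows: list of dicts {column: value} representing a sparse matrix A.
    Returns cosine similarity for every pair of rows sharing a column."""
    norms = [math.sqrt(sum(v * v for v in r.values())) for r in rows]
    # Invert to columns: column -> [(row_index, value)]
    cols = defaultdict(list)
    for i, r in enumerate(rows):
        for c, v in r.items():
            cols[c].append((i, v))
    # Accumulate dot products only for pairs co-occurring in some column.
    dots = defaultdict(float)
    for entries in cols.values():
        for a in range(len(entries)):
            for b in range(a + 1, len(entries)):
                (i, vi), (j, vj) = entries[a], entries[b]
                dots[(i, j)] += vi * vj
    return {p: d / (norms[p[0]] * norms[p[1]]) for p, d in dots.items()}
```

Pairs of rows with no shared column never appear in the output, which is exactly how the quadratic number of comparisons is avoided.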
Added: mahout/site/trunk/content/sampleclustersanimation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/sampleclustersanimation.mdtext?rev=1360593&view=auto
==============================================================================
 mahout/site/trunk/content/sampleclustersanimation.mdtext (added)
+++ mahout/site/trunk/content/sampleclustersanimation.mdtext Thu Jul 12 09:25:54 2012
@@ 0,0 +1,53 @@
+Title: Sample Clusters Animation
+<a name="SampleClustersAnimationDemoAnimation"></a>
+# Demo Animation
+
+This is an animation made from screen caps of all the
+o.a.m.clustering.display.Display*.java demo apps.
+
+![animation](attachments/27825557/28016730.gif)
+
+All of the programs used the same set of random samples. The ellipses show
+the different clustering algorithms. For more details, see the programs.
+
+_This animation was made with [giftedmotion.jar](http://www.onyxbits.de/giftedmotion). What a border!_
+<a name="SampleClustersAnimationScreenCaptures"></a>
+# Screen Captures
+<a name="SampleClustersAnimationDisplayClustering"></a>
+## DisplayClustering
+The original random dataset with semi-illustrative ellipses.
+
+![Clustering](attachments/27825557/28016732.png)
+
+<a name="SampleClustersAnimationDisplayCanopy.java"></a>
+## DisplayCanopy.java
+![Canopy](attachments/27825557/28016731.png)
+<a name="SampleClustersAnimationDisplayKMeans.java"></a>
+## DisplayKMeans.java
+K-means algorithm with _significance_ set to 5% or better.
+
+![KMeans_5%min](attachments/27825557/28016735.png)
+<a name="SampleClustersAnimationDisplayDirichlet.java"></a>
+## DisplayDirichlet.java
+Dirichlet Process algorithm, based on a normal distribution, with
+_significance_ set to 5% or better.
+
+![DirichletProcess_NormalDistribution_5%min](attachments/27825557/28016733.png)
+
+<a name="SampleClustersAnimationDisplayFuzzyKMeans.java"></a>
+## DisplayFuzzyKMeans.java
+
+![FuzzyKMeans_5%min](attachments/27825557/28016734.png)
+<a name="SampleClustersAnimationDisplayMeanShift.java"></a>
+## DisplayMeanShift.java
+MeanShift variant of Canopy algorithm, with significance set to 2%. (In the
+code it seems like this should be 5%?)
+![MeanShiftCanopy_2%min](attachments/27825557/28016736.png)
+
+<a name="SampleClustersAnimationDisplaySpectralKMeans.java"></a>
+## DisplaySpectralKMeans.java
+When this works. And when someone wants to redo the animation.
+
+
+
Added: mahout/site/trunk/content/spectralclustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/spectralclustering.mdtext?rev=1360593&view=auto
==============================================================================
 mahout/site/trunk/content/spectralclustering.mdtext (added)
+++ mahout/site/trunk/content/spectralclustering.mdtext Thu Jul 12 09:25:54 2012
@@ 0,0 +1,262 @@
+Title: Spectral Clustering
+Spectral clustering, a more powerful and specialized algorithm than
+k-means, derives its name from the spectral analysis of a graph, which is
+how the data are represented. Each object to be clustered can initially be
+represented as an _n_-dimensional numeric vector, but the difference with
+this algorithm is that there must also be some method for performing a
+comparison between each pair of objects and expressing this comparison as a scalar.
+
+This _n_ by _n_ comparison of all objects with all others forms the
+_affinity_ matrix, which can intuitively be thought of as a rough
+representation of an underlying undirected, weighted, and fully connected
+graph whose edges express the relative relationships, or affinities,
+between each pair of objects in the original data. This affinity matrix
+forms the basis on which the two spectral clustering algorithms operate.
+
+The equation by which the affinities are calculated can vary depending on
+the user's circumstances; typically, the equation takes the form of:
+
+exp(-_d_^2 / _c_)
+
+where _d_ is the Euclidean distance between a pair of points and _c_ is a
+scaling factor. _c_ is often calculated relative to a _k_-neighborhood of
+closest points to the current point; all other affinities are set to 0
+outside of the neighborhood. Again, this formula can vary depending on the
+situation (e.g. a fully connected graph would ignore the _k_-neighborhood
+and calculate affinities for all pairs of points).
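
As a small illustration, a fully connected affinity matrix built with the Gaussian kernel exp(-d^2/c) might be computed like this (a toy Python sketch; the choice of scaling factor c is arbitrary here):

```python
import math

def affinity_matrix(points, c=2.0):
    """Fully connected affinity matrix over n-dimensional points,
    using exp(-d^2 / c) where d is Euclidean distance."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-affinity conventionally left at 0
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = math.exp(-d2 / c)
    return A
```

Note the result is symmetric by construction, which is what the Mahout algorithms below assume.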
+
+[Full overview on spectral clustering](http://spectrallyclustered.wordpress.com/2010/05/27/introandspectralclustering101/)
+
+<a name="SpectralClusteringKMeansSpectralClustering"></a>
+## KMeans Spectral Clustering
+
+<a name="SpectralClusteringOverview"></a>
+### Overview
+
+This consists of a few basic steps of generalized spectral clustering,
+followed by standard k-means clustering over the intermediate results.
+Again, this process begins with an affinity matrix *A*; whether or not it
+is fully connected depends on the user's need.
+
+*A* is then transformed into a pseudo-Laplacian matrix via a multiplication
+with a diagonal matrix whose entries consist of the sums of the rows of
+*A*. The sums are modified to be the inverse square root of their original
+values. The final operation looks something like:
+
+L = D^{-1/2} A D^{-1/2}
+
+*L* has some properties that are of interest to us; most importantly, while
+it is symmetric like *A*, it has a more stable eigendecomposition. *L* is
+decomposed into its constituent eigenvectors and corresponding eigenvalues
+(though the latter will not be needed for future calculations); the matrix
+of eigenvectors, *U*, is what we are now interested in.
+
+Assuming *U* is a column matrix (the eigenvectors comprise the columns),
+we will now use the _rows_ of *U* as proxy data for the original data
+points. We run each row through standard k-means clustering, and the
+label that each proxy point receives is transparently assigned to the
+corresponding original data point, resulting in the final clustering
+assignments.
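
The normalization step above can be sketched in plain Python (illustration only, not the Mahout code):

```python
import math

def pseudo_laplacian(A):
    """Given affinity matrix A, form D from the row sums and
    compute L = D^{-1/2} A D^{-1/2}."""
    d = [sum(row) for row in A]  # diagonal of D
    inv_sqrt = [1.0 / math.sqrt(x) if x > 0 else 0.0 for x in d]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]
```

The eigenvectors of L would then be computed (Mahout uses the distributed Lanczos solver for this) and the rows of the eigenvector matrix clustered with k-means.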
+
+[Full overview on kmeans spectral clustering](http://spectrallyclustered.wordpress.com/2010/06/05/sprint1kmeansspectralclustering/)
+
+<a name="SpectralClusteringImplementation"></a>
+### Implementation
+
+The Mahout implementation consists of a single driver,
+SpectralKMeansDriver, calling upon several common utilities. The driver
+performs the operations in sequential fashion: reading in and constructing
+the affinity matrix, building the diagonal matrix, building the
+pseudo-Laplacian and decomposing it, and clustering the components.
+
+The affinity matrix input is the most important part of this algorithm. It
+consists of text files which follow a specific format: that of a weighted,
+undirected graph. To represent a graph in text files, each line of
+a text file represents a single directional edge between two nodes. There
+are three comma-separated values on the line. The first number indicates
+the source node, the second is the destination node, and the third is the
+weight. For example:
+
+0, 1, 2.5
+
+would indicate that the directional edge from node 0 to node 1 has a weight
+of 2.5. *Please note: as of 8/16/2010, Eigencuts assumes the affinity matrix
+is symmetric, hence there should be a corresponding line in the text file
+of: 1, 0, 2.5.* Also, each node should be an integer value.
+
+*Note*: per MAHOUT-978, don't name your input file with a leading '_' (or
+'.').
+
+M/R jobs written for SpectralKMeans:
+* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
+* MatrixDiagonalizeJob (constructs the diagonal matrix)
+* UnitVectorizerJob (converts the eigenvector matrix *U* to unit rows)
+* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
+
+M/R jobs already in Mahout that were used:
+* DistributedRowMatrix.transpose()
+* DistributedLanczosSolver
+* EigenVerifierJob
+* KMeansDriver
+
+<a name="SpectralClusteringEigencutsSpectralClustering"></a>
+## Eigencuts Spectral Clustering
+
+<a name="SpectralClusteringOverview"></a>
+### Overview
+
+Intuitively, Eigencuts can be thought of as part 2 of the k-means
+algorithm, in that it performs the same initial steps up until the k-means
+clustering. The algorithm uses the same affinity matrix *A*, constructs the
+same diagonal matrix *D*, performs the same multiplication to create the
+pseudo-Laplacian *L*, and conducts an eigendecomposition on *L* to obtain
+the eigenvectors and eigenvalues. But here is where the two algorithms
+diverge.
+
+For each eigenvector, we wish to determine how stable its flow of
+probability is within the underlying graph of the original data.
+Intuitively, this is intrinsically related to the min-cut, max-flow problem
+of finding bottlenecks: if we perturb the flow rate on a specific edge, and
+overall the flow is stable, then we can conclude that this edge was not a
+bottleneck. If, however, perturbing an edge significantly alters the
+overall flow, we know this edge's eigenflow is very unstable and it is a
+bottleneck.
+
+We have an [explicit form](http://spectrallyclustered.files.wordpress.com/2010/07/sensitivityequation.png) of this "sensitivity" calculation
+([full post here](http://spectrallyclustered.wordpress.com/2010/07/15/sprint3threelastmrtasks/), under "computing sensitivities").
+The next step is called "non-maximal suppression", which effectively means
+we will ignore any of the calculated sensitivities for which there exists
+another more negative sensitivity in the same neighborhood as the current
+one, effectively "suppressing" it.
+
+This non-maximal suppression then plays a role in the final
+affinity-cutting step, where we "cut" the affinity (set it to 0) between any
+two nodes (effectively destroying the edge between them) for which the
+sensitivity calculated at that edge passes some threshold, _and_ for which
+the sensitivity was _not_ suppressed in the previous step.
+
+Once the cutting has been completed, the process loops upon itself
+(starting with the recalculation of *D* using the modified *A*) until no
+new cuts in *A* are made in the final step.
+
+[Full overview on Eigencuts spectral clustering](http://spectrallyclustered.wordpress.com/2010/07/06/sprint3introductiontoeigencuts/)
+
+<a name="SpectralClusteringImplementation"></a>
+### Implementation
+
+Since the first half of Eigencuts uses the same calculations as Spectral
+K-means, it uses the same common M/R tasks, both those specific to spectral
+clustering and those general to Mahout. Unlike Spectral K-means,
+however, there are no DistributedRowMatrix-specific operations performed,
+and hence there is no need for that data type at all; Mahout Vectors are
+used heavily instead.
+
+Once the initial affinity matrix is constructed, there is a loop within the
+EigencutsDriver over the calculation of *D*, the creation of *L* and its
+eigendecomposition, the calculation of the sensitivities, and the actual
+affinity cuts, such that the loop terminates only when no cuts are made to
+the affinity matrix *A*. The final matrix *A* will then be representative
+of a graph structure whose data points exist in intraconnected clusters.
+
+M/R tasks specific to Eigencuts:
+* EigencutsSensitivityJob (calculates the perturbation effects on edge
+weights)
+* EigencutsAffinityCutsJob (sets edge weights to 0)
+
+M/R tasks within spectral clustering:
+* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
+* MatrixDiagonalizeJob (constructs the diagonal matrix)
+* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
+
+M/R tasks general to Mahout:
+* DistributedLanczosSolver
+* EigenVerifierJob
+
+<a name="SpectralClusteringQuickstart"></a>
+## Quickstart
+
+As noted before, the data for both these algorithms, the affinity matrix,
+is required to be symmetric. As of the first release, there is no built-in
+mechanism in Mahout for generating the affinities from raw data, as the
+formula this follows varies depending on the user's need, so it is left as
+an exercise to the user to generate the affinities prior to using these
+algorithms.
+
+The affinity input should follow the standard format for textual
+representation of a graph, namely:
+
+node_i, node_j, value_ij
+
+For example, the following 3x3 affinity matrix:
+
+<table>
+<tr><td> 0.0 </td><td> 0.8 </td><td> 0.5 </td></tr>
+<tr><td> 0.8 </td><td> 0.0 </td><td> 0.9 </td></tr>
+<tr><td> 0.5 </td><td> 0.9 </td><td> 0.0 </td></tr>
+</table>
+
+would have the following textual input format:
+
+0, 0, 0
+0, 1, 0.8
+0, 2, 0.5
+1, 0, 0.8
+1, 1, 0
+1, 2, 0.9
+2, 0, 0.5
+2, 1, 0.9
+2, 2, 0
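
Reading this textual format into an in-memory matrix is straightforward; here is a toy Python sketch (not part of Mahout) using the 3x3 example above:

```python
def parse_affinity(lines, n):
    """Parse 'i, j, value' lines into a dense n x n affinity matrix."""
    A = [[0.0] * n for _ in range(n)]
    for line in lines:
        i, j, v = (s.strip() for s in line.split(","))
        A[int(i)][int(j)] = float(v)
    return A

lines = [
    "0, 0, 0", "0, 1, 0.8", "0, 2, 0.5",
    "1, 0, 0.8", "1, 1, 0", "1, 2, 0.9",
    "2, 0, 0.5", "2, 1, 0.9", "2, 2, 0",
]
A = parse_affinity(lines, 3)
```

A quick symmetry check on the parsed matrix (A[i][j] == A[j][i] for all pairs) is a cheap way to catch input files that violate the assumption Eigencuts makes.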
+
+Then simply invoke Spectral K-means or Eigencuts, using the input directory
+and setting the other parameters as necessary (e.g. "k" for K-means, "beta"
+for Eigencuts, etc.).
+
+Format *rules*: node numbering begins at zero and is continuous; avoid
+whitespace, comments, etc.
+
+<a name="SpectralClusteringExamples"></a>
+## Examples
+
+*NOTE: I am still waiting for Carnegie Mellon/Univ. of Pittsburgh approval
+for the official data set.*
+
+For these algorithms, it is useful to have a viable example, so I have
+created a small but effective synthetic data set to show how these
+algorithms operate. The raw data was generated synthetically and [can be viewed here](http://dl.dropbox.com/u/1377610/rawdata.csv).
+It consists of 450 two-dimensional points drawn from 3 separate Gaussian
+distributions, each with its own mean and standard deviation
+([here is an image of the plotted points](http://spectrallyclustered.files.wordpress.com/2010/07/clusters.png)).
+The same data in affinity matrix form looks like this: [view the affinity data set](http://dl.dropbox.com/u/1377610/affinity.csv)
+
+To run the program, for spectral k-means:
+
+    bin/mahout spectralkmeans -i /path/to/directory/with/affinity/matrix -o /output/path -k 3 -d 450
+
+and for eigencuts:
+
+    bin/mahout eigencuts -i /path/to/directory/with/affinity/matrix -o /output/path -b 2 -d 450
+
+In both cases, the "d" flag refers to the dimensionality of the affinity
+matrix; since there are 450 points, this value will be 450. In spectral
+kmeans, the "k" is the number of clusters to estimate. In Eigencuts, the
+"b" refers to an initial estimate on the halflife of the flow of
+probability in the graph; higher estimates correspond to fewer clusters.
+
+Spectral k-means will eventually yield the clustering results, and there
+should be only a single mistake: point 54. This is because that particular
+point lies between the two left-hand clusters in the image, and so has
+high affinities with both clusters.
+
+[Full overview of example here](http://spectrallyclustered.wordpress.com/2010/07/14/sprint3quickupdate/)
+
+<a name="SpectralClusteringResources"></a>
+### Resources
+
+* [http://spectrallyclustered.wordpress.com/2010/05/27/introandspectralclustering101/](http://spectrallyclustered.wordpress.com/2010/05/27/introandspectralclustering101/)
+* [http://www.stanford.edu/class/ee378B/papers/luxburgspectral.pdf](http://www.stanford.edu/class/ee378B/papers/luxburgspectral.pdf)
+* [http://en.wikipedia.org/wiki/Laplacian_matrix](http://en.wikipedia.org/wiki/Laplacian_matrix)
+* [http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering](http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering)
Added: mahout/site/trunk/content/stochasticsingularvaluedecomposition.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/stochasticsingularvaluedecomposition.mdtext?rev=1360593&view=auto
==============================================================================
 mahout/site/trunk/content/stochasticsingularvaluedecomposition.mdtext (added)
+++ mahout/site/trunk/content/stochasticsingularvaluedecomposition.mdtext Thu Jul 12 09:25:54 2012
@@ 0,0 +1,73 @@
+Title: Stochastic Singular Value Decomposition
+The Stochastic SVD method in Mahout produces reduced-rank Singular Value
+Decomposition output in its strict mathematical definition: A = USV'.
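The randomized-projection idea behind SSVD can be sketched in a few lines of NumPy (a generic textbook formulation, not Mahout's distributed implementation; `k`, `p`, and `q` stand for requested rank, oversampling, and power iterations):

```python
import numpy as np

def ssvd_sketch(a, k, p=10, q=1):
    # Sample the range of A with a random test matrix.
    rng = np.random.default_rng(42)
    omega = rng.standard_normal((a.shape[1], k + p))
    y = a @ omega
    for _ in range(q):                 # power iterations sharpen
        y = a @ (a.T @ y)              # the subspace estimate
    qmat, _ = np.linalg.qr(y)          # orthonormal basis for range(A)
    b = qmat.T @ a                     # small (k+p) x n matrix
    ub, s, vt = np.linalg.svd(b, full_matrices=False)
    return (qmat @ ub)[:, :k], s[:k], vt[:k, :]   # A ~ U S V'

# Rank-8 test matrix: a rank-8 sketch should reconstruct it closely.
rng = np.random.default_rng(0)
a = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 100))
u, s, vt = ssvd_sketch(a, k=8)
err = np.linalg.norm(a - u @ (s[:, None] * vt)) / np.linalg.norm(a)
```

The flop savings over Krylov methods come from the fact that the expensive multiplications against A happen only a constant number of times, regardless of the rank requested.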
+
+<a name="StochasticSingularValueDecompositionThebenefitsoverothermethodsare:"></a>
+##### The benefits over other methods are:
+* Reduced flops required compared to Krylov subspace methods.
+* In the map-reduce world, a fixed number of MR iterations is required
+regardless of the rank requested.
+* The precision/speed balance can be tweaked with options.
+* A is a Distributed Row Matrix whose rows may be identified by any
+Writable (such as a document path). As such, it works directly on the
+output of seq2sparse.
+* As of 0.7 trunk, includes a PCA and dimensionality reduction workflow
+(EXPERIMENTAL! Feedback on performance and other PCA-related issues/blogs is
+greatly appreciated.)
+
+Map-reduce characteristics:
+SSVD uses at most 3 _sequential_ MR steps (map-only + map-reduce + 2
+optional parallel map-reduce jobs) to produce the reduced-rank approximation
+of the U, V and S matrices. Additionally, two more map-reduce steps are
+added for each power iteration step, if requested.
+
+<a name="StochasticSingularValueDecompositionPotentialdrawbacks:"></a>
+##### Potential drawbacks:
+* Potentially less precise (but adding even one power iteration seems to
+fix that quite a bit).
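The effect of a single power iteration is easy to reproduce in a small NumPy experiment (a hypothetical sketch unrelated to Mahout's code): measure the residual of a randomized range projection with and without one power iteration, on a matrix whose singular values decay slowly (the hardest case for a plain random projection).

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_range_error(a, k, q, p=5, seed=7):
    # Residual of projecting A onto a randomly sampled (k+p)-dimensional
    # subspace, sharpened by q power iterations. Same seed for both runs
    # so only q differs.
    local = np.random.default_rng(seed)
    omega = local.standard_normal((a.shape[1], k + p))
    y = a @ omega
    for _ in range(q):
        y = a @ (a.T @ y)
    qm, _ = np.linalg.qr(y)
    return np.linalg.norm(a - qm @ (qm.T @ a))

# Build a matrix with slowly decaying singular values (i^-0.5).
m, n, k = 300, 200, 10
u0, _ = np.linalg.qr(rng.standard_normal((m, n)))
v0, _ = np.linalg.qr(rng.standard_normal((n, n)))
a = (u0 * np.arange(1, n + 1) ** -0.5) @ v0.T

err_q0 = rand_range_error(a, k, q=0)
err_q1 = rand_range_error(a, k, q=1)
# err_q1 is noticeably smaller: the iteration raises each singular
# value to the 3rd power, steepening the decay the sampler sees.
```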
+
+<a name="StochasticSingularValueDecompositionDocumentation"></a>
+##### Documentation
+
+[Overview and Usage](^ssvdcli.pdf.html)
+Note: Please use 0.6 or later! For the PCA workflow, please use 0.7 or later.
+
+<a name="StochasticSingularValueDecompositionPublications"></a>
+##### Publications
+
+[Nathan Halko's dissertation](http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf)
+"Randomized methods for computing low-rank
+approximations of matrices" contains a comprehensive definition of the
+parallelization strategy taken in the Mahout SSVD implementation, and also
+some precision/scalability benchmarks, esp. w.r.t. the Mahout Lanczos
+implementation on a typical corpus data set.
+
+<a name="StochasticSingularValueDecompositionRsimulation"></a>
+##### R simulation
+[Nonparallel SSVD simulation in R with power iterations and PCA options](^ssvd.r.html).
+Note that this implementation is not the most optimal for a sequential
+solver; it is for demonstration purposes only.
+
+However, try this R code to simulate a meaningful input:
+<DIV class="code panel" style="border-style: solid;border-width: 1px;"><DIV class="codeHeader panelHeader" style="border-bottom-width: 1px;border-bottom-style: solid;"><B>tests.R</B></DIV><DIV class="codeContent panelContent">
+
+    n <- 1000
+    m <- 2000
+    k <- 10
+
+    qi <- 1
+
+    # simulated input
+    svalsim <- diag(k:1)
+
+    usim <- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m, ncol=k)))
+    vsim <- qr.Q(qr(matrix(rnorm(n*k, mean=5), nrow=n, ncol=k)))
+
+    x <- usim %*% svalsim %*% t(vsim)
+
+</DIV></DIV>
+
+and try to compare ssvd.svd\(x\) and stock svd\(x\) performance for the
+same rank k, and notice the difference in running time. Also play with
+power iterations (qIter) and compare the accuracies of standard SVD and SSVD.
+
