mahout-commits mailing list archives

From sro...@apache.org
Subject svn commit: r1360593 [15/17] - in /mahout/site/trunk: ./ cgi-bin/ content/ content/attachments/ content/attachments/101992/ content/attachments/116559/ content/attachments/22872433/ content/attachments/22872443/ content/attachments/23335706/ content/at...
Date Thu, 12 Jul 2012 09:26:03 GMT
Added: mahout/site/trunk/content/quickstart.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/quickstart.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/quickstart.mdtext (added)
+++ mahout/site/trunk/content/quickstart.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,63 @@
+Title: Quickstart
+<a name="Quickstart-DownloadandInstallation"></a>
+# Download and Installation
+
+* [Download and installation](buildingmahout.html)
+
+<a name="Quickstart-RunningtheExamples"></a>
+# Running the Examples
+
+Mahout runs on [Apache Hadoop](http://hadoop.apache.org/). So, to run these
+examples, install the latest compatible [^1] [^2] [Hadoop Common release](http://hadoop.apache.org/common/).
+
+[^1]: When using a release, see the release notes.
+[^2]: When using trunk, see <i>parent/pom.xml</i>.
+
+<a name="Quickstart-Clustering"></a>
+## Clustering
+
+* [Clustering of synthetic control data](clustering-of-synthetic-control-data.html)
+* [Visualizing Sample Clusters](visualizing-sample-clusters.html)
+* [Clustering Seinfeld Episodes](clustering-seinfeld-episodes.html)
+
+
+<a name="Quickstart-Classification"></a>
+## Classification
+
+* [Twenty Newsgroups](twenty-newsgroups.html)
+* [Wikipedia Bayes Example](wikipedia-bayes-example.html)
+
+<a name="Quickstart-GeneticProgramming"></a>
+## Genetic Programming
+
+* [Watchmaker](mahout.ga.tutorial.html)
+
+<a name="Quickstart-DecisionForest"></a>
+## Decision Forest
+
+* [Breiman Example](breiman-example.html)
+* [Partial Implementation](partial-implementation.html)
+
+<a name="Quickstart-Recommendationmining"></a>
+## Recommendation mining
+
+This package comes with four examples, based on the Netflix, Book-Crossing,
+GroupLens and Jester data sets.
+
+* [RecommendationExamples](recommendationexamples.html)
+
+<a name="Quickstart-WheretoGoNext"></a>
+# Where to Go Next
+
+* If you are working with text, read more on [Creating Vectors from Text](creating-vectors-from-text.html)
+* To learn more about grouping items by similarity and identifying
+clusters, read more on [Clustering your data](clusteringyourdata.html)
+* If you want to classify incoming items into predefined categories, read
+more on [Classifying your data](classifyingyourdata.html)
+* To learn how to mine frequent patterns from your data, read more on [Parallel Frequent Pattern Mining](parallel-frequent-pattern-mining.html)
+* To read more on building recommendation engines, have a look at the [Taste documentation](http://lucene.apache.org/mahout/taste.html)
+ and the Taste Hadoop command line page (TasteCommandLine)
+
+Go back to the [Main Page](mahout-wiki.html)
+ for more information.
+

Added: mahout/site/trunk/content/random-forests.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/random-forests.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/random-forests.mdtext (added)
+++ mahout/site/trunk/content/random-forests.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,228 @@
+Title: Random Forests
+<a name="RandomForests-HowtogrowaDecisionTree"></a>
+### How to grow a Decision Tree
+
+Source: \[3\]
+
+LearnUnprunedTree(*X*, *Y*)
+
+Input: *X*, a matrix of *R* rows and *M* columns, where X_ij is the value of
+the *j*'th attribute in the *i*'th input datapoint. Each column consists of
+either all real values or all categorical values.
+Input: *Y*, a vector of *R* elements, where Y_i is the output class of the
+*i*'th datapoint. The Y_i values are categorical.
+Output: an unpruned decision tree
+
+    If all records in X have identical values in all their attributes (this
+    includes the case where R < 2), return a Leaf Node predicting the
+    majority output, breaking ties randomly.
+    If all values in Y are the same, return a Leaf Node predicting this
+    value as the output.
+    Else
+        select m variables at random out of the M variables
+        For j = 1 .. m
+            If the j'th attribute is categorical
+                IG_j = IG(Y|X_j)    (see Information Gain)
+            Else (the j'th attribute is real-valued)
+                IG_j = IG*(Y|X_j)   (see Information Gain)
+        Let j* = argmax_j IG_j      (this is the splitting attribute we'll use)
+        If j* is categorical then
+            For each value v of the j*'th attribute
+                Let X^v = subset of rows of X in which X_ij* = v
+                Let Y^v = corresponding subset of Y
+                Let Child^v = LearnUnprunedTree(X^v, Y^v)
+            Return a decision tree node, splitting on the j*'th attribute.
+            The number of children equals the number of values of the j*'th
+            attribute, and the v'th child is Child^v.
+        Else (j* is real-valued); let t be the best split threshold
+            Let X^LO = subset of rows of X in which X_ij* <= t
+            Let Y^LO = corresponding subset of Y
+            Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
+            Let X^HI = subset of rows of X in which X_ij* > t
+            Let Y^HI = corresponding subset of Y
+            Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
+            Return a decision tree node, splitting on the j*'th attribute.
+            It has two children corresponding to whether the j*'th attribute
+            is above or below the threshold t.
+
+*Note*: there are alternatives to Information Gain for splitting nodes.
+
+<a name="RandomForests-Informationgain"></a>
+### Information gain
+
+Source: \[3\]
+
+#### Nominal attributes
+
+Suppose X can take one of m values V_1, V_2, ..., V_m, with
+P(X=V_1) = p_1, P(X=V_2) = p_2, ..., P(X=V_m) = p_m.
+
+    H(X) = -sum_{j=1..m} p_j log2 p_j          (the entropy of X)
+    H(Y|X=v) = the entropy of Y among only those records in which X has value v
+    H(Y|X) = sum_j p_j H(Y|X=v_j)
+    IG(Y|X) = H(Y) - H(Y|X)
+
+#### Real-valued attributes
+
+Suppose X is real-valued.
+
+    define IG(Y|X:t) as H(Y) - H(Y|X:t)
+    define H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
+    define IG*(Y|X) = max_t IG(Y|X:t)
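+
+The following self-contained Java sketch makes the nominal-attribute case
+concrete. It is illustrative only and not part of Mahout; the class and
+method names are placeholders.
+
+    import java.util.ArrayList;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Map;
+    
+    /** Illustrative only: entropy and information gain for a nominal attribute. */
+    public class InformationGainExample {
+    
+      /** H(Y) = -sum_j p_j log2(p_j), estimated from the observed labels. */
+      static double entropy(List<String> labels) {
+        Map<String, Integer> counts = new HashMap<String, Integer>();
+        for (String label : labels) {
+          Integer c = counts.get(label);
+          counts.put(label, c == null ? 1 : c + 1);
+        }
+        double h = 0.0;
+        for (int count : counts.values()) {
+          double p = (double) count / labels.size();
+          h -= p * (Math.log(p) / Math.log(2));
+        }
+        return h;
+      }
+    
+      /** IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v). */
+      static double informationGain(List<String> x, List<String> y) {
+        Map<String, List<String>> partitions = new HashMap<String, List<String>>();
+        for (int i = 0; i < x.size(); i++) {
+          List<String> part = partitions.get(x.get(i));
+          if (part == null) {
+            part = new ArrayList<String>();
+            partitions.put(x.get(i), part);
+          }
+          part.add(y.get(i));
+        }
+        double conditionalEntropy = 0.0;
+        for (List<String> part : partitions.values()) {
+          conditionalEntropy += ((double) part.size() / y.size()) * entropy(part);
+        }
+        return entropy(y) - conditionalEntropy;
+      }
+    }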
+
+<a name="RandomForests-HowtogrowaRandomForest"></a>
+### How to grow a Random Forest
+
+Source: \[1\]
+
+Each tree is grown as follows (see the sketch after this list):
+1. If the number of cases in the training set is *N*, sample *N* cases at
+random, but with replacement, from the original data. This sample will be
+the training set for growing the tree.
+1. If there are *M* input variables, a number *m* << *M* is specified such
+that at each node *m* variables are selected at random out of the *M*, and
+the best split on these *m* is used to split the node. The value of *m* is
+held constant while the forest is grown.
+1. Each tree is grown to the largest extent possible. There is no pruning.
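+
+The following minimal Java sketch shows the shape of this loop. It is
+illustrative only and not Mahout code; the learnUnprunedTree placeholder
+stands in for the LearnUnprunedTree procedure described above.
+
+    import java.util.ArrayList;
+    import java.util.List;
+    import java.util.Random;
+    
+    /** Illustrative forest-growing loop: bagging plus random feature selection. */
+    public class ForestGrower {
+    
+      /** Grows numTrees trees, each from a bootstrap sample of the n training cases. */
+      static <C> List<Object> growForest(List<C> cases, int numTrees, int m, Random rng) {
+        List<Object> forest = new ArrayList<Object>();
+        int n = cases.size();
+        for (int k = 0; k < numTrees; k++) {
+          // 1. sample n cases at random, with replacement (a bootstrap sample)
+          List<C> bootstrap = new ArrayList<C>(n);
+          for (int i = 0; i < n; i++) {
+            bootstrap.add(cases.get(rng.nextInt(n)));
+          }
+          // 2./3. grow an unpruned tree from the bootstrap sample; at each node only
+          // m of the M attributes are candidates for the split (see LearnUnprunedTree)
+          forest.add(learnUnprunedTree(bootstrap, m, rng));
+        }
+        return forest;
+      }
+    
+      /** Placeholder for the LearnUnprunedTree procedure described earlier. */
+      static <C> Object learnUnprunedTree(List<C> bootstrap, int m, Random rng) {
+        throw new UnsupportedOperationException("see LearnUnprunedTree above");
+      }
+    }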
+
+<a name="RandomForests-RandomForestparameters"></a>
+### Random Forest parameters
+
+Source: \[2\]
+
+Random Forests are easy to use; the only two parameters a user of the
+technique has to determine are the number of trees to be used and the
+number of variables (*m*) to be randomly selected from the available set of
+variables.
+Breiman's recommendation is to pick a large number of trees, and to use
+the square root of the number of variables for *m*.
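+
+As a concrete illustration (not a Mahout API; the tree count of 500 is an
+arbitrary example of "a large number of trees"):
+
+    /** Illustrative defaults following Breiman's recommendations above. */
+    public class ForestDefaults {
+      static int defaultNumTrees() {
+        return 500;  // an arbitrary "large number of trees"
+      }
+      static int defaultM(int numVariables) {
+        return (int) Math.round(Math.sqrt(numVariables));  // m = sqrt(M)
+      }
+    }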
+
+<a name="RandomForests-Howtopredictthelabelofacase"></a>
+### How to predict the label of a case
+
+Classify(*node*, *V*)
+
+Input: *node*, a node of the decision tree; if *node.attribute = j* then the
+split at this node is done on the *j*'th attribute.
+Input: *V*, a vector of *M* columns where V_j is the value of the *j*'th attribute.
+Output: the label of *V*
+
+    If node is a Leaf then
+        Return the value predicted by node
+    Else
+        Let j = node.attribute
+        If j is categorical then
+            Let v = V_j
+            Let child^v = child node corresponding to the attribute's value v
+            Return Classify(child^v, V)
+        Else (j is real-valued)
+            Let t = node.threshold   (split threshold)
+            If V_j < t then
+                Let child^LO = child node corresponding to (< t)
+                Return Classify(child^LO, V)
+            Else
+                Let child^HI = child node corresponding to (>= t)
+                Return Classify(child^HI, V)
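+
+A compact Java rendering of this recursion is shown below. It is
+illustrative only; the Node class is a hypothetical stand-in, not Mahout's
+decision-tree representation.
+
+    import java.util.Map;
+    
+    /** Illustrative recursive prediction over a decision tree, following the pseudocode above. */
+    public class TreeClassifier {
+    
+      static class Node {
+        boolean leaf;
+        String label;                       // prediction, if this node is a leaf
+        int attribute;                      // j, the attribute this node splits on
+        boolean categorical;                // true if attribute j is categorical
+        Map<String, Node> childrenByValue;  // children of a categorical split, keyed by value v
+        double threshold;                   // t, for a real-valued split
+        Node lo;                            // child for V_j < t
+        Node hi;                            // child for V_j >= t
+      }
+    
+      static String classify(Node node, Object[] v) {
+        if (node.leaf) {
+          return node.label;                                   // the value predicted by node
+        }
+        if (node.categorical) {
+          Node child = node.childrenByValue.get((String) v[node.attribute]);
+          return classify(child, v);                           // descend into child^v
+        }
+        double value = ((Number) v[node.attribute]).doubleValue();
+        return classify(value < node.threshold ? node.lo : node.hi, v);  // child^LO or child^HI
+      }
+    }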
+
+<a name="RandomForests-Theoutofbag(oob)errorestimation"></a>
+### The out of bag (oob) error estimation
+
+Source: \[1\]
+
+In random forests, there is no need for cross-validation or a separate test
+set to get an unbiased estimate of the test set error. It is estimated
+internally, during the run, as follows:
+* Each tree is constructed using a different bootstrap sample from the
+original data. About one-third of the cases are left out of the bootstrap
+sample and not used in the construction of the _kth_ tree.
+* Put each case left out of the construction of the _kth_ tree down the
+_kth_ tree to get a classification. In this way, a test set classification
+is obtained for each case in about one-third of the trees. At the end of
+the run, take *j* to be the class that got most of the votes every time
+case *n* was _oob_. The proportion of times that *j* is not equal to the
+true class of *n*, averaged over all cases, is the _oob error estimate_. This
+has proven to be unbiased in many tests.
+
+<a name="RandomForests-OtherRFuses"></a>
+### Other RF uses
+
+Source: \[1\]
+* variable importance
+* gini importance
+* proximities
+* scaling
+* prototypes
+* missing values replacement for the training set
+* missing values replacement for the test set
+* detecting mislabeled cases
+* detecting outliers
+* detecting novelties
+* unsupervised learning
+* balancing prediction error
+Please refer to \[1\] for a detailed description.
+
+<a name="RandomForests-References"></a>
+### References
+
+\[1\] Random Forests - Classification Description.
+[http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
+
+\[2\] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention
+and Profitability by Using Random Forests and Regression Forests
+Techniques," Working Papers of Faculty of Economics and Business
+Administration, Ghent University, Belgium 04/282, Ghent University,
+Faculty of Economics and Business Administration.
+Available online: [http://ideas.repec.org/p/rug/rugwps/04-282.html](http://ideas.repec.org/p/rug/rugwps/04-282.html)
+
+\[3\] Decision Trees - Andrew W. Moore.
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
+
+\[4\] Information Gain - Andrew W. Moore.
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)

Added: mahout/site/trunk/content/recommendationexamples.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/recommendationexamples.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/recommendationexamples.mdtext (added)
+++ mahout/site/trunk/content/recommendationexamples.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,47 @@
+Title: RecommendationExamples
+<a name="RecommendationExamples-Introduction"></a>
+# Introduction 
+
+This quick start page describes how to run the recommendation examples
+provided by Mahout. Mahout comes with four recommendation mining examples.
+They are based on the Netflix, Jester, GroupLens and Book-Crossing data
+sets, respectively.
+
+<a name="RecommendationExamples-Steps"></a>
+# Steps 
+
+<a name="RecommendationExamples-Testingitononesinglemachine"></a>
+## Testing it on one single machine 
+
+In the examples directory type: 
+
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.bookcrossing.BookCrossingRecommenderEvaluatorRunner" -Dexec.args="<OPTIONS>"
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.netflix.NetflixRecommenderEvaluatorRunner" -Dexec.args="<OPTIONS>"
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.netflix.TransposeToByUser" -Dexec.args="<OPTIONS>"
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.jester.JesterRecommenderEvaluatorRunner" -Dexec.args="<OPTIONS>"
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="<OPTIONS>"
+
+
+Here, the command line options need only be:
+
+
+    -i [input file]
+
+
+
+Note that the GroupLens example is designed for the "1 million" data set,
+available at http://www.grouplens.org/node/73 . The "input file" above is
+the ratings.dat file contained in the zip file from that data set. This file
+has an unusual format and so has a special parser. The example code here
+can be easily modified to use a regular FileDataModel and thus work on more
+standard input, including the other data sets available at this site (see
+the sketch below).
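+
+For illustration, a minimal sketch of that substitution, assuming the
+ratings have first been converted to standard comma-separated
+"userID,itemID,preference" lines (the file name below is just a
+placeholder):
+
+    import java.io.File;
+    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
+    import org.apache.mahout.cf.taste.model.DataModel;
+    
+    public class StandardInputExample {
+      public static void main(String[] args) throws Exception {
+        // FileDataModel expects lines of the form "userID,itemID,prefValue"
+        DataModel model = new FileDataModel(new File("ratings.csv"));
+        System.out.println("users: " + model.getNumUsers());
+      }
+    }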

Added: mahout/site/trunk/content/recommender-documentation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/recommender-documentation.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/recommender-documentation.mdtext (added)
+++ mahout/site/trunk/content/recommender-documentation.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,394 @@
+Title: Recommender Documentation
+<a name="RecommenderDocumentation-Overview"></a>
+## Overview
+
+_This documentation concerns the non-distributed, non-Hadoop-based
+recommender engine / collaborative filtering code inside Mahout. It was
+formerly a separate project called "Taste" and has continued development
+inside Mahout alongside other Hadoop-based code. It may be viewed as a
+somewhat separate, older, more comprehensive and more mature aspect of this
+code, compared to current development efforts focusing on Hadoop-based
+distributed recommenders. This remains the best entry point into Mahout
+recommender engines of all kinds._
+
+A Mahout-based collaborative filtering engine takes users' preferences for
+items ("tastes") and returns estimated preferences for other items. For
+example, a site that sells books or CDs could easily use Mahout to figure
+out, from past purchase data, which CDs a customer might be interested in
+listening to.
+
+Mahout provides a rich set of components from which you can construct a
+customized recommender system, choosing from a selection of algorithms. Mahout is
+designed to be enterprise-ready; it's designed for performance, scalability
+and flexibility.
+
+Mahout recommenders are not just for Java; they can be run as an external
+server which exposes recommendation logic to your application via web
+services and HTTP.
+
+Top-level packages define the Mahout interfaces to these key abstractions:
+* DataModel
+* UserSimilarity
+* ItemSimilarity
+* UserNeighborhood
+* Recommender
+
+Subpackages of org.apache.mahout.cf.taste.impl hold implementations of
+these interfaces. These are the pieces from which you will build your own
+recommendation engine. That's it! For the academically inclined, Mahout
+supports *memory-based* and *item-based* recommender systems, *slope one*
+recommenders, and a couple of other experimental implementations. It does
+not currently support *model-based* recommenders.
+
+<a name="RecommenderDocumentation-Architecture"></a>
+## Architecture
+
+![taste-architecture](attachments/22872433/23003157.png)
+
+This diagram shows the relationship between various Mahout components in a
+user-based recommender. An item-based recommender system is similar except
+that there are no PreferenceInferrers or Neighborhood algorithms involved.
+
+<a name="RecommenderDocumentation-Recommender"></a>
+### Recommender
+A Recommender is the core abstraction in Mahout. Given a DataModel, it can
+produce recommendations. Applications will most likely use the
+GenericUserBasedRecommender or GenericItemBasedRecommender implementations,
+possibly decorated by CachingRecommender.
+
+<a name="RecommenderDocumentation-DataModel"></a>
+### DataModel
+A DataModel is the interface to information about user preferences. An
+implementation might draw this data from any source, but a database is the
+most likely source. Mahout provides MySQLJDBCDataModel, for example, to
+access preference data from a database via JDBC and MySQL. Another exists
+for PostgreSQL. Mahout also provides a FileDataModel.
+
+There are no abstractions for a user or item in the object model (not
+anymore). Users and items are identified solely by an ID value in the
+framework. Further, this ID value must be numeric; it is a Java long type
+through the APIs. A Preference object or PreferenceArray object
+encapsulates the relation between user and preferred items (or items and
+users preferring them).
+
+Finally, Mahout supports, in various ways, a so-called "boolean" data model
+in which users do not express preferences of varying strengths for items,
+but simply express an association or none at all. For example, while users
+might express a preference from 1 to 5 in the context of a movie
+recommender site, there may be no notion of a preference value between
+users and pages in the context of recommending pages on a web site: there
+is only a notion of an association, or none, between a user and pages that
+have been visited.
+
+<a name="RecommenderDocumentation-UserSimilarity"></a>
+### UserSimilarity
+A UserSimilarity defines a notion of similarity between two Users. This is
+a crucial part of a recommendation engine. These are attached to a
+Neighborhood implementation. ItemSimilarities are analogous, but find
+similarity between Items.
+
+<a name="RecommenderDocumentation-UserNeighborhood"></a>
+### UserNeighborhood
+In a user-based recommender, recommendations are produced by finding a
+"neighborhood" of similar users near a given user. A UserNeighborhood
+defines a means of determining that neighborhood &mdash; for example,
+nearest 10 users. Implementations typically need a UserSimilarity to
+operate.
+
+<a name="RecommenderDocumentation-Requirements"></a>
+## Requirements
+<a name="RecommenderDocumentation-Required"></a>
+### Required
+
+* [Java/ J2SE 6.0](http://www.java.com/getjava/index.jsp)
+
+<a name="RecommenderDocumentation-Optional"></a>
+### Optional
+* [Apache Maven](http://maven.apache.org)
+  2.2.1 or later, if you want to build from source or build examples. (Mac
+users note that even OS X 10.5 ships with Maven 2.0.6, which will not
+work.)
+* Mahout web applications require a [Servlet 2.3+](http://java.sun.com/products/servlet/index.jsp)
+ container, such as [Apache Tomcat](http://jakarta.apache.org/tomcat/). It
+may in fact work with older containers with slight modification.
+
+<a name="RecommenderDocumentation-Demo"></a>
+## Demo
+
+To build and run the demo, follow the instructions below, which are written
+for Unix-like operating systems:
+
+* Obtain a copy of the Mahout distribution, either from SVN or as a
+downloaded archive.
+* Download the "1 Million MovieLens Dataset" from [Grouplens.org](http://www.grouplens.org/)
+* Unpack the archive and copy movies.dat and ratings.dat to
+trunk/integration/src/main/resources/org/apache/mahout/cf/taste/example/grouplens
+under the Mahout distribution directory.
+* Navigate to the directory where you unpacked the Mahout distribution, and
+navigate to trunk.
+* Run mvn -DskipTests install, which builds and installs Mahout core to
+your local repository
+* cd integration
+* You may need to give Maven more memory: in a bash shell, export
+MAVEN_OPTS=-Xmx1024M
+* mvn jetty:run.
+* Get recommendations by accessing the web application in your browser:
+http://localhost:8080/mahout-integration/RecommenderServlet?userID=1 This
+will produce a simple preference-item ID list which could be consumed by a
+client application. Get more useful human-readable output with the debug
+parameter:
+http://localhost:8080/mahout-integration/RecommenderServlet?userID=1&debug=true
+
+
+<a name="RecommenderDocumentation-Examples"></a>
+## Examples
+<a name="RecommenderDocumentation-User-basedRecommender"></a>
+### User-based Recommender
+User-based recommenders are the "original", conventional style of
+recommender system. They can produce good recommendations when tweaked
+properly; they are not necessarily the fastest recommender systems and are
+thus best suited to small data sets (roughly, less than ten million ratings).
+We'll start with an example of this.
+
+First, create a DataModel of some kind. Here, we'll use a simple one based
+on data in a file. The file should be in CSV format, with lines of the form
+"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
+
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+
+We'll use the PearsonCorrelationSimilarity implementation of UserSimilarity
+as our user correlation algorithm, and add an optional preference inference
+algorithm:
+
+
+    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
+    // Optional:
+    userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer());
+
+
+Now we create a UserNeighborhood algorithm. Here we use nearest-3:
+
+    UserNeighborhood neighborhood =
+          new NearestNUserNeighborhood(3, userSimilarity, model);
+
+Now we can create our Recommender, and add a caching decorator:
+
+    Recommender recommender =
+          new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+
+Now we can get 10 recommendations for user ID "1234" &mdash; done!
+
+    List<RecommendedItem> recommendations =
+          cachingRecommender.recommend(1234, 10);
+
+<a name="RecommenderDocumentation-Item-basedRecommender"></a>
+### Item-based Recommender
+
+We could have created an item-based recommender instead. Item-based
+recommenders base recommendations not on user similarity, but on item
+similarity. In theory these are about the same approach to the problem,
+just from different angles. However, the similarity of two items is
+relatively fixed, more so than the similarity of two users. So, item-based
+recommenders can use pre-computed similarity values in the computations,
+which makes them much faster. For large data sets, item-based recommenders
+are more appropriate.
+
+Let's start over, again with a FileDataModel to start:
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+We'll also need an ItemSimilarity. We could use
+PearsonCorrelationSimilarity, which computes item similarity in real time,
+but this is generally too slow to be useful. Instead, in a real
+application, you would feed a list of pre-computed correlations to a
+GenericItemSimilarity:
+
+    // Construct the list of pre-computed correlations
+    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations = ...;
+    ItemSimilarity itemSimilarity = new GenericItemSimilarity(correlations);
+
+Then we can finish as before to produce recommendations:
+
+    Recommender recommender =
+          new GenericItemBasedRecommender(model, itemSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+    ...
+    List<RecommendedItem> recommendations =
+          cachingRecommender.recommend(1234, 10);
+
+<a name="RecommenderDocumentation-Slope-OneRecommender"></a>
+### Slope-One Recommender
+
+This is a simple yet effective Recommender and we present another example
+to round out the list:
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+    // Make a weighted slope one recommender
+    Recommender recommender = new SlopeOneRecommender(model);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+
+<a name="RecommenderDocumentation-Integrationwithyourapplication"></a>
+## Integration with your application
+<a name="RecommenderDocumentation-Direct"></a>
+### Direct
+
+You can create a Recommender, as shown above, wherever you like in your
+Java application, and use it. This includes simple Java applications or GUI
+applications, server applications, and J2EE web applications.
+
+<a name="RecommenderDocumentation-Standaloneserver"></a>
+### Standalone server
+A Mahout recommender can also be run as an external server, which may be
+the only option for non-Java applications. It can be exposed as a web
+application via org.apache.mahout.cf.taste.web.RecommenderServlet, and your
+application can then access recommendations via simple HTTP requests and
+responses. See above, and see the javadoc for details.
+
+<a name="RecommenderDocumentation-Performance"></a>
+## Performance
+<a name="RecommenderDocumentation-RuntimePerformance"></a>
+### Runtime Performance
+The more data you give, the better. Though Mahout is designed for
+performance, you will undoubtedly run into performance issues at some
+point. For best results, consider using the following command-line flags to
+your JVM:
+
+* -server: Enables the server VM, which is generally appropriate for
+long-running, computation-intensive applications.
+* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte
+doesn't hurt when dealing with tens of millions of preferences. Mahout
+recommenders will generally use as much memory as you give them for caching,
+which helps performance. Set the initial and max size to the same value to
+avoid wasting time growing the heap, and to avoid having the JVM run minor
+collections to avoid growing the heap, which will clear cached values.
+* -da -dsa: Disable all assertions.
+* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most
+of them in this framework
+* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only):
+Use a GC algorithm designed to take advantage of multiple processors, and
+designed for throughput. This is a default in J2SE 5.0.
+* -XX:-DisableExplicitGC: Disable calls to System.gc(). These calls can
+only hurt in the presence of modern GC algorithms; they may force Mahout to
+remove cached data needlessly. This flag isn't needed if you're sure your
+code and third-party code you use doesn't call this method.
+
+Also consider the following tips:
+
+* Use CachingRecommender on top of your custom Recommender implementation.
+* When using JDBCDataModel, make sure you've taken basic steps to optimize
+the table storing preference data. Create a primary key on the user ID and
+item ID columns, and an index on them. Set them to be non-null. And so on.
+Tune your database for lots of concurrent reads! When using JDBC, the
+database is almost always the bottleneck. Plenty of memory and caching are
+even more important.
+* Also, pooling database connections is essential to performance. If using
+a J2EE container, it probably provides a way to configure connection pools.
+If you are creating your own DataSource directly, try wrapping it in
+org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource
+* See MySQL-specific notes on performance in the javadoc for
+MySQLJDBCDataModel.
+
+<a name="RecommenderDocumentation-AlgorithmPerformance:WhichOneIsBest?"></a>
+### Algorithm Performance: Which One Is Best?
+There is no right answer; it depends on your data, your application,
+environment, and performance needs. Mahout provides the building blocks
+from which you can construct the best Recommender for your application. The
+links below provide research on this topic. You will probably need a bit of
+trial-and-error to find a setup that works best. The code sample above
+provides a good starting point.
+
+Fortunately, Mahout provides a way to evaluate the accuracy of your
+Recommender on your own data, in org.apache.mahout.cf.taste.eval:
+
+
+    DataModel myModel = ...;
+    RecommenderBuilder builder = new RecommenderBuilder() {
+      public Recommender buildRecommender(DataModel model) {
+        // build and return the Recommender to evaluate here
+      }
+    };
+    RecommenderEvaluator evaluator =
+    	  new AverageAbsoluteDifferenceRecommenderEvaluator();
+    double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
+
+
+For "boolean" data model situations, where there are no notions of
+preference value, the above evaluation based on estimated preference does
+not make sense. In this case, try this kind of evaluation, which presents
+traditional information retrieval figures like precision and recall, which
+are more meaningful:
+
+
+    ...
+    RecommenderIRStatsEvaluator evaluator =
+    	new GenericRecommenderIRStatsEvaluator();
+    IRStatistics stats =
+    	evaluator.evaluate(builder, null, myModel, null, 3,
+    	    RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
+    	    1.0);
+
+
+
+<a name="RecommenderDocumentation-UsefulLinks"></a>
+## Useful Links
+You'll want to look at these packages too, which offer more algorithms and
+approaches that you may find useful:
+
+* [Cofi](http://www.nongnu.org/cofi/)
+: A Java-Based Collaborative Filtering Library
+* [CoFE](http://eecs.oregonstate.edu/iis/CoFE/)
+
+Here's a handful of research papers that I've read and found particularly
+useful:
+
+J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive Algorithms for Collaborative Filtering](http://research.microsoft.com/research/pubs/view.aspx?tr_id=166)
+," in Proceedings of the Fourteenth Conference on Uncertainity in
+Artificial Intelligence (UAI 1998), 1998.
+
+B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative filtering recommendation algorithms](http://www10.org/cdrom/papers/519/)
+" in Proceedings of the Tenth International Conference on the World Wide
+Web (WWW 10), pp. 285-295, 2001.
+
+P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an open architecture for collaborative filtering of netnews](http://doi.acm.org/10.1145/192844.192905)
+" in Proceedings of the 1994 ACM conference on Computer Supported
+Cooperative Work (CSCW 1994), pp. 175-186, 1994.
+
+J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic framework for performing collaborative filtering](http://www.grouplens.org/papers/pdf/algs.pdf)
+" in Proceedings of the 22nd annual international ACM SIGIR Conference on
+Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
+1999.
+
+Clifford Lyon, "[Movie Recommender](http://materialobjects.com/cf/MovieRecommender.pdf)
+" CSCI E-280 final project, Harvard University, 2004.
+
+Daniel Lemire, Anna Maclachlan, "[Slope One Predictors for Online Rating-Based Collaborative Filtering](http://www.daniel-lemire.com/fr/abstracts/SDM2005.html)
+," Proceedings of SIAM Data Mining (SDM '05), 2005.
+
+Michelle Anderson, Marcel Ball, Harold Boley, Stephen Greene, Nancy Howse, Daniel Lemire and Sean McGrath, "[RACOFI: A Rule-Applying Collaborative Filtering System](http://www.daniel-lemire.com/fr/documents/publications/racofi_nrc.pdf)
+"," Proceedings of COLA '03, 2003.
+
+These links will take you to all the collaborative filtering reading you
+could ever want!
+* [Paul Perry's notes](http://www.paulperry.net/notes/cf.asp)
+* [James Thornton's collaborative filtering resources](http://jamesthornton.com/cf/)
+* [Daniel Lemire's blog](http://www.daniel-lemire.com/blog/)
+ which frequently covers collaborative filtering topics

Added: mahout/site/trunk/content/recommender-first-timer-faq.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/recommender-first-timer-faq.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/recommender-first-timer-faq.mdtext (added)
+++ mahout/site/trunk/content/recommender-first-timer-faq.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,50 @@
+Title: Recommender First-Timer FAQ
+Many people with an interest in recommenders arrive at Mahout since they're
+building a first recommender system. Some starting questions have been
+asked enough times to warrant a FAQ collecting advice and rules-of-thumb to
+newcomers.
+
+For the interested, these topics are treated in detail in the book [Mahout in Action](http://manning.com/owen/)
+.
+
+Don't start with a distributed, Hadoop-based recommender; take on that
+complexity only if necessary. Start with non-distributed recommenders. They
+are simpler, have fewer requirements, and are more flexible.
+
+As a crude rule of thumb, a system with up to 100M user-item associations
+(ratings, preferences) should "fit" onto one modern server machine with 4GB
+of heap available and run acceptably as a real-time recommender. The system
+is invariably memory-bound since keeping data in memory is essential to
+performance.
+
+Beyond this point it gets expensive to deploy a machine with enough RAM,
+so designing for a distributed system makes sense when nearing this scale.
+However, most applications don't "really" have 100M associations to process.
+Data can be sampled; noisy and old data can often be aggressively pruned
+without significant impact on the result.
+
+The next question is whether or not your system has preference values, or
+ratings. Do users and items merely have an association or not, such as the
+existence or lack of a click? Or is behavior translated into some scalar
+value representing the user's degree of preference for the item?
+
+If you have ratings, then a good place to start is a
+GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
+metric. If you don't have ratings, then a good place to start is
+GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
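+
+For illustration, here is a minimal wiring of the ratings case just
+described (the file name is a placeholder; for the boolean case you would
+swap in GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity):
+
+    import java.io.File;
+    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
+    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
+    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
+    import org.apache.mahout.cf.taste.model.DataModel;
+    import org.apache.mahout.cf.taste.recommender.Recommender;
+    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
+    
+    public class FirstRecommender {
+      public static void main(String[] args) throws Exception {
+        DataModel model = new FileDataModel(new File("ratings.csv"));
+        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
+        Recommender recommender = new GenericItemBasedRecommender(model, similarity);
+        // Top-5 recommendations for user 1234
+        System.out.println(recommender.recommend(1234L, 5));
+      }
+    }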
+
+If you want to do content-based item-item similarity, you need to implement
+your own ItemSimilarity.
+
+If your data can be simply exported to a CSV file, use FileDataModel and
+push new files periodically.
+If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
+counterpart if appropriate, or its PostgreSQL counterpart, etc.) and put on
+top a ReloadFromJDBCDataModel.
+
+This should give a reasonable starter system which responds fast. The
+nature of the system is that new data comes in from the file or database
+only periodically -- perhaps on the order of minutes. If that's not OK,
+you'll have to look into some more specialized work -- SlopeOneRecommender
+deals with updates quickly, or it is possible to do some work to update
+the GenericDataModel in real time.

Added: mahout/site/trunk/content/reference-reading.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/reference-reading.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/reference-reading.mdtext (added)
+++ mahout/site/trunk/content/reference-reading.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,216 @@
+Title: Reference Reading
+<a name="ReferenceReading-GeneralClustering"></a>
+# General Clustering
+
+<a name="ReferenceReading-Discussions"></a>
+## Discussions
+
+* [clustering tips and tricks ](http://www.lucidimagination.com/search/document/1c3561d17fc1b81c/clustering_techniques_tips_and_tricks)
+
+<a name="ReferenceReading-TextClustering"></a>
+# Text Clustering
+
+<a name="ReferenceReading-ClusteringaspartofSearch"></a>
+## Clustering as part of Search
+
+* See Chapters on Hierarchical and Flat Clustering as part of search in the [stanford ir book](http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html)
+
+<a name="ReferenceReading-GeneralBackgroundMaterials"></a>
+# General Background Materials
+
+[Q:](http://mail-archives.apache.org/mod_mbox/mahout-user/201103.mbox/%3CAANLkTi=c8kGHjvbfTCt0GyjKvvMCq=EaFsCr=_s=ykDE@mail.gmail.com%3E)
+Can someone recommend good books on Statistics and also on Linear Algebra
+and Analytic Geometry which will provide enough background for
+understanding machine learning algorithms?
+
+<a name="ReferenceReading-"></a>
+## 
+
+The answers below focus on general background knowledge, rather than
+specifics of Mahout and associated Apache tooling. Feel free to add useful
+resources (books, but also videos, online courseware, tools), particularly
+those that are available free online.
+
+This page originated in an email thread, and its different contributors
+might not all agree on the best approach (and they might not know what's
+best for any given learner), but the resources here should give some idea
+of suitable background reading. Check the mailing list [archives](http://mail-archives.apache.org/mod_mbox/mahout-user/)
+ if you care to figure out who-said-what, or find other suggestions.
+
+Don't be overwhelmed by all the maths; you can do a lot in Mahout with some
+basic knowledge. The resources given here will help you understand your
+data better, and ask better questions both of Mahout's APIs, and also of
+the Mahout community. And unlike learning some particular software tool,
+these are skills that will remain useful decades later.
+
+### Books and supporting materials on statistics, machine learning, matrices etc.
+
+[Gilbert Strang](http://www-math.mit.edu/~gs)'s
+[Introduction to Linear Algebra](http://math.mit.edu/linearalgebra/)
+(*full text* online, highly recommended by several on the mahout list).
+([openlibrary](http://openlibrary.org/works/OL3285486W/Introduction_to_linear_algebra))
+His lectures are also [available online](http://web.mit.edu/18.06/www/)
+and are strongly recommended. See [http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/](http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/)
+
+
+"Mathematical Tools for Applied Mulitvariate Analysis" by J.Douglass
+Carroll.
+([amazon](http://www.amazon.com/Mathematical-Tools-Applied-Multivariate-Analysis/dp/0121609553/ref=sr_1_1?ie=UTF8&qid=1299602805&sr=8-1)
+)
+
+
+
+[Stanford Machine Learning online courseware](http://www.stanford.edu/class/cs229/)
+(cs229.stanford.edu):
+
+"It's a very nicely taught course with super helpful lecture notes - and
+you can get all the videos in youtube or [iTunesU](http://itunes.apple.com/itunes-u/machine-learning/id384233048)
+"
+
+"The [section notes](http://www.stanford.edu/class/cs229/materials.html)
+ for this course will give you enough review material on linear algebra and
+probability theory to get you going."
+
+[MIT Machine Learning online courseware](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/)
+ (6.867) has [lecture notes in PDF](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/)
+ online.
+
+As a pre-requisite to probability and statistics, you'll need [basic calculus](http://en.wikipedia.org/wiki/Calculus)
+.  A maths for scientists text might be useful here such as 'Mathematics
+for Engineers and Scientists', Alan Jeffrey, Chapman & Hall/CRC.
+([openlibrary](http://openlibrary.org/books/OL3305993M/Mathematics_for_engineers_and_scientists)
+)
+
+
+One of the best writers in the probability/statistics world is Sheldon
+Ross.  Try
+''A First Course in Probability (8th Edition), Pearson'' ([amazon](http://www.pearsonhighered.com/educator/product/First-Course-in-Probability-A/9780136033134.page)
+) and then move on to his ''Introduction to Probability Models (9th
+Edition), Academic
+Press.'' ([amazon](http://www.amazon.com/Introduction-Probability-Models-Sixth-Sheldon/dp/0125984707))
+
+
+Some good introductory alternatives here are:
+
+[Kahn Academy](http://www.khanacademy.org/)
+ -- videos on stats, probability, linear algebra
+
+Probability and Statistics (7th Edition), Jay L. Devore, Chapman.
+([amazon](http://www.amazon.com/Probability-Statistics-Engineering-Sciences-InfoTrac/dp/0534399339)
+)
+
+Probability and Statistical Inference (7th Edition), Hogg and Tanis,
+Pearson.
+([amazon](http://www.amazon.com/Probability-Statistical-Inference-Robert-Hogg/dp/0132546086)
+)
+
+Once you have a grasp of the basics then there are a slew of great texts
+that you might consult:  for example,
+
+Statistical Inference, Casella and Berger, Duxbury/Thomson Learning.
+([amazon](http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126)
+)
+
+Most statistics books will have some sort of introduction to Bayesian
+methods; consider a specialist text, e.g.:
+
+Introduction to Bayesian Statistics (2nd Edition), William H. Bolstad,
+Wiley.
+([amazon](http://www.amazon.com/Introduction-Bayesian-Statistics-William-Bolstad/dp/0471270202)
+)
+
+Then, for the computational side of Bayesian statistics (predominantly
+Markov chain Monte Carlo), see e.g.
+Bolstad's Understanding Computational Bayesian Statistics, Wiley.
+([amazon](http://www.amazon.com/Understanding-Computational-Bayesian-Statistics-Wiley/dp/0470046090)
+)
+
+Then you might try [Bayesian Data Analysis, Gelman et al., Chapman & Hall/CRC](http://www.stat.columbia.edu/~gelman/book/)
+
+On top of the books, [R](http://en.wikipedia.org/wiki/R_(programming_language))
+ is an indispensable software tool for visualizing distributions and
+doing calculations.
+
+
+
+(another viewpoint)
+
+For statistics related to machine learning, I would avoid normal
+statistical texts and go with these instead
+
+[Pattern Recognition and Machine Learning by Chris Bishop](http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm)
+
+[Elements of Statistical Learning](http://www-stat.stanford.edu/~tibs/ElemStatLearn/)
+ by Trevor Hastie, Robert Tibshirani, Jerome Friedman 
+
+Also [http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm](http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm)
+ 
+matrix computations/decomposition/factorization etc.?
+
+[How's this one?](http://www.amazon.com/gp/product/0801854148/ref=s9_simh_gw_p14_d0_i1?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-3&pf_rd_r=0ESQ3KDY8MJ1AWWG8PFR&pf_rd_t=101&pf_rd_p=470938811&pf_rd_i=507846)
+any idea? any other suggestion?
+
+I found the one by Peter V. O'Neil, "Introduction to Linear Algebra", to be
+a great book for beginners
+(with some knowledge of calculus). It is not comprehensive, but, I believe,
+it will be a good place to start; the author begins by explaining the
+concepts with regard to vector spaces, which I found to be a more natural
+way of explaining. [http://www.amazon.com/Introduction-Linear-Algebra-Theory-Applications/dp/053400606X](http://www.amazon.com/Introduction-Linear-Algebra-Theory-Applications/dp/053400606X)
+
+David S. Watkins "Fundamentals of Matrix Computations (Pure and Applied
+Mathematics: A Wiley Series of Texts, Monographs and Tracts)"
+[http://www.amazon.com/Fundamentals-Matrix-Computations-Applied-Mathematics/dp/0470528338/](http://www.amazon.com/Fundamentals-Matrix-Computations-Applied-Mathematics/dp/0470528338/)
+
+
+
+The Golub / Van Loan text you mention is the classic text for numerical
+linear algebra.  Can't go wrong with it.  However, I'd also suggest you
+look
+at Nick Trefethen's "Numerical Linear Algebra".  It's a bit more
+approachable for practitioners -- GVL is better suited for researchers.
+[http://people.maths.ox.ac.uk/trefethen/books.html](http://people.maths.ox.ac.uk/trefethen/books.html)
+[http://people.maths.ox.ac.uk/trefethen/text.html](http://people.maths.ox.ac.uk/trefethen/text.html)
+ (with some online lecture notes)
+
+
+I think this is the most relevant book for matrix math on distributed
+systems:
+
+[http://www.amazon.com/Numerical-Linear-Algebra-Lloyd-Trefethen/dp/0898713617](http://www.amazon.com/Numerical-Linear-Algebra-Lloyd-Trefethen/dp/0898713617)
+Many chapters on SVD, there are even chapters on Lanczos
+
+
+BTW what about R? There are tons of books in the R series devoted
+to rather isolated problems, but what would be a good crash course
+book?
+
+
+Ted Dunning:
+
+"I have found that learning about R is a difficult thing.  The best
+introduction I have seen is, paradoxically, not really a book about R and
+assumes a statistical mind-set that I disagree with.  That introduction is
+in MASS [http://www.stats.ox.ac.uk/pub/MASS4/](http://www.stats.ox.ac.uk/pub/MASS4/)
+.  Other references also
+exist:
+
+[http://www.r-tutor.com/r-introduction](http://www.r-tutor.com/r-introduction)
+[http://cran.r-project.org/doc/manuals/R-intro.pdf](http://cran.r-project.org/doc/manuals/R-intro.pdf)
+[http://faculty.washington.edu/tlumley/Rcourse/](http://faculty.washington.edu/tlumley/Rcourse/)
+
+In addition, you should see how to plot data well:
+
+[http://www.statmethods.net/advgraphs/trellis.html](http://www.statmethods.net/advgraphs/trellis.html)
+[http://had.co.nz/ggplot2/](http://had.co.nz/ggplot2/)
+
+Generally, I learn more about R by watching people and reading code than by
+reading books. There are many small tricks (like how to format data
+optimally, how to restructure data.frames, common ways to plot data, and
+which libraries do what) that an introductory book cannot convey, along
+with general principles that will see you through to success."
+
+For Javascript/Web plotting: [http://www.1stwebdesigner.com/css/top-jquery-chart-libraries-interactive-charts/](http://www.1stwebdesigner.com/css/top-jquery-chart-libraries-interactive-charts/)

Added: mahout/site/trunk/content/restricted-boltzmann-machines.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/restricted-boltzmann-machines.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/restricted-boltzmann-machines.mdtext (added)
+++ mahout/site/trunk/content/restricted-boltzmann-machines.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,43 @@
+Title: Restricted Boltzmann Machines
+NOTE: This implementation is a work in progress, at least until September
+2010.
+
+The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT-375)
+. 
+
+<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a>
+### Boltzmann Machines
+Boltzmann Machines are a type of stochastic neural network that closely
+resemble physical processes. They define a network of units with an overall
+energy that is evolved over a period of time, until it reaches thermal
+equilibrium. 
+
+However, the convergence speed of Boltzmann machines that have
+unconstrained connectivity is low.
+
+<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a>
+### Restricted Boltzmann Machines
+Restricted Boltzmann Machines are a variant that is 'restricted' in the
+sense that connections between hidden units of a single layer are _not_
+allowed. In addition, stacking multiple RBM's is also feasible, with the
+activities of the hidden units forming the base for a higher-level RBM. The
+combination of these two features renders RBM's highly usable for
+parallelization. 
+
+In the Netflix Prize, RBM's offered distinctly orthogonal predictions to
+SVD and k-NN approaches, and contributed immensely to the final solution.
+
+<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a>
+### RBM's in Apache Mahout
+An implementation of Restricted Boltzmann Machines is being developed for
+Apache Mahout as a Google Summer of Code 2010 project. A recommender
+interface will also be provided. The key aims of the implementation are:
+1. Accurate - should replicate known results, including those of the Netflix
+Prize
+1. Fast - the implementation uses MapReduce, so it should be fast
+1. Scalable - should scale to large datasets, with a design whose critical
+parts don't create a dependency between the amount of memory on your cluster
+machines and the size of your dataset
+
+You can view the patch as it develops [here](http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm)
+.

Added: mahout/site/trunk/content/rowsimilarityjob.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/rowsimilarityjob.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/rowsimilarityjob.mdtext (added)
+++ mahout/site/trunk/content/rowsimilarityjob.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,35 @@
+Title: RowSimilarityJob
+A brief description of RowSimilarityJob:
+
+(originally from [mailing list](http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/browser)
+)
+
+The goal is to compute all pairwise similarities between the rows of a
+sparse matrix A.
+
+The computation should be executed in a way that only rows that have at
+least one non-zero value in the same dimension (column) are compared. We
+need this to avoid a quadratic number of pairwise comparisons.
+Furthermore we should be able to 'embed' arbitrary similarity measures
+and we should always be able to use a combiner in all MapReduce steps.
+
+The computation is executed using three MapReduce passes:
+
+In the first step, the rows of A are preprocessed via
+VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
+scaled to unit length), a single number for each row of A is computed
+via VectorSimilarityMeasure.norm() (e.g. the L1 norm), and A' (the
+transpose of A) is formed.
+
+The second step operates on the rows of A' (the columns of A). The
+mapper sums up all pairwise cooccurrences using
+VectorSimilarityMeasure.aggregate() (as vectors, thereby using the
+so-called 'stripes' pattern). The reducer sums up all cooccurrence vectors
+for one row and uses the similarity measure and the precomputed numbers
+from step one to compute all similarities via
+VectorSimilarityMeasure.similarity().
+
+The third step ensures that only the top k similar rows per row are kept.
+
+It's hard to see from the code but actually the job performs the matrix
+multiplication AA' via outer products with a modified (similarity
+measure specific) dot product.
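+
+To make that division of labour concrete, here is an illustrative
+(non-Mahout) sketch of how plain cosine similarity decomposes into a
+per-row norm, a per-pair aggregation of cooccurrences, and a final
+similarity computation. The class and method names are placeholders; the
+real VectorSimilarityMeasure signatures may differ.
+
+    /**
+     * Illustrative only: how cosine similarity splits into the pieces that
+     * RowSimilarityJob computes in separate MapReduce steps.
+     */
+    public class CosinePieces {
+    
+      /** Step 1 (norm): the per-row number kept here is the squared L2 norm. */
+      static double norm(double[] row) {
+        double sumOfSquares = 0.0;
+        for (double v : row) {
+          sumOfSquares += v * v;
+        }
+        return sumOfSquares;
+      }
+    
+      /** Step 2 (aggregate): contribution of one shared column to the pair's dot product. */
+      static double aggregate(double valueA, double valueB) {
+        return valueA * valueB;
+      }
+    
+      /** Step 3 (similarity): combine the summed dot product with the precomputed norms. */
+      static double similarity(double dotProduct, double normA, double normB) {
+        return dotProduct / Math.sqrt(normA * normB);
+      }
+    }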

Added: mahout/site/trunk/content/sample-clusters-animation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/sample-clusters-animation.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/sample-clusters-animation.mdtext (added)
+++ mahout/site/trunk/content/sample-clusters-animation.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,53 @@
+Title: Sample Clusters Animation
+<a name="SampleClustersAnimation-DemoAnimation"></a>
+# Demo Animation
+
+This is an animation made from screen caps of all the 
+o.a.m.clustering.display.Display*.java demo apps.
+
+![animation](attachments/27825557/28016730.gif)
+
+All of the programs used the same set of random samples. The ellipses show
+the different clustering algorithms. For more details, see the programs.
+
+_This animation was made with [giftedmotion.jar](http://www.onyxbits.de/giftedmotion)
+. What a border!_
+<a name="SampleClustersAnimation-ScreenCaptures"></a>
+# Screen Captures
+<a name="SampleClustersAnimation-DisplayClustering"></a>
+## DisplayClustering
+The original random dataset with semi-illustrative ellipses.
+
+![Clustering](attachments/27825557/28016732.png)
+
+<a name="SampleClustersAnimation-DisplayCanopy.java"></a>
+## DisplayCanopy.java
+![Canopy](attachments/27825557/28016731.png)
+<a name="SampleClustersAnimation-DisplayKMeans.java"></a>
+## DisplayKMeans.java
+KMeans algorithm with _significance_ set to 5% or better.
+
+![KMeans_5%min](attachments/27825557/28016735.png)
+<a name="SampleClustersAnimation-DisplayDirichlet.java"></a>
+## DisplayDirichlet.java
+Dirichlet Process algorithm, based on a normal distribution, with
+_significance_ set to 5% or better.
+
+![DirichletProcess_NormalDistribution_5%min](attachments/27825557/28016733.png)
+
+<a name="SampleClustersAnimation-DisplayFuzzyKMeans.java"></a>
+## DisplayFuzzyKMeans.java
+
+![FuzzyKMeans_5%min](attachments/27825557/28016734.png)
+<a name="SampleClustersAnimation-DisplayMeanShift.java"></a>
+## DisplayMeanShift.java
+MeanShift variant of Canopy algorithm, with significance set to 2%. (In the
+code it seems like this should be 5%?)
+![MeanShiftCanopy_2%min](attachments/27825557/28016736.png)
+
+<a name="SampleClustersAnimation-DisplaySpectralKMeans.java"></a>
+## DisplaySpectralKMeans.java
+When this works. And when someone wants to redo the animation.
+
+
+

Added: mahout/site/trunk/content/spectral-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/spectral-clustering.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/spectral-clustering.mdtext (added)
+++ mahout/site/trunk/content/spectral-clustering.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,262 @@
+Title: Spectral Clustering
+Spectral clustering, a more powerful and specialized algorithm (compared to
+K-means), derives its name from spectral analysis of a graph, which is how
+the data are represented. Each object to be clustered can initially be
+represented as an _n_\-dimensional numeric vector, but the difference with
+this algorithm is that there must also be some method for performing a
+comparison between each object and expressing this comparison as a scalar.
+
+This _n_ by _n_ comparison of all objects with all others forms the
+_affinity_ matrix, which can be intuitively thought of as a rough
+representation of an underlying undirected, weighted, and fully-connected
+graph whose edges express the relative relationships, or affinities,
+between each pair of objects in the original data. This affinity matrix
+forms the basis from which the two spectral clustering algorithms operate.
+
+The equation by which the affinities are calculated can vary depending on
+the user's circumstances; typically, the equation takes the form of:
+
+exp( _-d^2_ / _c_ )
+
+where _d_ is the Euclidean distance between a pair of points, and _c_ is a
+scaling factor. _c_ is often calculated relative to a _k_\-neighborhood of
+closest points to the current point; all other affinities are set to 0
+outside of the neighborhood. Again, this formula can vary depending on the
+situation (e.g. a fully-connected graph would ignore the _k_\-neighborhood
+and calculate affinities for all pairs of points).
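+
+For illustration (this is not Mahout code, and the scaling factor below is
+an arbitrary example), the affinity of a single pair of points under this
+formula can be computed as:
+
+    /** Illustrative computation of the affinity exp(-d^2 / c) described above. */
+    public class Affinity {
+    
+      static double affinity(double[] a, double[] b, double c) {
+        double squaredDistance = 0.0;
+        for (int i = 0; i < a.length; i++) {
+          double diff = a[i] - b[i];
+          squaredDistance += diff * diff;
+        }
+        return Math.exp(-squaredDistance / c);
+      }
+    
+      public static void main(String[] args) {
+        double[] p = {0.0, 0.0};
+        double[] q = {1.0, 2.0};
+        System.out.println(affinity(p, q, 2.0));  // c = 2.0 is an arbitrary scaling factor
+      }
+    }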
+
+[Full overview on spectral clustering](http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/)
+
+<a name="SpectralClustering-K-MeansSpectralClustering"></a>
+## K-Means Spectral Clustering
+
+<a name="SpectralClustering-Overview"></a>
+### Overview
+
+This consists of a few basic steps of generalized spectral clustering,
+followed by standard k-means clustering over the intermediate results.
+Again, this process begins with an affinity matrix *A* \- whether or not it
+is fully-connected depends on the user's need.
+
+*A* is then transformed into a pseudo-Laplacian matrix via a multiplication
+with a diagonal matrix whose entries consist of the sums of the rows of
+*A*. The sums are modified to be the inverse square root of their original
+values. The final operation looks something like:
+
+L = D^{-1/2} A D^{-1/2}
+
+*L* has some properties that are of interest to us; most importantly, while
+it is symmetric like *A*, it has a more stable eigen-decomposition. *L* is
+decomposed into its constituent eigenvectors and corresponding eigenvalues
+(though the latter will not be needed for future calculations); the matrix
+of eigenvectors, *U*, is what we are now interested in.
+
+Assuming *U* is a column matrix (the eigenvectors comprise the columns),
+then we will now use the _rows_ of *U* as proxy data for the original data
+points. We will run each row through standard K-means clustering, and the
+label that each proxy point receives will be transparently assigned to the
+corresponding original data point, resulting in the final clustering
+assignments.
+
+[Full overview on k-means spectral clustering](http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/)
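+
+For intuition only, here is a minimal, non-distributed R sketch of the
+same steps (the Mahout driver performs the equivalent work with the
+map-reduce jobs listed below and a Lanczos solver). It assumes _A_ is a
+symmetric affinity matrix and _k_ is the desired number of clusters; the
+function name is arbitrary:
+
+    # Spectral k-means sketch: normalize, decompose, then cluster the rows of U.
+    spectral.kmeans <- function(A, k) {
+      d <- rowSums(A)
+      Dinvsqrt <- diag(1 / sqrt(d))
+      L <- Dinvsqrt %*% A %*% Dinvsqrt                 # L = D^{-1/2} A D^{-1/2}
+      U <- eigen(L, symmetric = TRUE)$vectors[, 1:k]   # top-k eigenvectors as columns
+      U <- U / sqrt(rowSums(U^2))                      # unit-normalize each row
+      kmeans(U, centers = k)$cluster                   # labels for the original points
+    }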
+
+<a name="SpectralClustering-Implementation"></a>
+### Implementation
+
+The Mahout implementation consists of a single driver -
+SpectralKMeansDriver - calling upon several common utilities. The driver
+performs the operations in sequential fashion: reading in and constructing
+the affinity matrix, building the diagonal matrix, building the
+pseudo-Laplacian and decomposing it, and clustering the components.
+
+The affinity matrix input is the most important part of this algorithm. It
+consists of text files which follow a specific format: that of a weighted,
+undirected graph. In order to represent a graph in text files, each line of
+a text file represents a single directional edge between two nodes. There
+are three comma-separated values on the line. The first number indicates
+the source node, the second is the destination node, and the third is the
+weight. For example:
+
+0, 1, 2.5
+
+would indicate the directional edge from node 0 to node 1 has a weight of
+2.5. *Please note: as of 8/16/2010, Eigencuts assumes the affinity matrix
+is symmetric, hence there should be a corresponding line in the text file
+of: 1, 0, 2.5.* Also, each node should be an integer value.
+
+*Note*: per MAHOUT-978, do not name your input files with a leading '_'
+(or '.').
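+
+As a quick illustration of the format (this snippet is not part of
+Mahout, and the file name is just a placeholder), the triples can be read
+into a dense symmetric matrix in R as follows:
+
+    # Read "source, destination, weight" triples into a dense affinity matrix.
+    triples <- read.csv("affinity.txt", header = FALSE,
+                        col.names = c("src", "dst", "w"))
+    n <- max(triples$src, triples$dst) + 1    # nodes are numbered from 0
+    A <- matrix(0, n, n)
+    A[cbind(triples$src + 1, triples$dst + 1)] <- triples$w
+    stopifnot(isSymmetric(A))                 # the algorithms assume symmetry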
+
+M/R jobs written for SpectralKMeans:
+* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
+* MatrixDiagonalizeJob (constructs the diagonal matrix)
+* UnitVectorizerJob (converts the eigenvector matrix *U* to unit rows)
+* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
+
+M/R jobs already in Mahout that were used:
+* DistributedRowMatrix.transpose()
+* DistributedLanczosSolver
+* EigenVerifierJob
+* KMeansDriver
+
+<a name="SpectralClustering-EigencutsSpectralClustering"></a>
+## Eigencuts Spectral Clustering
+
+<a name="SpectralClustering-Overview"></a>
+### Overview
+
+Intuitively, Eigencuts can be thought of as part 2 of the K-means
+algorithm, in that it performs the same initial steps up until the k-means
+clustering. The algorithm uses the same affinity matrix *A*, constructs the
+same diagonal matrix *D*, performs the same multiplication to create the
+pseudo-Laplacian *L*, and conducts an eigen-decomposition on *L* to obtain
+the eigenvectors and eigenvalues. This is where the two algorithms
+diverge.
+
+For each eigenvector, we wish to determine how stable its flow of
+probability is within the underlying graph of the original data.
+Intuitively, this is intrinsically related to the min-cut, max-flow problem
+of finding bottlenecks: if we perturb the flow rate on a specific edge, and
+overall the flow is stable, then we can conclude that this edge was not a
+bottleneck. If, however, perturbing an edge significantly alters the
+overall flow, we know this edge's eigenflow is very unstable and is a
+bottleneck.
+
+We have an [explicit form](http://spectrallyclustered.files.wordpress.com/2010/07/sensitivityequation.png)
+ of this "sensitivity" calculation ([full post here, under "computing sensitivities"](http://spectrallyclustered.wordpress.com/2010/07/15/sprint-3-three-last-mr-tasks/)).
+The next step is called "non-maximal suppression": any calculated
+sensitivity is ignored ("suppressed") if a more negative sensitivity
+exists in the same neighborhood.
+
+This non-maximal suppression then plays a role in the final
+affinity-cutting step, where we "cut" the affinity (set to 0) between any
+two nodes (effectively destroying the edge between them) for which the
+sensitivity calculated at that edge passes some threshold, _and_ for which
+the sensitivity was _not_ suppressed in the previous step.
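+
+To make the suppression-and-cut step concrete, here is a small
+illustrative R sketch (not Mahout code). It assumes _S_ already holds the
+computed sensitivities, uses an arbitrary threshold, and takes an edge's
+row and column as a simple stand-in for its neighborhood:
+
+    # Cut affinities whose sensitivity passes the threshold and is not suppressed.
+    eigencuts.cut <- function(A, S, threshold = -0.1) {
+      row.min <- apply(S, 1, min)
+      col.min <- apply(S, 2, min)
+      # S[i, j] is suppressed if a more negative entry exists in row i or column j
+      suppressed <- sweep(S, 1, row.min, ">") | sweep(S, 2, col.min, ">")
+      cut <- (S < threshold) & !suppressed
+      A[cut | t(cut)] <- 0                    # destroy the edge in both directions
+      A
+    }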
+
+Once the cutting has been completed, the process loops upon itself
+(starting with the recalculation of *D* using the modified *A*) until no
+new cuts in *A* are made in the final step.
+
+[Full overview on Eigencuts spectral clustering](http://spectrallyclustered.wordpress.com/2010/07/06/sprint-3-introduction-to-eigencuts/)
+
+<a name="SpectralClustering-Implementation"></a>
+### Implementation
+
+Since the first half of Eigencuts uses the same calculations as Spectral
+K-means, it uses the same common M/R tasks, both those specific to spectral
+clustering, as well as those general to Mahout. Unlike SpectralKMeans,
+however, there are no DistributedRowMatrix-specific operations performed,
+and hence there is no need for the data type at all; Mahout Vectors are
+used heavily instead.
+
+Once the initial affinity matrix is constructed, there is a loop within the
+EigencutsDriver over the calculation of *D*, the creation of *L* and its
+eigen-decomposition, the calculation of the sensitivities, and the actual
+affinity cuts, such that the loop terminates only when no cuts are made to
+the affinity matrix *A*. The final matrix *A* will then be representative
+of a graph structure whose data points exist in intra-connected clusters.
+
+M/R tasks specific to Eigencuts:
+* EigencutsSensitivityJob (calculates the perturbation effects on edge
+weights)
+* EigencutsAffinityCutsJob (sets edge weights to 0)
+
+M/R tasks within spectral clustering:
+* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
+* MatrixDiagonalizeJob (constructs the diagonal matrix)
+* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
+
+M/R tasks general to Mahout:
+* DistributedLanczosSolver
+* EigenVerifierJob
+
+<a name="SpectralClustering-Quickstart"></a>
+## Quickstart
+
+As noted before, the data for both these algorithms - the affinity matrix -
+is required to be symmetric. As of the first release, there is no built-in
+mechanism in Mahout for generating the affinities from raw data, as the
+formula this follows varies depending on the user's need, so it is left as
+an exercise to the user to generate the affinities prior to using these
+algorithms.
+
+The affinity input should follow the standard format for textual
+representation of a graph, namely:
+
+node_i, node_j, value_ij
+
+For example, the following 3x3 affinity matrix:
+
+<table>
+<tr><td> 0.0 </td><td> 0.8 </td><td> 0.5 </td></tr>
+<tr><td> 0.8 </td><td> 0.0 </td><td> 0.9 </td></tr>
+<tr><td> 0.5 </td><td> 0.9 </td><td> 0.0 </td></tr>
+</table>
+
+would have the following textual input format:
+
+0, 0, 0
+0, 1, 0.8
+0, 2, 0.5
+1, 0, 0.8
+1, 1, 0
+1, 2, 0.9
+2, 0, 0.5
+2, 1, 0.9
+2, 2, 0
+
+Then simply invoke Spectral K-means or Eigencuts, using the input directory
+and setting the other parameters as necessary (e.g. "k" for K-means, "beta"
+for Eigencuts, etc).
+
+Format *rules*: node numbering begins at zero and must be continuous;
+avoid whitespace, comments, etc.
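+
+Since the affinities must be generated outside of Mahout, a short R
+sketch such as the following can turn raw points into the expected
+comma-separated triples. The file names and the scaling constant are
+arbitrary, and a fully-connected graph is assumed:
+
+    # Write a fully-connected Gaussian affinity matrix as "i, j, value" triples.
+    points <- as.matrix(read.csv("rawdata.csv", header = FALSE))
+    A <- exp(-as.matrix(dist(points))^2 / 2)  # scaling constant c = 2, arbitrary
+    diag(A) <- 0
+    n <- nrow(A)
+    triples <- cbind(rep(0:(n - 1), each = n),   # source node
+                     rep(0:(n - 1), times = n),  # destination node
+                     as.vector(t(A)))            # edge weight
+    write.table(triples, "affinity.txt", sep = ", ",
+                row.names = FALSE, col.names = FALSE)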
+
+<a name="SpectralClustering-Examples"></a>
+## Examples
+
+*NOTE: still awaiting Carnegie Mellon/Univ. of Pittsburgh approval for the
+official data set.*
+
+For these algorithms, it is useful to have a viable example, so I have
+created a small but effective synthetic data set to show how these
+algorithms operate. The raw data was generated synthetically and [can be viewed here](http://dl.dropbox.com/u/1377610/rawdata.csv)
+. It consists of 450 two-dimensional points drawn from 3 separate Gaussian
+distributions, each with its own mean and standard deviation ([here is an image of the plotted points](http://spectrallyclustered.files.wordpress.com/2010/07/clusters.png)).
+The same data in affinity matrix form looks like this: [view the affinity data set](http://dl.dropbox.com/u/1377610/affinity.csv).
+
+In order to run the program, then, for spectral k-means:
+
+    bin/mahout spectralkmeans -i /path/to/directory/with/affinity/matrix -o /output/path -k 3 -d 450
+
+and for eigencuts:
+
+    bin/mahout eigencuts -i /path/to/directory/with/affinity/matrix -o /output/path -b 2 -d 450
+
+In both cases, the "-d" flag refers to the dimensionality of the affinity
+matrix; since there are 450 points, this value will be 450. In spectral
+k-means, the "-k" is the number of clusters to estimate. In Eigencuts, the
+"-b" refers to an initial estimate on the half-life of the flow of
+probability in the graph; higher estimates correspond to fewer clusters.
+
+Spectral k-means will eventually yield the clustering results, and there
+should only be a single mistake: point 54. This is because that particular
+point is in between the two left-hand clusters in the image, and so has
+high affinities with both clusters.
+
+[Full overview of example here](http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/)
+
+<a name="SpectralClustering-Resources"></a>
+### Resources
+
+* [http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/](http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/)
+* [http://www.stanford.edu/class/ee378B/papers/luxburg-spectral.pdf](http://www.stanford.edu/class/ee378B/papers/luxburg-spectral.pdf)
+* [http://en.wikipedia.org/wiki/Laplacian_matrix](http://en.wikipedia.org/wiki/Laplacian_matrix)
+* [http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering](http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering)

Added: mahout/site/trunk/content/stochastic-singular-value-decomposition.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/stochastic-singular-value-decomposition.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/stochastic-singular-value-decomposition.mdtext (added)
+++ mahout/site/trunk/content/stochastic-singular-value-decomposition.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,73 @@
+Title: Stochastic Singular Value Decomposition
+Stochastic SVD method in Mahout produces reduced rank Singular Value
+Decomposition output in its strict mathematical definition: A=USV'.
+
+<a name="StochasticSingularValueDecomposition-Thebenefitsoverothermethodsare:"></a>
+##### The benefits over other methods are: 
+* reduced flops required compared to Krylov subspace methods
+* In the map-reduce world, a fixed number of MR iterations is required
+regardless of the rank requested
+* Tweak precision/speed balance with options.
+* A is a Distributed Row Matrix where rows may be identified by any
+Writable (such as a document path). As such, it would work directly on the
+output of seq2sparse. 
+* As of 0.7 trunk, includes PCA and dimensionality reduction workflow
+(EXPERIMENTAL! Feedback on performance/other PCA related issues/ blogs is
+greatly appreciated.)
+
+Map-reduce characteristics:
+SSVD uses at most 3 _sequential_ MR steps (map-only + map-reduce + 2
+optional parallel map-reduce jobs) to produce the reduced-rank
+approximations of the U, V and S matrices. Additionally, two more
+map-reduce steps are added for each power iteration step, if requested.
+
+<a name="StochasticSingularValueDecomposition-Potentialdrawbacks:"></a>
+##### Potential drawbacks: 
+* potentially less precise (but adding even one power iteration seems to
+fix that quite a bit).
+
+<a name="StochasticSingularValueDecomposition-Documentation"></a>
+##### Documentation
+
+[Overview and Usage](^ssvd-cli.pdf.html)
+Note: please use 0.6 or later; for the PCA workflow, please use 0.7 or later.
+
+<a name="StochasticSingularValueDecomposition-Publications"></a>
+##### Publications
+
+[Nathan Halko's dissertation](http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf)
+ "Randomized methods for computing low-rank
+approximations of matrices" contains comprehensive definition of
+parallelization strategy taken in Mahout SSVD implementation and also some
+precision/scalability benchmarks, esp. w.r.t. Mahout Lanczos implementation
+on a typical corpus data set.
+
+<a name="StochasticSingularValueDecomposition-Rsimulation"></a>
+##### R simulation
+[Non-parallel SSVD simulation in R with power iterations and PCA options](^ssvd.r.html)
+. Note that this implementation is not an optimal sequential solver; it is
+intended for demonstration purposes only.
+
+However, try this R code to simulate a meaningful input:
+*tests.R*:
+
+    n<-1000
+    m<-2000
+    k<-10
+    
+    qi<-1
+    
+    #simulated input
+    svalsim<-diag(k:1)
+    
+    usim<- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k)))
+    vsim<- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k)))
+    
+    
+    x<- usim %*% svalsim %*% t(vsim) 
+    
+
+
+and try to compare ssvd.svd(x) and stock svd(x) performance for the same
+rank k; notice the difference in running time. Also play with power
+iterations (qIter) and compare the accuracy of standard SVD and SSVD.
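+
+If the ssvd.r script is not at hand, the following self-contained R
+sketch of a basic randomized SVD with power iterations can stand in for
+ssvd.svd in that comparison. It is a simplification written for this
+page, not the Mahout or ssvd.r code; the function name and the
+oversampling parameter p are arbitrary:
+
+    # Basic randomized (stochastic) SVD sketch with optional power iterations.
+    rsvd <- function(A, k, p = 10, q = 1) {
+      Omega <- matrix(rnorm(ncol(A) * (k + p)), ncol = k + p)  # random test matrix
+      Q <- qr.Q(qr(A %*% Omega))               # orthonormal basis for the range of A
+      for (i in seq_len(q)) {                  # power iterations sharpen the basis
+        Q <- qr.Q(qr(t(A) %*% Q))
+        Q <- qr.Q(qr(A %*% Q))
+      }
+      B <- t(Q) %*% A                          # small (k+p) x n problem
+      s <- svd(B)
+      list(u = (Q %*% s$u)[, 1:k], d = s$d[1:k], v = s$v[, 1:k])
+    }
+    
+    r <- rsvd(x, k, q = qi)                    # compare r$d with svd(x)$d[1:k]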
+


