mahout-commits mailing list archives

From rawkintr...@apache.org
Subject [20/51] [partial] mahout git commit: New Website courtesy of startbootstrap.com
Date Sat, 02 Dec 2017 06:09:04 GMT
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/classification/bankmarketing-example.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/classification/bankmarketing-example.md b/website-old/docs/tutorials/map-reduce/classification/bankmarketing-example.md
new file mode 100644
index 0000000..e348961
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/classification/bankmarketing-example.md
@@ -0,0 +1,53 @@
+---
+layout: tutorial
+title: (Deprecated)  Bank Marketing Example
+theme:
+    name: retro-mahout
+---
+
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+# Bank Marketing Example
+
+### Introduction
+
+This page describes how to run Mahout's SGD classifier on the [UCI Bank Marketing dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing).
+The goal is to predict whether the client will subscribe to a term deposit offered via a phone call. The features in the dataset consist
+of information such as age, job, and marital status, as well as information about the bank's last contacts with the client.
+
+### Code & Data
+
+The bank marketing example code lives under 
+
+*mahout-examples/src/main/java/org.apache.mahout.classifier.sgd.bankmarketing*
+
+The data can be found at 
+
+*mahout-examples/src/main/resources/bank-full.csv*
+
+### Code details
+
+This example consists of 3 classes:
+
+  - BankMarketingClassificationMain
+  - TelephoneCall
+  - TelephoneCallParser
+
+When you run the main method of BankMarketingClassificationMain, it parses the dataset using the TelephoneCallParser and trains
+a logistic regression model with 20 runs and 20 passes. The TelephoneCallParser uses Mahout's feature vector encoders
+to encode the features in the dataset into a vector. Afterwards the model is tested, and the learning rate, AUC, and accuracy are printed to standard output.
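+
+One way to run it yourself is via Maven's exec plugin from the examples module (a sketch only; the fully qualified main-class name is assumed from the source path above):
+
+    cd $MAHOUT_HOME/examples
+    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.classifier.sgd.bankmarketing.BankMarketingClassificationMain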
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/classification/breiman-example.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/classification/breiman-example.md b/website-old/docs/tutorials/map-reduce/classification/breiman-example.md
new file mode 100644
index 0000000..32f8c44
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/classification/breiman-example.md
@@ -0,0 +1,67 @@
+---
+layout: tutorial
+title: (Deprecated)  Breiman Example
+theme:
+    name: retro-mahout
+---
+
+# Breiman Example
+
+#### Introduction
+
+This page describes how to run the Breiman example, which implements the test procedure described in [Leo Breiman's paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3999&rep=rep1&type=pdf). The basic algorithm is as follows:
+
+ * repeat for *I* iterations; in each iteration:
+  * keep 10% of the dataset apart as a testing set
+  * build two forests using the training set, one with *m = int(log2(M) + 1)* (called Random-Input) and one with *m = 1* (called Single-Input)
+  * choose the forest that gave the lowest out-of-bag (oob) error estimate to compute the test set error
+  * compute the test set error using the Single-Input forest (test error); this demonstrates that even with *m = 1*, Decision Forests give results comparable to those obtained with greater values of *m*
+  * compute the mean test set error using every tree of the chosen forest (tree error); this indicates how well a single Decision Tree performs
+ * compute the mean test error over all iterations
+ * compute the mean tree error over all iterations
+
+
+#### Running the Example
+
+The current implementation is compatible with the [UCI repository](http://archive.ics.uci.edu/ml/) file format. We'll show how to run this example on two datasets:
+
+First, we deal with [Glass Identification](http://archive.ics.uci.edu/ml/datasets/Glass+Identification): download the [dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data) file called **glass.data** and store it on your local machine. Next, we must generate the descriptor file **glass.info** for this dataset with the following command:
+
+    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/glass.data -f /path/to/glass.info -d I 9 N L
+
+Substitute */path/to/* with the folder where you downloaded the dataset. The argument "I 9 N L" describes the nature of the variables: 1
+ignored (I) attribute, followed by 9 numerical (N) attributes, followed by
+the label (L).
+
+Finally, we build and evaluate our random forest classifier as follows:
+
+    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/glass.data -ds /path/to/glass.info -i 10 -t 100
+
+This builds 100 trees (the -t argument) and repeats the test for 10 iterations (the -i argument).
+
+The example outputs the following results:
+
+ * Selection error: mean test error for the selected forest on all iterations
+ * Single Input error: mean test error for the single input forest on all
+iterations
+ * One Tree error: mean single tree error on all iterations
+ * Mean Random Input Time: mean build time for random input forests on all
+iterations
+ * Mean Single Input Time: mean build time for single input forests on all
+iterations
+
+We can repeat this for a [Sonar](http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29) use case: download the [dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data) file called **sonar.all-data** and store it on your local machine. Generate the descriptor file **sonar.info** for this dataset with the following command:
+
+    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L
+
+The argument "60 N L" means 60 numerical (N) attributes, followed by the label (L). As in the previous case, we run the evaluation as follows:
+
+    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/classification/twenty-newsgroups.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/classification/twenty-newsgroups.md b/website-old/docs/tutorials/map-reduce/classification/twenty-newsgroups.md
new file mode 100644
index 0000000..2226e94
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/classification/twenty-newsgroups.md
@@ -0,0 +1,179 @@
+---
+layout: tutorial
+title: (Deprecated)  Twenty Newsgroups
+theme:
+    name: retro-mahout
+---
+
+
+<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a>
+## Twenty Newsgroups Classification Example
+
+<a name="TwentyNewsgroups-Introduction"></a>
+## Introduction
+
+The 20 newsgroups dataset is a collection of approximately 20,000
+newsgroup documents, partitioned (nearly) evenly across 20 different
+newsgroups. The 20 newsgroups collection has become a popular data set for
+experiments in text applications of machine learning techniques, such as
+text classification and text clustering. We will use the [Mahout CBayes](http://mahout.apache.org/users/mapreduce/classification/bayesian.html)
+classifier to create a model that will classify a new document into one of
+the 20 newsgroups.
+
+<a name="TwentyNewsgroups-Prerequisites"></a>
+### Prerequisites
+
+* Mahout has been downloaded ([instructions here](https://mahout.apache.org/general/downloads.html))
+* Maven is available
+* Your environment has the following variables set:
+     * **HADOOP_HOME** refers to where Hadoop lives
+     * **MAHOUT_HOME** refers to where Mahout lives
+
+<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a>
+### Instructions for running the example
+
+1. If running Hadoop in cluster mode, start the Hadoop daemons by executing the following commands:
+
+            $ cd $HADOOP_HOME/bin
+            $ ./start-all.sh
+   
+    Otherwise:
+
+            $ export MAHOUT_LOCAL=true
+
+2. In the trunk directory of Mahout, compile and install Mahout:
+
+            $ cd $MAHOUT_HOME
+            $ mvn -DskipTests clean install
+
+3. Run the [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) by executing:
+
+            $ ./examples/bin/classify-20newsgroups.sh
+
+4. You will be prompted to select a classification algorithm: 
+    
+            1. Complement Naive Bayes
+            2. Naive Bayes
+            3. Stochastic Gradient Descent
+
+Select 1, and the script will perform the following:
+
+1. Create a working directory for the dataset and all input/output.
+2. Download and extract the *20news-bydate.tar.gz* from the [20 newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory.
+3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. 
+4. Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.
+5. Split the preprocessed dataset into training and testing sets. 
+6. Train the classifier.
+7. Test the classifier.
+
+
+Output should look something like:
+
+
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+     a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t <--Classified as
+    381 0  0  0  0  9  1  0  0  0  1  0  0  2  0  1  0  0  3  0 |398 a=rec.motorcycles
+     1 284 0  0  0  0  1  0  6  3  11 0  66 3  0  6  0  4  9  0 |395 b=comp.windows.x
+     2  0 339 2  0  3  5  1  0  0  0  0  1  1  12 1  7  0  2  0 |376 c=talk.politics.mideast
+     4  0  1 327 0  2  2  0  0  2  1  1  0  5  1  4  12 0  2  0 |364 d=talk.politics.guns
+     7  0  4  32 27 7  7  2  0  12 0  0  6  0 100 9  7  31 0  0 |251 e=talk.religion.misc
+     10 0  0  0  0 359 2  2  0  0  3  0  1  6  0  1  0  0  11 0 |396 f=rec.autos
+     0  0  0  0  0  1 383 9  1  0  0  0  0  0  0  0  0  3  0  0 |397 g=rec.sport.baseball
+     1  0  0  0  0  0  9 382 0  0  0  0  1  1  1  0  2  0  2  0 |399 h=rec.sport.hockey
+     2  0  0  0  0  4  3  0 330 4  4  0  5  12 0  0  2  0  12 7 |385 i=comp.sys.mac.hardware
+     0  3  0  0  0  0  1  0  0 368 0  0  10 4  1  3  2  0  2  0 |394 j=sci.space
+     0  0  0  0  0  3  1  0  27 2 291 0  11 25 0  0  1  0  13 18|392 k=comp.sys.ibm.pc.hardware
+     8  0  1 109 0  6  11 4  1  18 0  98 1  3  11 10 27 1  1  0 |310 l=talk.politics.misc
+     0  11 0  0  0  3  6  0  10 6  11 0 299 13 0  2  13 0  7  8 |389 m=comp.graphics
+     6  0  1  0  0  4  2  0  5  2  12 0  8 321 0  4  14 0  8  6 |393 n=sci.electronics
+     2  0  0  0  0  0  4  1  0  3  1  0  3  1 372 6  0  2  1  2 |398 o=soc.religion.christian
+     4  0  0  1  0  2  3  3  0  4  2  0  7  12 6 342 1  0  9  0 |396 p=sci.med
+     0  1  0  1  0  1  4  0  3  0  1  0  8  4  0  2 369 0  1  1 |396 q=sci.crypt
+     10 0  4  10 1  5  6  2  2  6  2  0  2  1 86 15 14 152 0  1 |319 r=alt.atheism
+     4  0  0  0  0  9  1  1  8  1  12 0  3  0  2  0  0  0 341 2 |390 s=misc.forsale
+     8  5  0  0  0  1  6  0  8  5  50 0  40 2  1  0  9  0  3 256|394 t=comp.os.ms-windows.misc
+    =======================================================
+    Statistics
+    -------------------------------------------------------
+    Kappa                                       0.8808
+    Accuracy                                   90.8596%
+    Reliability                                86.3632%
+    Reliability (standard deviation)            0.2131
+
+
+
+
+
+<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a>
+## End to end commands to build a CBayes model for 20 newsgroups
+The [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) issues the following commands as outlined above. We can build a CBayes classifier from the command line by following the process in the script: 
+
+*Be sure that **MAHOUT_HOME**/bin and **HADOOP_HOME**/bin are in your **$PATH***
+
+1. Create a working directory for the dataset and all input/output.
+           
+            $ export WORK_DIR=/tmp/mahout-work-${USER}
+            $ mkdir -p ${WORK_DIR}
+
+2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory.
+
+            $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
+                -o ${WORK_DIR}/20news-bydate.tar.gz
+            $ mkdir -p ${WORK_DIR}/20news-bydate
+            $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
+            $ mkdir ${WORK_DIR}/20news-all
+            $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
+     * If you're running on a Hadoop cluster:
+ 
+            $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
+
+3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. 
+          
+            $ mahout seqdirectory \
+                -i ${WORK_DIR}/20news-all \
+                -o ${WORK_DIR}/20news-seq \
+                -ow
+            
+4. Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document. 
+            
+            $ mahout seq2sparse \
+                -i ${WORK_DIR}/20news-seq \
+                -o ${WORK_DIR}/20news-vectors \
+                -lnorm \
+                -nv \
+                -wt tfidf
+If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. `-ng 2` for bigrams or `-n 2` for L2 length normalization.  See the [Creating vectors from text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) page for a list of all seq2sparse options.
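+
+For instance, a variant of the command above that also extracts bigrams would be (a sketch only, using the same placeholder paths):
+
+            $ mahout seq2sparse \
+                -i ${WORK_DIR}/20news-seq \
+                -o ${WORK_DIR}/20news-vectors \
+                -lnorm \
+                -nv \
+                -wt tfidf \
+                -ng 2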
+
+5. Split the preprocessed dataset into training and testing sets.
+
+            $ mahout split \
+                -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
+                --trainingOutput ${WORK_DIR}/20news-train-vectors \
+                --testOutput ${WORK_DIR}/20news-test-vectors \
+                --randomSelectionPct 40 \
+                --overwrite --sequenceFiles -xm sequential
+ 
+6. Train the classifier.
+
+            $ mahout trainnb \
+                -i ${WORK_DIR}/20news-train-vectors \
+                -el \
+                -o ${WORK_DIR}/model \
+                -li ${WORK_DIR}/labelindex \
+                -ow \
+                -c
+
+7. Test the classifier.
+
+            $ mahout testnb \
+                -i ${WORK_DIR}/20news-test-vectors \
+                -m ${WORK_DIR}/model \
+                -l ${WORK_DIR}/labelindex \
+                -ow \
+                -o ${WORK_DIR}/20news-testing \
+                -c
+
+ 
+       
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/classification/wikipedia-classifier-example.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/classification/wikipedia-classifier-example.md b/website-old/docs/tutorials/map-reduce/classification/wikipedia-classifier-example.md
new file mode 100644
index 0000000..ab80054
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/classification/wikipedia-classifier-example.md
@@ -0,0 +1,57 @@
+---
+layout: tutorial
+title: (Deprecated)  Wikipedia XML parser and Naive Bayes Example
+theme:
+    name: retro-mahout
+---
+# Wikipedia XML parser and Naive Bayes Classifier Example
+
+## Introduction
+Mahout has an [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1] which will download a recent XML dump of the [English Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/) (the entire dump, if desired). After running the classification script, you can use the [document classification script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) [2] from the Mahout [spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset.
+
+You can run this script to build and test a Naive Bayes classifier for either option (1), 10 arbitrary countries, or option (2), 2 countries (United States and United Kingdom).
+
+## Overview
+
+To run the example, simply execute the `$MAHOUT_HOME/examples/bin/classify-wikipedia.sh` script.
+
+By default the script is set to run on a medium-sized Wikipedia XML dump.  To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78 and uncommenting line 80 of [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1]. However, this is not recommended unless you have the resources to do so. *Be sure to clean your work directory when changing datasets (option (3)).*
+
+The step-by-step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4].  The only difference is that instead of running `$mahout seqdirectory` on the unzipped 20 Newsgroups file, you'll run `$mahout seqwiki` on the unzipped Wikipedia XML dump.
+
+    $ mahout seqwiki 
+
+The above command launches `WikipediaToSequenceFile.java`, which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file.  This process will seek to extract documents with a Wikipedia category tag which matches a line in the category file (exactly, if the `-exactMatchOnly` option is set).  If no match is found and the `-all` option is set, the document will be dumped into an "unknown" category. The documents will then be written out as a `<Text,Text>` sequence file of the form (K: /category/document_title, V: document).
+
+There are 3 different example category files available in the *examples/src/test/resources*
+directory: country.txt, country10.txt and country2.txt.  You can edit these categories to extract a different corpus from the Wikipedia dataset.
+
+The CLI options for `seqwiki` are as follows:
+
+    --input          (-i)         input pathname String
+    --output         (-o)         the output pathname String
+    --categories     (-c)         the file containing the Wikipedia categories
+    --exactMatchOnly (-e)         if set, then the Wikipedia category must match
+                                    exactly instead of simply containing the category string
+    --all            (-all)       if set select all categories
+    --removeLabels   (-rl)        if set, remove [[Category:labels]] from document text after extracting label.
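+
+For example, an invocation over an extracted dump might look like this (a sketch only; the input and output paths are placeholders, and the category file is one of the examples mentioned above):
+
+    $ mahout seqwiki -i wikipedia-xml-dump -o wikipedia-seqfiles -c examples/src/test/resources/country10.txt -e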
+
+
+After `seqwiki`, the script runs `seq2sparse`, `split`, `trainnb` and `testnb` as in the [step by step 20newsgroups example](http://mahout.apache.org/users/classification/twenty-newsgroups.html).  When all of the jobs have finished, a confusion matrix will be displayed.
+
+# Resources
+
+[1] [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh)
+
+[2] [Document classification script for the Mahout Spark Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
+
+[3] [Example category file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt)
+
+[4] [Step by step instructions for building a Naive Bayes classifier for 20newsgroups from the command line](http://mahout.apache.org/users/classification/twenty-newsgroups.html)
+
+[5] [Mahout MapReduce Naive Bayes](http://mahout.apache.org/users/classification/bayesian.html)
+
+[6] [Mahout Spark Naive Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html)
+
+[7] [Mahout Scala Spark and H2O Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/20newsgroups.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/20newsgroups.md b/website-old/docs/tutorials/map-reduce/clustering/20newsgroups.md
new file mode 100644
index 0000000..e39d989
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/20newsgroups.md
@@ -0,0 +1,11 @@
+---
+layout: tutorial
+title: (Deprecated)  20Newsgroups
+theme:
+   name: retro-mahout
+---
+
+<a name="20Newsgroups-NaiveBayesusing20NewsgroupsData"></a>
+# Naive Bayes using 20 Newsgroups Data
+
+See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9)

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/canopy-commandline.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/canopy-commandline.md b/website-old/docs/tutorials/map-reduce/clustering/canopy-commandline.md
new file mode 100644
index 0000000..e7f2b21
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/canopy-commandline.md
@@ -0,0 +1,70 @@
+---
+layout: tutorial
+title: (Deprecated)  canopy-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a>
+# Running Canopy Clustering from the Command Line
+Mahout's Canopy clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run Canopy on that cluster. If either of the environment variables is
+missing, then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout canopy <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.
+
+
+<a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout canopy -i testdata -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
+
+
+<a name="canopy-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout canopy -i testdata -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="canopy-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input			     Path to job input directory. Must  
+    					     be a SequenceFile of	    
+    					     VectorWritable		    
+      --output (-o) output			     The directory pathname for output. 
+      --overwrite (-ow)			     If present, overwrite the output	 
+    					     directory before running job   
+      --distanceMeasure (-dm) distanceMeasure    The classname of the	    
+    					     DistanceMeasure. Default is    
+    					     SquaredEuclidean		    
+      --t1 (-t1) t1 			     T1 threshold value 	    
+      --t2 (-t2) t2 			     T2 threshold value 	    
+      --clustering (-cl)			     If present, run clustering after	
+    					     the iterations have taken place	 
+      --help (-h)				     Print out help		    
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md b/website-old/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
new file mode 100644
index 0000000..201e9d8
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
@@ -0,0 +1,53 @@
+---
+layout: tutorial
+title: (Deprecated)  Clustering of synthetic control data
+theme:
+   name: retro-mahout
+---
+
+# Clustering synthetic control data
+
+## Introduction
+
+This example will demonstrate clustering of time series data, specifically control charts. [Control charts](http://en.wikipedia.org/wiki/Control_chart) are tools used to determine whether a manufacturing or business process is in a state of statistical control. Such control charts are generated / simulated repeatedly at equal time intervals. A [simulated dataset](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html) is available for use in UCI machine learning repository.
+
+A time series of control charts needs to be clustered into their close knit groups. The data set we use is synthetic and is meant to resemble real world information in an anonymized format. It contains six different classes: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift. In this example we will use Mahout to cluster the data into corresponding class buckets. 
+
+*For the sake of simplicity, we won't use a cluster in this example, but instead show you the commands to run the clustering examples locally with Hadoop*.
+
+## Setup
+
+We need to do some initial setup before we are able to run the example. 
+
+
+  1. Start out by downloading the dataset to be clustered from the UCI Machine Learning Repository: [http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data).
+
+  2. Download the [latest release of Mahout](/general/downloads.html).
+
+  3. Unpack the release binary and switch to the *mahout-distribution-0.x* folder
+
+  4. Make sure that the *JAVA_HOME* environment variable points to your local java installation
+
+  5. Create a folder called *testdata* in the current directory and copy the dataset into this folder.
+
+
+## Clustering Examples
+
+Depending on the clustering algorithm you want to run, the following commands can be used:
+
+
+   * [Canopy Clustering](/users/clustering/canopy-clustering.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
+
+   * [k-Means Clustering](/users/clustering/k-means-clustering.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
+
+
+   * [Fuzzy k-Means Clustering](/users/clustering/fuzzy-k-means.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
+
+The clustering output will be produced in the *output* directory. The output data points are in vector format. In order to read/analyze the output, you can use the [clusterdump](/users/clustering/cluster-dumper.html) utility provided by Mahout.
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md b/website-old/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
new file mode 100644
index 0000000..35b2008
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
@@ -0,0 +1,11 @@
+---
+layout: tutorial
+title: (Deprecated)  Clustering Seinfeld Episodes
+theme:
+   name: retro-mahout
+---
+
+Below is short tutorial on how to cluster Seinfeld episode transcripts with
+Mahout.
+
+http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/clusteringyourdata.md b/website-old/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
new file mode 100644
index 0000000..ba7cb0b
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
@@ -0,0 +1,126 @@
+---
+layout: tutorial
+title: (Deprecated)  ClusteringYourData
+theme:
+   name: retro-mahout
+---
+
+# Clustering your data
+
+After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own
+data. See also [Wikipedia on cluster analysis](https://en.wikipedia.org/wiki/Cluster_analysis) for more background.
+
+The following pieces *may* be useful in getting started:
+
+<a name="ClusteringYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format, see [Creating Vectors](../basics/creating-vectors.html).
+In particular for text preparation check out [Creating Vectors from Text](../basics/creating-vectors-from-text.html).
+
+
+<a name="ClusteringYourData-RunningtheProcess"></a>
+# Running the Process
+
+* [Canopy background](canopy-clustering.html) and [canopy-commandline](canopy-commandline.html).
+
+* [K-Means background](k-means-clustering.html), [k-means-commandline](k-means-commandline.html), and
+[fuzzy-k-means-commandline](fuzzy-k-means-commandline.html).
+
+* [Dirichlet background](dirichlet-process-clustering.html) and [dirichlet-commandline](dirichlet-commandline.html).
+
+* [Meanshift background](mean-shift-clustering.html) and [mean-shift-commandline](mean-shift-commandline.html).
+
+* [LDA (Latent Dirichlet Allocation) background](latent-dirichlet-allocation.html) and [lda-commandline](lda-commandline.html).
+
+* TODO: kmeans++/ streaming kMeans documentation
+
+
+<a name="ClusteringYourData-RetrievingtheOutput"></a>
+# Retrieving the Output
+
+Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.
+
+    ./bin/mahout clusterdump <OPTIONS>
+
+
+<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
+## The cluster dumper options are:
+
+      --help (-h)				   Print out help	
+	    
+      --input (-i) input			   The directory containing Sequence    
+    					   Files for the Clusters	    
+
+      --output (-o) output			   The output file.  If not specified,  
+    					   dumps to the console.
+
+      --outputFormat (-of) outputFormat	   The optional output format to write
+    					   the results as. Options: TEXT, CSV, or GRAPH_ML		 
+
+      --substring (-b) substring		   The number of chars of the	    
+    					   asFormatString() to print	
+    
+      --pointsDir (-p) pointsDir		   The directory containing points  
+    					   sequence files mapping input vectors 
+    					   to their cluster. If specified,  
+    					   then the program will output the 
+    					   points associated with a cluster 
+
+      --dictionary (-d) dictionary		   The dictionary file. 	    
+
+      --dictionaryType (-dt) dictionaryType    The dictionary file type	    
+    					   (text|sequencefile)
+
+      --distanceMeasure (-dm) distanceMeasure  The classname of the DistanceMeasure.
+    					   Default is SquaredEuclidean.     
+
+      --numWords (-n) numWords		   The number of top terms to print 
+
+      --tempDir tempDir			   Intermediate output directory
+
+      --startPhase startPhase		   First phase to run
+
+      --endPhase endPhase			   Last phase to run
+
+      --evaluate (-e)			   Run ClusterEvaluator and CDbwEvaluator over the
+    					   input. The output will be appended to the rest of
+    					   the output at the end.   
+
+
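+For example, a run over the output of one of the clustering jobs might look like the following (a sketch only; the cluster, points, and dictionary paths are placeholders for your own output locations):
+
+    ./bin/mahout clusterdump -i output/clusters-10-final -p output/clusteredPoints -d dictionary.file-0 -dt sequencefile -n 20 -o clusteranalyze.txt
+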
+More information on using the clusterdump utility can be found [here](cluster-dumper.html).
+
+<a name="ClusteringYourData-ValidatingtheOutput"></a>
+# Validating the Output
+
+Ted Dunning has offered the following advice on evaluating clusters:
+
+A principled approach to cluster evaluation is to measure how well the
+cluster membership captures the structure of unseen data.  A natural
+measure for this is to measure how much of the entropy of the data is
+captured by cluster membership.  For k-means and its natural L_2 metric,
+the natural cluster quality metric is the squared distance from the nearest
+centroid adjusted by the log_2 of the number of clusters.  This can be
+compared to the squared magnitude of the original data or the squared
+deviation from the centroid for all of the data.  The idea is that you are
+changing the representation of the data by allocating some of the bits in
+your original representation to represent which cluster each point is in. 
+If those bits aren't made up by the residue being small then your
+clustering is making a bad trade-off.
+
+In the past, I have used other more heuristic measures as well.  One of the
+key characteristics that I would like to see out of a clustering is a
+degree of stability.  Thus, I look at the fractions of points that are
+assigned to each cluster or the distribution of distances from the cluster
+centroid. These values should be relatively stable when applied to held-out
+data.
+
+For text, you can actually compute perplexity which measures how well
+cluster membership predicts what words are used.  This is nice because you
+don't have to worry about the entropy of real valued numbers.
+
+Manual inspection and the so-called laugh test is also important.  The idea
+is that the results should not be so ludicrous as to make you laugh.
+Unfortunately, it is pretty easy to kid yourself into thinking your system
+is working using this kind of inspection.  The problem is that we are too
+good at seeing (making up) patterns.
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md b/website-old/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
new file mode 100644
index 0000000..721256d
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
@@ -0,0 +1,97 @@
+---
+layout: tutorial
+title: (Deprecated)  fuzzy-k-means-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="fuzzy-k-means-commandline-RunningFuzzyk-MeansClusteringfromtheCommandLine"></a>
+# Running Fuzzy k-Means Clustering from the Command Line
+Mahout's Fuzzy k-Means clustering can be launched from the same command
+line invocation whether you are running on a single machine in stand-alone
+mode or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run FuzzyK on that cluster. If either of the environment variables is
+missing, then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout fkmeans <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.
+
+
+<a name="fuzzy-k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+
+<a name="fuzzy-k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="fuzzy-k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input			       Path to job input directory. 
+    					       Must be a SequenceFile of    
+    					       VectorWritable		    
+      --clusters (-c) clusters		       The input centroids, as Vectors. 
+    					       Must be a SequenceFile of    
+    					       Writable, Cluster/Canopy. If k  
+    					       is also specified, then a random 
+    					       set of vectors will be selected  
+    					       and written out to this path 
+    					       first			    
+      --output (-o) output			       The directory pathname for   
+    					       output.			    
+      --distanceMeasure (-dm) distanceMeasure      The classname of the	    
+    					       DistanceMeasure. Default is  
+    					       SquaredEuclidean 	    
+      --convergenceDelta (-cd) convergenceDelta    The convergence delta value. 
+    					       Default is 0.5		    
+      --maxIter (-x) maxIter		       The maximum number of	    
+    					       iterations.		    
+      --k (-k) k				       The k in k-Means.  If specified, 
+    					       then a random selection of k 
+    					       Vectors will be chosen as the
+        					       Centroid and written to the  
+    					       clusters input path.	    
+      --m (-m) m				       coefficient normalization    
+    					       factor, must be greater than 1   
+      --overwrite (-ow)			       If present, overwrite the output 
+    					       directory before running job 
+      --help (-h)				       Print out help		    
+      --numMap (-u) numMap			       The number of map tasks.     
+    					       Defaults to 10		    
+      --maxRed (-r) maxRed			       The number of reduce tasks.  
+    					       Defaults to 2		    
+      --emitMostLikely (-e) emitMostLikely	       True if clustering should emit   
+    					       the most likely point only,  
+    					       false for threshold clustering.  
+    					       Default is true		    
+      --threshold (-t) threshold		       The pdf threshold used for   
+    					       cluster determination. Default   
+    					       is 0 
+      --clustering (-cl)			       If present, run clustering after 
+    					       the iterations have taken place  
+                                
+
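+Putting some of these options together, a full invocation might look like the following (a sketch only; testdata, clusters, and output are placeholder paths, and the parameter values are arbitrary examples):
+
+    ./bin/mahout fkmeans -i testdata -c clusters -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -k 25 -x 10 -m 2 -cd 0.5 -ow -cl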

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/k-means-commandline.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/k-means-commandline.md b/website-old/docs/tutorials/map-reduce/clustering/k-means-commandline.md
new file mode 100644
index 0000000..b9ac430
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/k-means-commandline.md
@@ -0,0 +1,94 @@
+---
+layout: tutorial
+title: (Deprecated)  k-means-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="k-means-commandline-Introduction"></a>
+# kMeans commandline introduction
+
+This quick start page describes how to run the kMeans clustering algorithm
+on a Hadoop cluster. 
+
+<a name="k-means-commandline-Steps"></a>
+# Steps
+
+Mahout's k-Means clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run k-Means on that cluster. If either of the environment variables is
+missing, then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout kmeans <OPTIONS>
+
+
+In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.
+
+
+<a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
+
+
+<a name="k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input			       Path to job input directory. 
+    					       Must be a SequenceFile of    
+    					       VectorWritable		    
+      --clusters (-c) clusters		       The input centroids, as Vectors. 
+    					       Must be a SequenceFile of    
+    					       Writable, Cluster/Canopy. If k  
+    					       is also specified, then a random 
+    					       set of vectors will be selected  
+    					       and written out to this path 
+    					       first			    
+      --output (-o) output			       The directory pathname for   
+    					       output.			    
+      --distanceMeasure (-dm) distanceMeasure      The classname of the	    
+    					       DistanceMeasure. Default is  
+    					       SquaredEuclidean 	    
+      --convergenceDelta (-cd) convergenceDelta    The convergence delta value. 
+    					       Default is 0.5		    
+      --maxIter (-x) maxIter		       The maximum number of	    
+    					       iterations.		    
+      --maxRed (-r) maxRed			       The number of reduce tasks.  
+    					       Defaults to 2		    
+      --k (-k) k				       The k in k-Means.  If specified, 
+    					       then a random selection of k 
+    					       Vectors will be chosen as the    
+    					       Centroid and written to the  
+    					       clusters input path.	    
+      --overwrite (-ow)			       If present, overwrite the output 
+    					       directory before running job 
+      --help (-h)				       Print out help		    
+      --clustering (-cl)			       If present, run clustering after 
+    					       the iterations have taken place  
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/lda-commandline.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/lda-commandline.md b/website-old/docs/tutorials/map-reduce/clustering/lda-commandline.md
new file mode 100644
index 0000000..6b10681
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/lda-commandline.md
@@ -0,0 +1,83 @@
+---
+layout: tutorial
+title: (Deprecated)  lda-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a>
+# Running Latent Dirichlet Allocation (algorithm) from the Command Line
+[Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897),
+LDA has been implemented as Collapsed Variational Bayes (cvb).
+
+Mahout's LDA can be launched from the same command line invocation whether
+you are running on a single machine in stand-alone mode or on a larger
+Hadoop cluster. The difference is determined by the $HADOOP_HOME and
+$HADOOP_CONF_DIR environment variables. If both are set to an operating
+Hadoop cluster on the target machine then the invocation will run the LDA
+algorithm on that cluster. If either of the environment variables is
+missing, then the stand-alone Hadoop configuration will be invoked instead.
+
+
+
+    ./bin/mahout cvb <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.
+
+
+<a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+
+<a name="lda-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a>
+# Command line options from Mahout cvb version 0.8
+
+    mahout cvb -h 
+      --input (-i) input					  Path to job input directory.	      
+      --output (-o) output					  The directory pathname for output.  
+      --maxIter (-x) maxIter				  The maximum number of iterations.		
+      --convergenceDelta (-cd) convergenceDelta		  The convergence delta value		    
+      --overwrite (-ow)					  If present, overwrite the output directory before running job    
+      --num_topics (-k) num_topics				  Number of topics to learn		 
+      --num_terms (-nt) num_terms				  Vocabulary size   
+      --doc_topic_smoothing (-a) doc_topic_smoothing	  Smoothing for document/topic distribution	     
+      --term_topic_smoothing (-e) term_topic_smoothing	  Smoothing for topic/term distribution 	 
+      --dictionary (-dict) dictionary			  Path to term-dictionary file(s) (glob expression supported) 
+      --doc_topic_output (-dt) doc_topic_output		  Output path for the training doc/topic distribution	     
+      --topic_model_temp_dir (-mt) topic_model_temp_dir	  Path to intermediate model path (useful for restarting)       
+      --iteration_block_size (-block) iteration_block_size	  Number of iterations per perplexity check  
+      --random_seed (-seed) random_seed			  Random seed	    
+      --test_set_fraction (-tf) test_set_fraction		  Fraction of data to hold out for testing  
+      --num_train_threads (-ntt) num_train_threads		  number of threads per mapper to train with  
+      --num_update_threads (-nut) num_update_threads	  number of threads per mapper to update the model with	       
+      --max_doc_topic_iters (-mipd) max_doc_topic_iters	  max number of iterations per doc for p(topic|doc) learning		  
+      --num_reduce_tasks num_reduce_tasks			  number of reducers to use during model estimation 	   
+      --backfill_perplexity 				  enable backfilling of missing perplexity values		
+      --help (-h)						  Print out help    
+      --tempDir tempDir					  Intermediate output directory	     
+      --startPhase startPhase				  First phase to run    
+      --endPhase endPhase					  Last phase to run
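+
+Putting some of these options together, an invocation might look like the following (a sketch only; the vector, dictionary, and output paths are placeholders, and the parameter values are arbitrary examples):
+
+    ./bin/mahout cvb -i sparse-vectors -o lda-topics -k 20 -x 20 -dict dictionary.file -dt doc-topics -mt cvb-temp-model -ow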
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/viewing-result.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/viewing-result.md b/website-old/docs/tutorials/map-reduce/clustering/viewing-result.md
new file mode 100644
index 0000000..ce9dd91
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/viewing-result.md
@@ -0,0 +1,15 @@
+---
+layout: tutorial
+title: (Deprecated)  Viewing Result
+theme:
+   name: retro-mahout
+---
+* [Algorithm Viewing pages](#ViewingResult-AlgorithmViewingpages)
+
+There are various technologies available to view the output of Mahout
+algorithms.
+* Clusters
+
+<a name="ViewingResult-AlgorithmViewingpages"></a>
+# Algorithm Viewing pages

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/viewing-results.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/viewing-results.md b/website-old/docs/tutorials/map-reduce/clustering/viewing-results.md
new file mode 100644
index 0000000..1b5092f
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/viewing-results.md
@@ -0,0 +1,49 @@
+---
+layout: tutorial
+title: (Deprecated)  Viewing Results
+theme:
+   name: retro-mahout
+---
+<a name="ViewingResults-Intro"></a>
+# Intro
+
+Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
+sequence files or other data structures.  This page is intended to
+demonstrate the various ways one might inspect the outcome of various jobs.
+ The page is organized by algorithms.
+
+<a name="ViewingResults-GeneralUtilities"></a>
+# General Utilities
+
+<a name="ViewingResults-SequenceFileDumper"></a>
+## Sequence File Dumper
+
+
+<a name="ViewingResults-Clustering"></a>
+# Clustering
+
+<a name="ViewingResults-ClusterDumper"></a>
+## Cluster Dumper
+
+Run the following to print out all options:
+
+    java  -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --help
+
+
+
+<a name="ViewingResults-Example"></a>
+### Example
+
+    java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper \
+          --seqFileDir ./solr-clust-n2/out/clusters-2 \
+          --dictionary ./solr-clust-n2/dictionary.txt \
+          --substring 100 --pointsDir ./solr-clust-n2/out/points/
+
+
+
+<a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a>
+## Cluster Labels (MAHOUT-163)
+
+<a name="ViewingResults-Classification"></a>
+# Classification

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md b/website-old/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
new file mode 100644
index 0000000..fe4b93f
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
@@ -0,0 +1,50 @@
+---
+layout: tutorial
+title: (Deprecated)  Visualizing Sample Clusters
+theme:
+   name: retro-mahout
+---
+
+<a name="VisualizingSampleClusters-Introduction"></a>
+# Introduction
+
+Mahout provides examples to visualize sample clusters that get created by
+our clustering algorithms. Note that the visualization is done by Swing programs, so you have to be in a window system on the same
+machine you run these on, or be logged in via a remote desktop.
+
+For visualizing the clusters, you have to execute the Java
+classes under the *org.apache.mahout.clustering.display* package in the
+mahout-examples module. The easiest way to achieve this is to [set up Mahout](users/basics/quickstart.html) in your IDE.
+
+<a name="VisualizingSampleClusters-Visualizingclusters"></a>
+# Visualizing clusters
+
+The following classes in *org.apache.mahout.clustering.display* can be run
+without parameters to generate a sample data set and run the reference
+clustering implementations over them:
+
+1. **DisplayClustering** - generates 1000 samples from three symmetric
+distributions. This is the same data set that is used by the following
+clustering programs. It displays the points on a screen and superimposes
+the model parameters that were used to generate the points. You can edit
+the *generateSamples()* method to change the sample points used by these
+programs.
+1. **DisplayClustering** - displays initial areas of generated points
+1. **DisplayCanopy** - uses Canopy clustering
+1. **DisplayKMeans** - uses k-Means clustering
+1. **DisplayFuzzyKMeans** - uses Fuzzy k-Means clustering
+1. **DisplaySpectralKMeans** - uses Spectral KMeans via map-reduce algorithm
+
+If you are using Eclipse, just right-click on each of the classes mentioned above and choose "Run As > Java Application". To run these directly from the command line:
+
+    cd $MAHOUT_HOME/examples
+    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering
+
+You can substitute other names above for *DisplayClustering*. 
+
+
+Note that some of these programs display the sample points and then superimpose all of the clusters from each iteration. The last iteration's clusters are in
+bold red and the previous several are colored (orange, yellow, green, blue,
+magenta) in order, after which all earlier clusters are in light grey. This
+helps to visualize how the clusters converge upon a solution over multiple
+iterations.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/index.md b/website-old/docs/tutorials/map-reduce/index.md
new file mode 100644
index 0000000..b1a269c
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/index.md
@@ -0,0 +1,19 @@
+---
+layout: tutorial
+title: (Deprecated)  Deprecated Map Reduce Based Examples
+theme:
+    name: mahout2
+---
+
+A note about the sunsetting of our support for MapReduce: the MapReduce-based implementations are deprecated, and the tutorials below are kept for reference only.
+
+
+### Classification
+
+[Bank Marketing Example](classification/bankmarketing-example.html)
+
+[Breiman Example](classification/breiman-example.html)
+
+[Twenty Newsgroups](classification/twenty-newsgroups.html)
+
+[Wikipedia Classifier Example](classification/wikipedia-classifier-example.html)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/mr---map-reduce.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/misc/mr---map-reduce.md b/website-old/docs/tutorials/map-reduce/misc/mr---map-reduce.md
new file mode 100644
index 0000000..a85d0cd
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/misc/mr---map-reduce.md
@@ -0,0 +1,19 @@
+---
+layout: default
+title: (Deprecated)  MR - Map Reduce
+theme:
+   name: retro-mahout
+---
+
+MapReduce is a framework for processing huge datasets on certain
+kinds of distributable problems using a large number of computers (nodes),
+collectively referred to as a cluster. Computational processing
+can occur on data stored either in a filesystem (unstructured) or within a
+database (structured).
+
+Also written M/R.
+
+See also:
+* [http://wiki.apache.org/hadoop/HadoopMapReduce](http://wiki.apache.org/hadoop/HadoopMapReduce)
+* [http://en.wikipedia.org/wiki/MapReduce](http://en.wikipedia.org/wiki/MapReduce)

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md b/website-old/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
new file mode 100644
index 0000000..213c385
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
@@ -0,0 +1,185 @@
+---
+layout: default
+title: (Deprecated)  Parallel Frequent Pattern Mining
+theme:
+    name: retro-mahout
+---
+Mahout has a Top-K Parallel FPGrowth implementation. It is based on the paper
+[http://infolab.stanford.edu/~echang/recsys08-69.pdf](http://infolab.stanford.edu/~echang/recsys08-69.pdf), with some optimisations in mining the data.
+
+Given a huge transaction list, the algorithm finds all unique features (sets
+of field values) and eliminates those features whose frequency in the whole
+dataset is less than minSupport. Using the remaining N features, we find
+the top K closed patterns for each of them, generating a total of NxK
+patterns. FPGrowth is a generic implementation; in principle any object
+type can denote a feature. The current implementation requires you to use
+a String as the object type. You may implement a version for any object by
+creating Iterators, Convertors and TopKPatternWritable for that particular
+object. For more information please refer to the package
+*org.apache.mahout.fpm.pfpgrowth.convertors.string*.
+
+For example:
+
+    FPGrowth<String> fp = new FPGrowth<String>();
+    Set<String> features = new HashSet<String>();
+    fp.generateTopKStringFrequentPatterns(
+        new StringRecordIterator(
+            new FileLineIterable(new File(input), encoding, false), pattern),
+        fp.generateFList(
+            new StringRecordIterator(
+                new FileLineIterable(new File(input), encoding, false), pattern),
+            minSupport),
+        minSupport,
+        maxHeapSize,
+        features,
+        new StringOutputConvertor(
+            new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer)));
+
+* The first argument is the iterator over the transactions; in this case it
+is an Iterator<List<String>>.
+* The second argument is the output of the generateFList function, which
+returns the frequent items and their frequencies from the given database
+transaction iterator.
+* The third argument is the minimum support of a pattern to be generated.
+* The fourth argument is the maximum number of patterns to be mined for
+each feature.
+* The fifth argument is the set of features for which the frequent patterns
+have to be mined.
+* The last argument is an output collector which takes [key, value] pairs
+of feature and top-K patterns in the format [String,
+List<Pair<List<String>, Long>>] and writes them to the appropriate writer
+class, which takes care of storing the object, in this case in SequenceFile
+output format.
+
+<a name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a>
+## Running Frequent Pattern Growth via command line
+
+The command line launcher for string transaction data,
+org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features, including
+specifying the regex pattern for splitting a string line of a transaction
+into its constituent features.
+
+Input files have to be in the following format:
+
+    <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE...
+
+Instead of a tab you could use `,` or `|`, as the default tokenization is
+done using the Java regex pattern `[ ,\t]*[,|\t][ ,\t]*`.
+You can override this parameter to parse your log files or transaction
+files (each line is a transaction). The FPGrowth algorithm mines the top K
+frequently occurring sets of items and their counts from the given input
+data.
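+
+For illustration, a small hypothetical transaction file in this format might
+look like the following (a tab separates the optional document ID from the
+space-separated tokens; the IDs and tokens are made up):
+
+    doc1	milk bread butter
+    doc2	bread beer
+    doc3	milk bread beer butter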
+
+$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this
+format.
+Another sample file is accidents.dat.gz from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/).
+As a quick test, try this:
+
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+The minimumSupport parameter \-s is the minimum number of times a pattern
+or a feature needs to occur in the dataset so that it is included in the
+patterns generated. You can speed up the process by using a larger value of
+\-s. There will be cases where you get fewer than k patterns for a
+particular feature because the rest don't qualify for the minimum support
+criterion.
+
+Note that the input to the algorithm can be an uncompressed or gzip-compressed
+file, or even a directory containing any number of such files.
+We modified the regex to use a space to split the tokens. Note that the input
+regex string is escaped.
+
+<a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a>
+## Running Parallel FPGrowth
+
+Running parallel FPGrowth is as easy as changing the flag to \-method
+mapreduce and adding the number of groups parameter, e.g. \-g 20 for 20
+groups. First, let's run the above sample test in map-reduce mode:
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+The above test took 102 seconds on a dual-core laptop vs. 609 seconds in
+sequential mode (with 5 GB of RAM allocated). In a separate test,
+the first 1000 lines of retail.dat took 20 seconds in map-reduce mode vs. 30
+seconds in sequential mode.
+
+Here is another dataset which, while several times larger, requires much
+less time to find frequent patterns, as there are very few. Get
+accidents.dat.gz from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/)
+and place it on your HDFS in a folder named accidents. Then run the
+Hadoop version of the FPGrowth job:
+
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+
+Or, to run a dataset of this size in sequential mode on a single machine,
+let's give Mahout a lot more memory and only keep features with more than
+300 members:
+
+    export MAHOUT_HEAPSIZE=-Xmx5000m
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+
+The numGroups parameter \-g in FPGrowthJob specifies the number of groups
+into which transactions have to be decomposed. The default of 1000 works
+very well on a single-machine cluster; this may be very different on large
+clusters.
+
+Note that accidents.dat has 340 unique features. So we chose \-g 10 to
+split the transactions across 10 shards, where about 34 features are mined
+from each shard. (Note: the feature count doesn't need to be exactly
+divisible by \-g; the algorithm takes care of calculating the split.) For
+better performance on large datasets and clusters, try not to mine more
+than 20-25 features per shard. Stick to the defaults on a small machine.
+
+The numTreeCacheEntries parameter \-tc specifies the number of generated
+conditional FP-Trees to be kept in memory so that subsequent operations do
+not need to regenerate them. Increasing this number increases memory
+consumption but might improve speed up to a certain point. This depends
+entirely on the dataset in question. A value of 5-10 is recommended for
+mining up to the top 100 patterns for each feature.
+
+<a name="ParallelFrequentPatternMining-Viewingtheresults"></a>
+## Viewing the results
+The output will be dumped to a SequenceFile in the frequentpatterns
+directory in Text=>TopKStringPatterns format. Run this command to see a few
+of the Frequent Patterns:
+
+    bin/mahout seqdumper \
+         -i patterns/frequentpatterns/part-?-00000 \
+         -n 4
+
+or replace -n 4 with -c for the count of patterns.
+
+Open question: how does one experiment with and monitor these various
+parameters?

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md b/website-old/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
new file mode 100644
index 0000000..a3c7063
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
@@ -0,0 +1,41 @@
+---
+layout: default
+title: (Deprecated)  Perceptron and Winnow
+theme:
+    name: retro-mahout
+---
+<a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a>
+# Classification with Perceptron or Winnow
+
+Both algorithms are comparatively simple linear classifiers. Given training
+data in some n-dimensional vector space that is annotated with binary
+labels, the algorithms are guaranteed to find a linear separating hyperplane
+if one exists. In contrast to the Perceptron, Winnow works only for binary
+feature vectors.
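+
+As a rough, hypothetical illustration of the Perceptron update rule (this is
+not Mahout's implementation; the names and data layout are made up for the
+example):
+
+    import numpy as np
+
+    def train_perceptron(X, y, epochs=10):
+        """Train a perceptron on rows of X with labels y in {-1, +1}."""
+        w = np.zeros(X.shape[1])
+        b = 0.0
+        for _ in range(epochs):
+            for xi, yi in zip(X, y):
+                # update the hyperplane only when a point is misclassified
+                if yi * (np.dot(w, xi) + b) <= 0:
+                    w += yi * xi
+                    b += yi
+        return w, b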
+
+For more information on the Perceptron see for instance:
+http://en.wikipedia.org/wiki/Perceptron
+
+Concise course notes on both algorithms:
+http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture24.pdf
+
+Although the algorithms are comparatively simple, they still work pretty well
+for text classification and are fast to train even for huge example sets.
+In contrast to Naive Bayes, they are not based on the assumption that all
+features (in the domain of text classification: all terms in a document)
+are independent.
+
+<a name="PerceptronandWinnow-Strategyforparallelisation"></a>
+## Strategy for parallelisation
+
+Currently the strategy for parallelisation is simple: given there is enough
+training data, split the training data, train the classifier on each split,
+and average the resulting hyperplanes, as sketched below.
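+
+A minimal sketch of this split-and-average strategy, reusing the hypothetical
+train_perceptron helper from the sketch above (again, not Mahout's code):
+
+    import numpy as np
+
+    def train_averaged(X, y, num_splits=4):
+        """Train one perceptron per data split and average the hyperplanes."""
+        splits = zip(np.array_split(X, num_splits), np.array_split(y, num_splits))
+        models = [train_perceptron(Xs, ys) for Xs, ys in splits]
+        w_avg = np.mean([w for w, _ in models], axis=0)
+        b_avg = np.mean([b for _, b in models])
+        return w_avg, b_avg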
+
+<a name="PerceptronandWinnow-Roadmap"></a>
+## Roadmap
+
+Currently the patch only contains the code for the classifier itself. It is
+planned to provide unit tests and at least one example based on the WebKB
+dataset by the end of November for the serial version. After that the
+parallelisation will be added.

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/testing.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/misc/testing.md b/website-old/docs/tutorials/map-reduce/misc/testing.md
new file mode 100644
index 0000000..826fff8
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/misc/testing.md
@@ -0,0 +1,46 @@
+---
+layout: default
+title: (Deprecated)  Testing
+theme:
+    name: retro-mahout
+---
+<a name="Testing-Intro"></a>
+# Intro
+
+As Mahout matures, solid testing procedures are needed.  This page and its
+children capture test plans along with ideas for improving our testing.
+
+<a name="Testing-TestPlans"></a>
+# Test Plans
+
+* [0.6](0.6.html) - Test plans for the 0.6 release.
+
+There are no special plans except for unit tests and user testing of the
+Hadoop jobs.
+
+<a name="Testing-TestIdeas"></a>
+# Test Ideas
+
+<a name="Testing-Regressions/Benchmarks/Integrations"></a>
+## Regressions/Benchmarks/Integrations
+* Algorithmic quality and speed are not tested, except in a few instances.
+Such tests often require much longer run times (minutes to hours), a
+running Hadoop cluster, and downloads of large datasets (megabytes in size).
+* Standardized speed tests are difficult to run consistently on different hardware.
+* Unit tests of external integrations require access to external systems: HDFS,
+S3, JDBC, Cassandra, etc.
+
+Apache Jenkins is not able to support these environments. Commercial
+donations would help. 
+
+<a name="Testing-UnitTests"></a>
+## Unit Tests
+Mahout's current tests are almost entirely unit tests. Algorithm tests
+generally supply a few numbers to code paths and verify that the expected
+numbers come out. 'mvn test' runs these tests. There is "positive" coverage
+of a great many utilities and algorithms. A much smaller percentage includes
+"negative" coverage (bogus setups, inputs, and combinations).
+
+<a name="Testing-Other"></a>
+## Other
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md b/website-old/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
new file mode 100644
index 0000000..4b6af12
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
@@ -0,0 +1,222 @@
+---
+layout: default
+title: (Deprecated)  Using Mahout with Python via JPype
+theme:
+    name: retro-mahout
+---
+
+<a name="UsingMahoutwithPythonviaJPype-overview"></a>
+# Mahout from Python - some examples
+This tutorial provides some sample code illustrating how we can read and
+write sequence files containing Mahout vectors from Python using JPype.
+This tutorial is intended for people who want to use Python for analyzing
+and plotting Mahout data. Using Mahout from Python turns out to be quite
+easy.
+
+This tutorial concerns the use of CPython (the standard Python
+implementation) as opposed to Jython. Jython wasn't an option for me,
+because (to the best of my knowledge) it doesn't work with the Python
+extensions numpy, matplotlib, or h5py, which I rely on heavily.
+
+The instructions below explain how to set up a Python script to read and
+write the output of Mahout clustering.
+
+You will first need to download and install the JPype package for Python.
+
+The first step to setting up JPype is determining the path to the dynamic
+library for the JVM; on Linux this will be a .so file and on Windows it
+will be a .dll.
+
+In your Python script, create a global variable with the path to this library.
+
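+A minimal, hypothetical sketch (the exact path is system-specific, so adjust
+it for your machine):
+
+    # hypothetical location of the JVM shared library; adjust for your system
+    jvmlib = "/usr/lib/jvm/default-java/jre/lib/amd64/server/libjvm.so"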
+
+
+Next we need to figure out how to set the classpath for Mahout. The
+easiest way to do this is to edit the "bin/mahout" script to print out
+the classpath: add the line "echo $CLASSPATH" somewhere after
+the comment "run it" (this is line 195 or so), then execute the script.
+Copy the printed classpath and paste it into a variable (for example,
+*classpath*) in your Python script; it will be a long, environment-specific string.
+
+
+
+
+Now we can create a function to start the JVM in Python using JPype:
+
+    from jpype import *  # provides startJVM, JPackage, JArray, JDouble, JObject
+
+    jvm = None
+
+    def start_jpype():
+        global jvm
+        if jvm is None:
+            cpopt = "-Djava.class.path={cp}".format(cp=classpath)
+            startJVM(jvmlib, "-ea", cpopt)
+            jvm = "started"
+
+
+
+<a name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a>
+# Writing Named Vectors to Sequence Files from Python
+We can now use JPype to create sequence files which will contain vectors to
+be used by Mahout for kmeans. The example below is a function which creates
+vectors from two Gaussian distributions with unit variance.
+
+
+    import numpy as np
+    # assumes the jpype imports and start_jpype() from the snippet above
+
+    def create_inputs(ifile,*args,**param):
+     """Create a sequence file containing some normally distributed vectors
+    	ifile - path to the sequence file to create
+     """
+     
+     #matrix of the cluster means
+     cmeans=np.array([[1,1] ,[-1,-1]],np.int)
+     
+     nperc=30  #number of points per cluster
+     
+     vecs=[]
+     
+     vnames=[]
+     for cind in range(cmeans.shape[0]):
+      pts=np.random.randn(nperc,2)
+      pts=pts+cmeans[cind,:].reshape([1,cmeans.shape[1]])
+      vecs.append(pts)
+     
+      #names for the vectors
+      #names are just the points with an index
+      #we do this so we can validate by cross-referencing the name with the vector
+      vn=np.empty(nperc,dtype=(np.str,30))
+      for row in range(nperc):
+       vn[row]="c"+str(cind)+"_"+pts[row,0].astype((np.str,4))+"_"+pts[row,1].astype((np.str,4))
+      vnames.append(vn)
+      
+     vecs=np.vstack(vecs)
+     vnames=np.hstack(vnames)
+     
+    
+     #start the jvm
+     start_jpype()
+     
+     #create the sequence file that we will write to
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     DenseVectorCls=JPackage("org").apache.mahout.math.DenseVector
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     writer=io.SequenceFile.createWriter(fs, conf, path,io.Text,VectorWritableCls)
+     
+     
+     vecwritable=VectorWritableCls()
+     for row in range(vecs.shape[0]):
+      nvector=NamedVectorCls(DenseVectorCls(JArray(JDouble,1)(vecs[row,:])),vnames[row])
+      #need to wrap key and value because of overloading
+      wrapkey=JObject(io.Text("key "+str(row)),io.Writable)
+      wrapval=JObject(vecwritable,io.Writable)
+      
+      vecwritable.set(nvector)
+      writer.append(wrapkey,wrapval)
+      
+     writer.close()
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a>
+# Reading the KMeans Clustered Points from Python
+Similarly, we can use JPype to easily read the clustered points output by
+Mahout.
+
+    def read_clustered_pts(ifile,*args,**param):
+     """Read the clustered points
+     ifile - path to the sequence file containing the clustered points
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file that we will read from
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     
+     
+     ReaderCls=io.__getattribute__("SequenceFile$Reader") 
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=reader.getKeyClass()()
+     
+    
+     valcls=reader.getValueClass()
+     vecwritable=valcls()
+     while (reader.next(key,vecwritable)):	
+      weight=vecwritable.getWeight()
+      nvec=vecwritable.getVector()
+      
+      cname=nvec.__class__.__name__
+      if (cname.rsplit('.',1)[1]=="NamedVector"):
+       print "cluster={key} Name={name} x={x} y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1))
+      else:
+       raise NotImplementedError("Vector isn't a NamedVector. Need to modify/test the code to handle this case.")
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a>
+# Reading the KMeans Centroids
+Finally, we can create a function to print out the actual cluster centers
+found by Mahout:
+
+    def getClusters(ifile,*args,**param):
+     """Read the centroids from the clusters outputted by kmenas
+    	   ifile - Path to the sequence file containing the centroids
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file that we will read from
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     ReaderCls=io.__getattribute__("SequenceFile$Reader")
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=io.Text()
+     
+    
+     valcls=reader.getValueClass()
+    
+     vecwritable=valcls()
+     
+     while (reader.next(key,vecwritable)):	
+      center=vecwritable.getCenter()
+      
+      print "id={cid}center={center}".format(cid=vecwritable.getId(),center=center.values)
+      pass
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md b/website-old/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
new file mode 100644
index 0000000..ad540c7
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
@@ -0,0 +1,98 @@
+---
+layout: default
+title: (Deprecated)  Introduction to ALS Recommendations with Hadoop
+theme:
+    name: retro-mahout
+---
+
+# Introduction to ALS Recommendations with Hadoop
+
+##Overview
+
+Mahout’s ALS recommender is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It factors the user-to-item matrix *A* into the user-to-feature matrix *U* and the item-to-feature matrix *M*, and it runs the ALS algorithm in a parallel fashion. Details of the algorithm can be found in the following papers:
+
+* [Large-scale Parallel Collaborative Filtering for
+the Netflix Prize](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
+* [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433) 
+
+This recommendation algorithm can be used in an eCommerce platform to recommend products to customers. Unlike the user-based or item-based recommenders that compute the similarity of users or items to make recommendations, the ALS algorithm uncovers the latent factors that explain the observed user-to-item ratings and tries to find optimal factor weights to minimize the least squares error between predicted and actual ratings.
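+
+As a rough sketch of the objective being minimized (this follows the ALS-WR paper above rather than Mahout's own documentation; r_ui is an observed rating, u_u and m_i are the user and item factor vectors, n_u and n_i count the ratings of user u and item i, and lambda is the regularization parameter):
+
+    \min_{U,M} \sum_{(u,i)\,\mathrm{observed}} \left( r_{ui} - u_u^{\top} m_i \right)^2
+               + \lambda \left( \sum_u n_u \lVert u_u \rVert^2 + \sum_i n_i \lVert m_i \rVert^2 \right)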
+
+Mahout's ALS recommendation algorithm takes user preferences by item as input and produces recommended items for each user as output. The input preferences can be either explicit user ratings or implicit feedback such as a user's clicks on a web page.
+
+One of the strengths of the ALS-based recommender, compared to the user-based or item-based recommenders, is its ability to handle large sparse data sets and its better prediction performance. It can also give an intuitive rationale for the factors that influence recommendations.
+
+##Implementation
+At present Mahout has a map-reduce implementation of ALS, which is composed of 2 jobs: a parallel matrix factorization job and a recommendation job.
+The matrix factorization job computes the user-to-feature matrix and item-to-feature matrix given the user to item ratings. Its input includes: 
+<pre>
+    --input: directory containing files of explicit user-to-item ratings or implicit feedback;
+    --output: output path of the user-feature matrix and feature-item matrix;
+    --lambda: regularization parameter to avoid overfitting;
+    --alpha: confidence parameter, only used for implicit feedback;
+    --implicitFeedback: boolean flag to indicate whether the input dataset contains implicit feedback;
+    --numFeatures: dimension of the feature space;
+    --numThreadsPerSolver: number of threads per solver mapper for concurrent execution;
+    --numIterations: number of iterations;
+    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated
+</pre>
+and it outputs the matrices in sequence file format. 
+
+The recommendation job uses the user feature matrix and item feature matrix calculated from the factorization job to compute the top-N recommendations per user. Its input includes:
+<pre>
+    --input: directory containing files of user ids;
+    --output: output path of the recommended items for each input user id;
+    --userFeatures: path to the user feature matrix;
+    --itemFeatures: path to the item feature matrix;
+    --numRecommendations: maximum number of recommendations per user, default is 10;
+    --maxRating: maximum rating available;
+    --numThreads: number of threads per mapper;
+    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated;
+    --userIDIndex: index for user long IDs (necessary if usesLongIDs is true);
+    --itemIDIndex: index for item long IDs (necessary if usesLongIDs is true) 
+</pre>
+and it outputs a list of recommended item IDs for each user. The predicted rating for a user and an item is the dot product of the user's feature vector and the item's feature vector.
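+
+For intuition, here is a hypothetical sketch of how such a prediction is formed from the two factor matrices (plain numpy, not Mahout's code; the numbers are made up):
+
+    import numpy as np
+
+    # hypothetical 2-feature factors for one user and one item
+    user_features = np.array([0.8, 1.3])
+    item_features = np.array([1.1, 0.4])
+
+    # predicted rating = dot product: 0.8*1.1 + 1.3*0.4 = 1.4
+    predicted_rating = np.dot(user_features, item_features)
+    print(predicted_rating)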
+
+##Example
+
+Let’s look at a simple example of how we could use Mahout’s ALS recommender to recommend items for users. First, you’ll need to get Mahout up and running, the instructions for which can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've ensured Mahout is properly installed, we’re ready to run the example.
+
+**Step 1: Prepare test data**
+
+Similar to Mahout's item-based recommender, the ALS recommender relies on user-to-item preference data: *userID*, *itemID* and *preference*. The preference can be an explicit numeric rating or a count of actions such as clicks (implicit feedback). The test data file is organized so that each line is a tab-delimited string: the first field is the user ID (which must be numeric), the second field is the item ID (which must be numeric), and the third field is the preference (which should also be a number).
+
+**Note:** You must create IDs that are ordinal positive integers for all user and item IDs. Often this will require you to keep a dictionary
+to map into and out of Mahout IDs, as sketched below. For instance, if the first user has ID "xyz" in your application, this would get a Mahout ID of the integer 1, and so on. The same
+holds for item IDs. Then, after recommendations are calculated, you will have to translate the Mahout user and item IDs back into your application IDs.
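+
+A minimal, hypothetical sketch of such a dictionary (your application would normally persist these mappings):
+
+    # map application IDs to ordinal Mahout IDs and back (hypothetical example)
+    app_user_ids = ["xyz", "abc", "def"]
+    user_to_mahout = {app_id: i + 1 for i, app_id in enumerate(app_user_ids)}
+    mahout_to_user = {m_id: app_id for app_id, m_id in user_to_mahout.items()}
+
+    print(user_to_mahout["xyz"])   # -> 1
+    print(mahout_to_user[1])       # -> "xyz"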
+
+To quickly start, you could specify a text file like following as the input:
+<pre>
+1	100	1
+1	200	5
+1	400	1
+2	200	2
+2	300	1
+</pre>
+
+**Step 2: Determine parameters**
+
+In addition, users need to determine the dimension of the feature space and the number of iterations to run the alternating least squares algorithm. Using 10 features and 15 iterations is a reasonable default to try first. Optionally, a confidence parameter can be set if the input preference is implicit user feedback.
+
+**Step 3: Run ALS**
+
+Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly, we’re ready to run the factorization job. Enter the following command:
+
+    $ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 --implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5  --numThreadsPerSolver 1 --tempDir tmp 
+
+Running the command will execute a series of jobs, the final product of which will be an output file deposited to the output directory specified in the command syntax. The output directory contains 3 sub-directories: *M* stores the item-to-feature matrix, *U* stores the user-to-feature matrix and *userRatings* stores the users' ratings on the items. The *tempDir* parameter specifies the directory to store the intermediate output of the job, such as the matrix output in each iteration and each item's average rating. Using *tempDir* will help with debugging.
+
+**Step 4: Make Recommendations**
+
+Based on the output feature matrices from step 3, we could make recommendations for users. Enter the following command:
+
+     $ mahout recommendfactorized --input $als_recommender_input --userFeatures $als_output/U/ --itemFeatures $als_output/M/ --numRecommendations 1 --output recommendations --maxRating 1
+
+The input user file is a sequence file; the record key is the user ID and the value is the user's rated item IDs, which will be removed from the recommendations. The output file generated in our simple example will be a text file giving the recommended item IDs for each user.
+Remember to translate the Mahout IDs back into your application-specific IDs.
+
+There are a variety of parameters for Mahout’s ALS recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions. Feel free to ask such questions on the [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html).
+

