http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringofsyntheticcontroldata.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringofsyntheticcontroldata.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringofsyntheticcontroldata.md
deleted file mode 100644
index 693568f..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringofsyntheticcontroldata.md
+++ /dev/null
@@ -1,53 +0,0 @@

---
layout: default
title: Clustering of synthetic control data
theme:
  name: retro-mahout
---

# Clustering synthetic control data

## Introduction

This example will demonstrate clustering of time series data, specifically control charts. [Control charts](http://en.wikipedia.org/wiki/Control_chart) are tools used to determine whether a manufacturing or business process is in a state of statistical control. Such control charts are generated/simulated repeatedly at equal time intervals. A [simulated dataset](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html) is available in the UCI machine learning repository.

A time series of control charts needs to be clustered into close-knit groups. The data set we use is synthetic and is meant to resemble real-world information in an anonymized format. It contains six different classes: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift and Downward shift. In this example we will use Mahout to cluster the data into the corresponding class buckets.

*For the sake of simplicity, we won't use a Hadoop cluster in this example, but instead show you the commands to run the clustering examples locally with Hadoop in standalone mode*.

## Setup

We need to do some initial setup before we are able to run the example.


 1. Start out by downloading the dataset to be clustered from the UCI Machine Learning Repository: [http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data).

 2. Download the [latest release of Mahout](/general/downloads.html).

 3. Unpack the release binary and switch to the *mahout-distribution-0.x* folder

 4. Make sure that the *JAVA_HOME* environment variable points to your local java installation

 5. Create a folder called *testdata* in the current directory and copy the dataset into this folder.


## Clustering Examples

Depending on the clustering algorithm you want to run, the following commands can be used:


 * [Canopy Clustering](/users/clustering/canopyclustering.html)

 bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job

 * [kMeans Clustering](/users/clustering/kmeansclustering.html)

 bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job


 * [Fuzzy kMeans Clustering](/users/clustering/fuzzykmeans.html)

 bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job

The clustering output will be produced in the *output* directory. The output data points are in vector format. In order to read/analyze the output, you can use the [clusterdump](/users/clustering/clusterdumper.html) utility provided by Mahout.

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringseinfeldepisodes.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringseinfeldepisodes.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringseinfeldepisodes.md
deleted file mode 100644
index 8a983fc..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringseinfeldepisodes.md
+++ /dev/null
@@ -1,11 +0,0 @@

---
layout: default
title: Clustering Seinfeld Episodes
theme:
  name: retro-mahout
---

Below is a short tutorial on how to cluster Seinfeld episode transcripts with
Mahout.

http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringyourdata.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringyourdata.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringyourdata.md
deleted file mode 100644
index bdd6d01..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/clusteringyourdata.md
+++ /dev/null
@@ -1,126 +0,0 @@

---
layout: default
title: ClusteringYourData
theme:
  name: retro-mahout
---

# Clustering your data

After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own
data. See also [Wikipedia on cluster analysis](http://en.wikipedia.org/wiki/Cluster_analysis) for more background.

The following pieces *may* be useful in getting started:

<a name="ClusteringYourDataInput"></a>
# Input

For starters, you will need your data in an appropriate Vector format, see [Creating Vectors](../basics/creatingvectors.html).
In particular for text preparation check out [Creating Vectors from Text](../basics/creatingvectorsfromtext.html).


<a name="ClusteringYourDataRunningtheProcess"></a>
# Running the Process

* [Canopy background](canopyclustering.html) and [canopycommandline](canopycommandline.html).

* [KMeans background](kmeansclustering.html), [kmeanscommandline](kmeanscommandline.html), and
[fuzzykmeanscommandline](fuzzykmeanscommandline.html).

* [Dirichlet background](dirichletprocessclustering.html) and [dirichletcommandline](dirichletcommandline.html).

* [Meanshift background](meanshiftclustering.html) and [meanshiftcommandline](meanshiftcommandline.html).

* [LDA (Latent Dirichlet Allocation) background](latentdirichletallocation.html) and [ldacommandline](ldacommandline.html).

* TODO: k-means++ / streaming kMeans documentation


<a name="ClusteringYourDataRetrievingtheOutput"></a>
# Retrieving the Output

Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.

 ./bin/mahout clusterdump <OPTIONS>


<a name="ClusteringYourDataTheclusterdumperoptionsare:"></a>
## The cluster dumper options are:

  --help (-h)                          Print out help

  --input (-i) input                   The directory containing Sequence
                                         Files for the Clusters

  --output (-o) output                 The output file. If not specified,
                                         dumps to the console.

  --outputFormat (-of) outputFormat    The optional output format to write
                                         the results as. Options: TEXT, CSV, or GRAPH_ML

  --substring (-b) substring           The number of chars of the
                                         asFormatString() to print

  --pointsDir (-p) pointsDir           The directory containing points
                                         sequence files mapping input vectors to their cluster. If specified,
                                         then the program will output the
                                         points associated with a cluster

  --dictionary (-d) dictionary         The dictionary file.

  --dictionaryType (-dt) dictionaryType  The dictionary file type
                                         (text|sequencefile)

  --distanceMeasure (-dm) distanceMeasure  The classname of the DistanceMeasure.
                                         Default is SquaredEuclidean.

  --numWords (-n) numWords             The number of top terms to print

  --tempDir tempDir                    Intermediate output directory

  --startPhase startPhase              First phase to run

  --endPhase endPhase                  Last phase to run

  --evaluate (-e)                      Run ClusterEvaluator and CDbwEvaluator over the
                                         input. The output will be appended to the rest of
                                         the output at the end.


More information on using the clusterdump utility can be found [here](clusterdumper.html).

<a name="ClusteringYourDataValidatingtheOutput"></a>
# Validating the Output

> Ted Dunning: A principled approach to cluster evaluation is to measure how well the
> cluster membership captures the structure of unseen data. A natural
> measure for this is to measure how much of the entropy of the data is
> captured by cluster membership. For kmeans and its natural L_2 metric,
> the natural cluster quality metric is the squared distance from the nearest
> centroid adjusted by the log_2 of the number of clusters. This can be
> compared to the squared magnitude of the original data or the squared
> deviation from the centroid for all of the data. The idea is that you are
> changing the representation of the data by allocating some of the bits in
> your original representation to represent which cluster each point is in.
> If those bits aren't made up for by the residue being small, then your
> clustering is making a bad tradeoff.
>
> In the past, I have used other, more heuristic measures as well. One of the
> key characteristics that I would like to see out of a clustering is a
> degree of stability. Thus, I look at the fractions of points that are
> assigned to each cluster or the distribution of distances from the cluster
> centroid. These values should be relatively stable when applied to held-out
> data.
>
> For text, you can actually compute perplexity, which measures how well
> cluster membership predicts what words are used. This is nice because you
> don't have to worry about the entropy of real-valued numbers.
>
> Manual inspection and the so-called laugh test is also important. The idea
> is that the results should not be so ludicrous as to make you laugh.
> Unfortunately, it is pretty easy to kid yourself into thinking your system
> is working using this kind of inspection. The problem is that we are too
> good at seeing (making up) patterns.
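Dunning's squared-distance criterion can be sketched in a few lines. This is an illustrative Python sketch, not a Mahout utility; the function names are ours:

```python
import math

def kmeans_quality(points, centroids):
    """Total squared distance from each point to its nearest centroid,
    plus the log2(k) bits per point spent encoding cluster membership."""
    k = len(centroids)
    residual = sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        for p in points)
    bits = len(points) * math.log2(k)
    return residual, bits

def total_variance(points):
    """Squared deviation of all points from the global centroid, for comparison."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in points)
```

If the residual (together with the membership-encoding bits) is not substantially smaller than the total squared deviation, the clustering is making the bad trade-off Dunning describes.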

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/expectationmaximization.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/expectationmaximization.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/expectationmaximization.md
deleted file mode 100644
index 6ccc8c3..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/expectationmaximization.md
+++ /dev/null
@@ -1,62 +0,0 @@

---
layout: default
title: Expectation Maximization
theme:
  name: retro-mahout
---

<a name="ExpectationMaximizationExpectationMaximization"></a>
# Expectation Maximization

The principle of EM can be applied to several learning settings, but it is
most commonly associated with clustering. The main principle of the
algorithm is comparable to kMeans. Yet in contrast to hard cluster
assignments, each object is given some probability of belonging to each cluster.
Accordingly, cluster centers are recomputed based on the average of all
objects, weighted by their probability of belonging to the cluster at hand.
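A single EM iteration for clustering can be sketched as follows. This is an illustrative Python sketch assuming isotropic Gaussians with a shared variance and equal priors; it is not Mahout code:

```python
import math

def em_step(points, means, var=1.0):
    """One EM iteration: the E-step computes each point's responsibility
    for every cluster (isotropic Gaussian density, equal priors); the
    M-step recomputes each mean as the responsibility-weighted average."""
    resp = []
    for x in points:
        dens = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, m)) / (2 * var))
                for m in means]
        z = sum(dens)
        resp.append([d / z for d in dens])
    dim = len(points[0])
    new_means = []
    for j in range(len(means)):
        w = sum(r[j] for r in resp)
        new_means.append([sum(r[j] * x[d] for r, x in zip(resp, points)) / w
                          for d in range(dim)])
    return resp, new_means
```

Note how, unlike kMeans, every point contributes to every cluster center, weighted by its responsibility.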

<a name="ExpectationMaximizationCanopymodifiedEM"></a>
## Canopy-modified EM

One can also use the canopies idea to speed up prototype-based clustering
methods like KMeans and Expectation-Maximization (EM). In general, neither
KMeans nor EM specifies how many clusters to use. The canopies technique does
not help with this choice.

Prototypes (our estimates of the cluster centroids) are associated with the
canopies that contain them, and the prototypes are only influenced by data
that are inside their associated canopies. After creating the canopies, we
decide how many prototypes will be created for each canopy. This could be
done, for example, using the number of data points in a canopy and AIC or
BIC where points that occur in more than one canopy are counted
fractionally. Then we place prototypes into each canopy. This initial
placement can be random, as long as it is within the canopy in question, as
determined by the inexpensive distance metric.

Then, instead of calculating the distance from each prototype to every
point (as is traditional, an O(nk) operation), the E-step instead calculates
the distance from each prototype to a much smaller number of points. For
each prototype, we find the canopies that contain it (using the cheap
distance metric), and only calculate distances (using the expensive
distance metric) from that prototype to points within those canopies.
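The canopy-restricted E-step can be sketched as follows. This is an illustrative Python sketch; the function names and the 1-d example are ours, and both distance metrics are supplied by the caller:

```python
def canopy_assign(points, canopy_centers, t1, cheap_dist):
    """Assign each point index to every canopy whose center is within T1
    under the cheap distance metric (a point may fall in several canopies)."""
    canopies = {ci: [] for ci in range(len(canopy_centers))}
    for pi, p in enumerate(points):
        for ci, c in enumerate(canopy_centers):
            if cheap_dist(p, c) < t1:
                canopies[ci].append(pi)
    return canopies

def restricted_distances(points, prototypes, canopies, proto_canopy, dist):
    """For each prototype, compute the (expensive) distance only to points
    in the canopies that contain that prototype."""
    out = {}
    for j, proto in enumerate(prototypes):
        candidates = set()
        for ci in proto_canopy[j]:
            candidates.update(canopies[ci])
        out[j] = {pi: dist(points[pi], proto) for pi in candidates}
    return out
```

With well-separated canopies, each prototype is compared against only a fraction of the data instead of all n points.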

Note that by this procedure prototypes may move across canopy boundaries
when canopies overlap. Prototypes may move to cover the data in the
overlapping region, and then move entirely into another canopy in order to
cover data there.

The canopy-modified EM algorithm behaves very similarly to traditional EM,
with the slight difference that points outside the canopy have no influence
on points in the canopy, rather than a minute influence. If the canopy
property holds, and points in the same cluster fall in the same canopy,
then the canopy-modified EM will almost always converge to the same maximum
in likelihood as the traditional EM. In fact, the difference in each
iterative step (apart from the enormous computational savings of computing
fewer terms) will be negligible, since points outside the canopy will have
exponentially small influence.

<a name="ExpectationMaximizationStrategyforParallelization"></a>
## Strategy for Parallelization

<a name="ExpectationMaximizationMap/ReduceImplementation"></a>
## Map/Reduce Implementation

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/fuzzykmeanscommandline.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/fuzzykmeanscommandline.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/fuzzykmeanscommandline.md
deleted file mode 100644
index 1374682..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/fuzzykmeanscommandline.md
+++ /dev/null
@@ -1,97 +0,0 @@

---
layout: default
title: fuzzykmeanscommandline
theme:
  name: retro-mahout
---

<a name="fuzzykmeanscommandlineRunningFuzzykMeansClusteringfromtheCommandLine"></a>
# Running Fuzzy kMeans Clustering from the Command Line
Mahout's Fuzzy kMeans clustering can be launched from the same command-line
invocation whether you are running on a single machine in standalone
mode or on a larger Hadoop cluster. The difference is determined by the
$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
an operating Hadoop cluster on the target machine, then the invocation will
run Fuzzy kMeans on that cluster. If either of the environment variables is
missing, then the standalone Hadoop configuration will be invoked instead.


 ./bin/mahout fkmeans <OPTIONS>


* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
the Mahout version number. For example, when using the Mahout 0.3 release, the
job will be mahout-core-0.3.job


<a name="fuzzykmeanscommandlineTestingitononesinglemachinew/ocluster"></a>
## Testing it on a single machine without a cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job:

 ./bin/mahout fkmeans -i testdata <OPTIONS>


<a name="fuzzykmeanscommandlineRunningitonthecluster"></a>
## Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job:

 export HADOOP_HOME=<Hadoop Home Directory>
 export HADOOP_CONF_DIR=$HADOOP_HOME/conf
 ./bin/mahout fkmeans -i testdata <OPTIONS>

* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
to view all outputs.

<a name="fuzzykmeanscommandlineCommandlineoptions"></a>
# Command line options

  --input (-i) input                Path to job input directory.
                                      Must be a SequenceFile of
                                      VectorWritable
  --clusters (-c) clusters          The input centroids, as Vectors.
                                      Must be a SequenceFile of
                                      Writable, Cluster/Canopy. If k
                                      is also specified, then a random
                                      set of vectors will be selected
                                      and written out to this path
                                      first
  --output (-o) output              The directory pathname for
                                      output.
  --distanceMeasure (-dm) distanceMeasure  The classname of the
                                      DistanceMeasure. Default is
                                      SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta  The convergence delta value.
                                      Default is 0.5
  --maxIter (-x) maxIter            The maximum number of
                                      iterations.
  --k (-k) k                        The k in kMeans. If specified,
                                      then a random selection of k
                                      Vectors will be chosen as the
                                      Centroid and written to the
                                      clusters input path.
  --m (-m) m                        Coefficient normalization
                                      factor, must be greater than 1
  --overwrite (-ow)                 If present, overwrite the output
                                      directory before running the job
  --help (-h)                       Print out help
  --numMap (-u) numMap              The number of map tasks.
                                      Defaults to 10
  --maxRed (-r) maxRed              The number of reduce tasks.
                                      Defaults to 2
  --emitMostLikely (-e) emitMostLikely  True if clustering should emit
                                      the most likely point only,
                                      false for threshold clustering.
                                      Default is true
  --threshold (-t) threshold        The pdf threshold used for
                                      cluster determination. Default
                                      is 0
  --clustering (-cl)                If present, run clustering after
                                      the iterations have taken place

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/fuzzykmeans.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/fuzzykmeans.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/fuzzykmeans.md
deleted file mode 100644
index ec53e62..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/fuzzykmeans.md
+++ /dev/null
@@ -1,186 +0,0 @@

---
layout: default
title: Fuzzy KMeans
theme:
  name: retro-mahout
---

# Fuzzy KMeans

Fuzzy KMeans (also called Fuzzy C-Means) is an extension of [KMeans](http://mahout.apache.org/users/clustering/kmeansclustering.html),
the popular simple clustering technique. While KMeans discovers hard
clusters (a point belongs to only one cluster), Fuzzy KMeans is a more
statistically formalized method and discovers soft clusters, where a
particular point can belong to more than one cluster with a certain
probability.

<a name="FuzzyKMeansAlgorithm"></a>
#### Algorithm

Like KMeans, Fuzzy KMeans works on objects that can be represented
in an n-dimensional vector space in which a distance measure is defined.
The algorithm is similar to kMeans:

* Initialize k clusters
* Until converged
  * Compute the probability of each point belonging to each cluster, for every <point, cluster> pair
  * Recompute the cluster centers using the above probability membership values of points to clusters
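The two steps above can be sketched in a few lines. This is an illustrative in-memory Python sketch using the standard fuzzy c-means membership formula, not Mahout's Java implementation:

```python
def fuzzy_kmeans_step(points, centers, m=2.0):
    """One Fuzzy kMeans iteration: compute membership u[i][j] of point i
    in cluster j, then recompute centers as membership^m-weighted means."""
    def dist(a, b):
        return (sum((x - y) ** 2 for x, y in zip(a, b)) or 1e-12) ** 0.5
    e = 2.0 / (m - 1.0)
    u = []
    for p in points:
        d = [dist(p, c) for c in centers]
        u.append([1.0 / sum((d[j] / d[k]) ** e for k in range(len(centers)))
                  for j in range(len(centers))])
    dim = len(points[0])
    new_centers = []
    for j in range(len(centers)):
        w = [u[i][j] ** m for i in range(len(points))]
        tot = sum(w)
        new_centers.append([sum(w[i] * points[i][d] for i in range(len(points))) / tot
                            for d in range(dim)])
    return u, new_centers
```

Each row of u sums to 1, which is what makes the cluster assignment "soft".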

<a name="FuzzyKMeansDesignImplementation"></a>
#### Design Implementation

The design is similar to KMeans present in Mahout. It accepts an input
file containing vector points. The user can either provide the cluster centers
as input or allow the canopy algorithm to run and create the initial clusters.

Similar to KMeans, the program doesn't modify the input directories. For
every iteration, the cluster output is stored in a directory clusters-N.
The code sets the number of reduce tasks equal to the number of map tasks,
so that many part-0000 files are created in each clusters-N directory.
The code uses driver/mapper/combiner/reducer as follows:

FuzzyKMeansDriver - This is similar to KMeansDriver. It iterates over the
input points and cluster points for the specified number of iterations or until
convergence. During every iteration i, a new clusters-i directory is
created which contains the modified cluster centers obtained during the
FuzzyKMeans iteration. This is fed in as the input clusters for the next
iteration. Once Fuzzy KMeans has run for the specified number of
iterations or has converged, a map task is run to output "the point
and its cluster membership to each cluster" pairs as the final output to a
directory named "points".

FuzzyKMeansMapper - reads the input clusters during its configure() method,
then computes the cluster membership probability of each point to each
cluster. Cluster membership is inversely proportional to the distance.
Distance is computed using the user-supplied distance measure. The output key
is the encoded clusterId. The output values are ClusterObservations containing
observation statistics.

FuzzyKMeansCombiner - receives all key/value pairs from the mapper and
produces partial sums of the cluster membership probability times the input
vectors, for each cluster. The output key is the encoded cluster identifier. The output
values are ClusterObservations containing observation statistics.

FuzzyKMeansReducer - each of multiple reducers receives certain keys and all
values associated with those keys. The reducer sums the values to produce a new
centroid for the cluster, which is output. The output key is the encoded cluster
identifier (e.g. "C14"). The output value is the formatted cluster. The reducer
encodes unconverged clusters with a 'Cn' cluster Id and
converged clusters with a 'Vn' cluster Id.

<a name="FuzzyKMeansRunningFuzzykMeansClustering"></a>
## Running Fuzzy kMeans Clustering

The Fuzzy kMeans clustering algorithm may be run using a command-line
invocation on FuzzyKMeansDriver.main or by making a Java call to
FuzzyKMeansDriver.run().

Invocation using the command line takes the form:


 bin/mahout fkmeans \
     -i <input vectors directory> \
     -c <input clusters directory> \
     -o <output working directory> \
     -dm <DistanceMeasure> \
     -m <fuzziness argument >1> \
     -x <maximum number of iterations> \
     -k <optional number of initial clusters to sample from input vectors> \
     -cd <optional convergence delta. Default is 0.5> \
     -ow <overwrite output directory if present>
     -cl <run input vector clustering after computing Clusters>
     -e <emit vectors to most likely cluster during clustering>
     -t <threshold to use for clustering if -e is false>
     -xm <execution method: sequential or mapreduce>


*Note:* if the -k argument is supplied, any clusters in the -c directory
will be overwritten and k random points will be sampled from the input
vectors to become the initial cluster centers.

Invocation using Java involves supplying the following arguments:

1. input: a file path string to a directory containing the input data set, a
SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
is not used.
1. clustersIn: a file path string to a directory containing the initial
clusters, a SequenceFile(key, SoftCluster | Cluster | Canopy). Fuzzy
kMeans SoftClusters, kMeans Clusters and Canopy Canopies may be used for
the initial clusters.
1. output: a file path string to an empty directory which is used for all
output from the algorithm.
1. measure: the fullyqualified class name of an instance of DistanceMeasure
which will be used for the clustering.
1. convergence: a double value used to determine if the algorithm has
converged (clusters have not moved more than the value in the last
iteration)
1. maxiterations: the maximum number of iterations to run, independent of
the convergence specified
1. m: the "fuzziness" argument, a double > 1. For m equal to 2, this is
equivalent to normalising the coefficients linearly so that their sum is 1.
When m is close to 1, the cluster center closest to the point is given
much more weight than the others, and the algorithm is similar to kMeans.
1. runClustering: a boolean indicating, if true, that the clustering step is
to be executed after clusters have been determined.
1. emitMostLikely: a boolean indicating, if true, that the clustering step
should only emit the most likely cluster for each clustered point.
1. threshold: a double indicating, if emitMostLikely is false, the cluster
probability threshold used for emitting multiple clusters for each point. A
value of 0 will emit all clusters with their associated probabilities for
each vector.
1. runSequential: a boolean indicating, if true, that the algorithm is to
use the sequential reference implementation running in memory.
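The effect of the fuzziness argument m (item 7 above) can be seen in a small sketch. This is illustrative Python using the standard fuzzy c-means membership formula, not Mahout's implementation:

```python
def memberships(dists, m):
    """Fuzzy membership of one point, given its distances to each cluster
    center: u_j = 1 / sum_k (d_j / d_k)^(2/(m-1))."""
    e = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** e for dk in dists) for dj in dists]
```

For distances [1.0, 2.0], m near 1 yields a nearly hard assignment to the closer center, m = 2 yields [0.8, 0.2], and very large m approaches a uniform split.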

After running the algorithm, the output directory will contain:
1. clusters-N: directories containing SequenceFiles(Text, SoftCluster)
produced by the algorithm for each iteration. The Text _key_ is a cluster
identifier string.
1. clusteredPoints: (if runClustering enabled) a directory containing
SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
the clusterId. The WeightedVectorWritable _value_ is a bean containing a
double _weight_ and a VectorWritable _vector_. The weights are
computed as 1/(1+distance), where the distance is between the cluster center
and the vector, using the chosen DistanceMeasure.
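The 1/(1+distance) weight stored for each clustered point can be computed as below; an illustrative Python sketch, with Euclidean distance assumed for the DistanceMeasure:

```python
import math

def cluster_weight(point, center):
    """Weight stored in WeightedVectorWritable: 1/(1+distance).
    The weight is 1.0 at the center and falls toward 0 with distance."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, center)))
    return 1.0 / (1.0 + d)
```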

<a name="FuzzyKMeansExamples"></a>
# Examples

The following images illustrate Fuzzy kMeans clustering applied to a set
of randomly generated 2-d data points. The points are generated using a
normal distribution centered at a mean location and with a constant
standard deviation. See the [README file](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
for details on running similar examples.

The points are generated as follows:

* 500 samples m=[1.0, 1.0] sd=3.0
* 300 samples m=[1.0, 0.0] sd=0.5
* 300 samples m=[0.0, 2.0] sd=0.1

In the first image, the points are plotted and the 3-sigma boundaries of
their generator are superimposed.

![fuzzy](../../images/SampleData.png)

In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As Fuzzy kMeans is an iterative algorithm, the centers of the clusters in each iteration are shown using different colors. Bold red is the final clustering, and previous iterations are shown in orange, yellow, green, blue, violet and gray.
Although it misses a lot of the points and cannot capture the original,
superimposed cluster centers, it does a decent job of clustering this data.

![fuzzy](../../images/FuzzyKMeans.png)

The third image shows the results of running Fuzzy kMeans on a different
data set which is generated using asymmetrical standard deviations.
Fuzzy kMeans does a fair job handling this data set as well.

![fuzzy](../../images/2dFuzzyKMeans.png)

<a name="FuzzyKMeansReferences "></a>
#### References

* [http://en.wikipedia.org/wiki/Fuzzy_clustering](http://en.wikipedia.org/wiki/Fuzzy_clustering)
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/hierarchicalclustering.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/hierarchicalclustering.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/hierarchicalclustering.md
deleted file mode 100644
index 6c541cc..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/hierarchicalclustering.md
+++ /dev/null
@@ -1,15 +0,0 @@

---
layout: default
title: Hierarchical Clustering
theme:
  name: retro-mahout
---

Hierarchical clustering is the process of finding bigger clusters, and also
the smaller clusters inside the bigger clusters.

In Apache Mahout, separate algorithms can be used for finding clusters at
different levels.

See [Top Down Clustering](https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering).
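The top-down idea can be sketched as follows. This is an illustrative Python sketch; `mean_split` is a hypothetical stand-in for a real flat clusterer such as kMeans:

```python
def top_down(points, split, depth):
    """Top-down hierarchical clustering sketch: split the data with a flat
    clustering function, then recurse into each resulting sub-cluster."""
    if depth == 0 or len(points) <= 1:
        return points
    return [top_down(sub, split, depth - 1) for sub in split(points)]

def mean_split(values):
    """Toy flat clusterer for 1-d data: two clusters, below and above the
    mean. In Mahout, a separate algorithm could fill this role per level."""
    m = sum(values) / len(values)
    return [v for v in values if v < m], [v for v in values if v >= m]
```

Each level of the returned nesting corresponds to one run of the flat clusterer inside its parent cluster.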

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/kmeansclustering.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/kmeansclustering.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/kmeansclustering.md
deleted file mode 100644
index 5c25763..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/kmeansclustering.md
+++ /dev/null
@@ -1,182 +0,0 @@

---
layout: default
title: KMeans Clustering
theme:
  name: retro-mahout
---

# kMeans clustering - basics

[kMeans](http://en.wikipedia.org/wiki/K-means) is a simple but well-known algorithm for grouping objects by clustering. All objects need to be represented
as a set of numerical features. In addition, the user has to specify the
number of groups (referred to as *k*) she wishes to identify.

Each object can be thought of as being represented by some feature vector
in an _n_-dimensional space, _n_ being the number of features used to
describe the objects to cluster. The algorithm then randomly chooses _k_
points in that vector space; these points serve as the initial centers of
the clusters. Afterwards all objects are each assigned to the center they
are closest to. Usually the distance measure is chosen by the user and
determined by the learning task.

After that, for each cluster a new center is computed by averaging the
feature vectors of all objects assigned to it. The process of assigning
objects and recomputing centers is repeated until the process converges.
The algorithm can be proven to converge after a finite number of
iterations.
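The iteration described above can be sketched as a small in-memory program. This is an illustrative Python sketch only; Mahout's implementation is a distributed Java job:

```python
def kmeans(points, centers, max_iter=100):
    """Lloyd's kMeans: assign each point to its nearest center, recompute
    each center as the mean of its members, repeat until assignments stop
    changing (or max_iter is reached)."""
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(len(centers)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centers[j])))
                      for p in points]
        if new_assign == assign:  # converged: assignments are stable
            break
        assign = new_assign
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(x[d] for x in members) / len(members)
                              for d in range(len(points[0]))]
    return centers, assign
```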

Several tweaks concerning distance measure, initial center choice and
computation of new average centers have been explored, as well as the
estimation of the number of clusters _k_. Yet the main principle always
remains the same.



<a name="KMeansClusteringQuickstart"></a>
## Quickstart

[Here](https://github.com/apache/mahout/blob/master/examples/bin/cluster-reuters.sh)
is a short shell script outline that will get you started quickly with
kmeans. This does the following:

* Accepts clustering type: *kmeans*, *fuzzykmeans*, *lda*, or *streamingkmeans*
* Gets the Reuters dataset
* Runs org.apache.lucene.benchmark.utils.ExtractReuters to generate
reuters-out from reuters-sgm (the downloaded archive)
* Runs seqdirectory to convert reuters-out to SequenceFile format
* Runs seq2sparse to convert the SequenceFiles to sparse vector format
* Runs kmeans with 20 clusters
* Runs clusterdump to show results

After following through the output that scrolls past, reading the code will
offer you a better understanding.


<a name="KMeansClusteringDesignofimplementation"></a>
## Implementation

The implementation accepts two input directories: one for the data points
and one for the initial clusters. The data directory contains multiple
input files of SequenceFile(Key, VectorWritable), while the clusters
directory contains one or more SequenceFiles(Text, Cluster)
containing _k_ initial clusters or canopies. None of the input directories
are modified by the implementation, allowing experimentation with initial
clustering and convergence values.

Canopy clustering can be used to compute the initial clusters for KMeans:

 // run the CanopyDriver job
 CanopyDriver.runJob("testdata", "output",
   ManhattanDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1, false);

 // now run the KMeansDriver job
 KMeansDriver.runJob("testdata", "output/clusters-0", "output",
   EuclideanDistanceMeasure.class.getName(), "0.001", "10", true);


In the above example, the input data points are stored in 'testdata' and
the CanopyDriver is configured to output to the 'output/clusters-0'
directory. Once the driver executes, that directory will contain the canopy
definition files. Upon running the KMeansDriver, the output directory will have
two or more new directories: 'clusters-N' containing the clusters for each
iteration, and 'clusteredPoints' containing the clustered data points.

This diagram shows an example dataflow of the kMeans
implementation provided by Mahout:
<img src="../../images/Example implementation of kMeans provided with Mahout.png">


<a name="KMeansClusteringRunningkMeansClustering"></a>
## Running kMeans Clustering

The kMeans clustering algorithm may be run using a command-line invocation
on KMeansDriver.main or by making a Java call to KMeansDriver.runJob().

Invocation using the command line takes the form:


 bin/mahout kmeans \
     -i <input vectors directory> \
     -c <input clusters directory> \
     -o <output working directory> \
     -k <optional number of initial clusters to sample from input vectors> \
     -dm <DistanceMeasure> \
     -x <maximum number of iterations> \
     -cd <optional convergence delta. Default is 0.5> \
     -ow <overwrite output directory if present>
     -cl <run input vector clustering after computing Canopies>
     -xm <execution method: sequential or mapreduce>


Note: if the -k argument is supplied, any clusters in the -c directory
will be overwritten and k random points will be sampled from the input
vectors to become the initial cluster centers.

Invocation using Java involves supplying the following arguments:

1. input: a file path string to a directory containing the input data set,
a SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
is not used.
1. clusters: a file path string to a directory containing the initial
clusters, a SequenceFile(key, Cluster | Canopy). Both KMeans clusters and
Canopy canopies may be used for the initial clusters.
1. output: a file path string to an empty directory which is used for all
output from the algorithm.
1. distanceMeasure: the fully-qualified class name of an instance of
DistanceMeasure which will be used for the clustering.
1. convergenceDelta: a double value used to determine if the algorithm has
converged (clusters have not moved more than the value in the last
iteration)
1. maxIter: the maximum number of iterations to run, independent of the
convergence specified
1. runClustering: a boolean indicating, if true, that the clustering step is
to be executed after clusters have been determined.
1. runSequential: a boolean indicating, if true, that the kmeans sequential
implementation is to be used to process the input data.

After running the algorithm, the output directory will contain:
1. clusters-N: directories containing SequenceFiles(Text, Cluster) produced
by the algorithm for each iteration. The Text _key_ is a cluster identifier
string.
1. clusteredPoints: (if -cl is enabled) a directory containing
SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
the clusterId. The WeightedVectorWritable _value_ is a bean containing a
double _weight_ and a VectorWritable _vector_ where the weight indicates
the probability that the vector is a member of the cluster. For kMeans
clustering, the weights are computed as 1/(1+distance) where the distance
is between the cluster center and the vector using the chosen
DistanceMeasure.
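The weight formula above can be illustrated with a small Python sketch (illustrative only, not Mahout's Java code; `euclidean` stands in for whichever DistanceMeasure was chosen):

```python
import math

def euclidean(a, b):
    # Stand-in for the chosen DistanceMeasure.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def membership_weight(center, vector):
    # Weight as described above: 1 / (1 + distance(center, vector)).
    return 1.0 / (1.0 + euclidean(center, vector))

# A vector sitting exactly on the cluster center gets weight 1.0;
# weights decay toward 0 as the distance to the center grows.
w_near = membership_weight([0.0, 0.0], [0.0, 0.0])
w_far = membership_weight([0.0, 0.0], [3.0, 4.0])  # distance 5 -> weight 1/6
```

Note the weight is a monotone function of distance, so it orders points by closeness rather than being a calibrated probability.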

<a name="KMeansClusteringExamples"></a>
# Examples

The following images illustrate kMeans clustering applied to a set of
randomly-generated 2-d data points. The points are generated using a normal
distribution centered at a mean location and with a constant standard
deviation. See the README file at [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
for details on running similar examples.

The points are generated as follows:

* 500 samples m=[1.0, 1.0] sd=3.0
* 300 samples m=[1.0, 0.0] sd=0.5
* 300 samples m=[0.0, 2.0] sd=0.1

In the first image, the points are plotted and the 3-sigma boundaries of
their generator are superimposed.

![Sample data graph](../../images/SampleData.png)

In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As kMeans is an iterative algorithm, the centers of the clusters in each iteration are shown using different colors. Bold red is the final clustering, and previous iterations are shown in orange, yellow, green, blue, violet and gray. Although it misses a lot of the points and cannot capture the original,
superimposed cluster centers, it does a decent job of clustering this data.

![kmeans](../../images/KMeans.png)

The third image shows the results of running kMeans on a different dataset, generated using asymmetrical standard deviations.
KMeans does a fair job handling this data set as well.

![2d kmeans](../../images/2dKMeans.png)
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/kmeanscommandline.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/kmeanscommandline.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/kmeanscommandline.md
deleted file mode 100644
index 8d802f8..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/kmeanscommandline.md
+++ /dev/null
@@ -1,94 +0,0 @@

---
layout: default
title: kmeanscommandline
theme:
 name: retro-mahout
---


<a name="kmeanscommandlineIntroduction"></a>
# kMeans commandline introduction

This quick start page describes how to run the kMeans clustering algorithm
on a Hadoop cluster.

<a name="kmeanscommandlineSteps"></a>
# Steps

Mahout's kMeans clustering can be launched from the same command line
invocation whether you are running on a single machine in standalone mode
or on a larger Hadoop cluster. The difference is determined by the
$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
an operating Hadoop cluster on the target machine then the invocation will
run kMeans on that cluster. If either of the environment variables are
missing then the standalone Hadoop configuration will be invoked instead.


 ./bin/mahout kmeans <OPTIONS>


In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
the Mahout version number. For example, when using the Mahout 0.3 release,
the job will be mahout-core-0.3.job.


<a name="kmeanscommandlineTestingitononesinglemachinew/ocluster"></a>
## Testing it on one single machine w/o cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job:

 ./bin/mahout kmeans -i testdata -o output -c clusters -dm
 org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k
 25


<a name="kmeanscommandlineRunningitonthecluster"></a>
## Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job:

 export HADOOP_HOME=<Hadoop Home Directory>
 export HADOOP_CONF_DIR=$HADOOP_HOME/conf
 ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
to view all outputs.

<a name="kmeanscommandlineCommandlineoptions"></a>
# Command line options

  --input (-i) input                   Path to job input directory.
                                       Must be a SequenceFile of
                                       VectorWritable
  --clusters (-c) clusters             The input centroids, as Vectors.
                                       Must be a SequenceFile of
                                       Writable, Cluster/Canopy. If -k
                                       is also specified, then a random
                                       set of vectors will be selected
                                       and written out to this path
                                       first
  --output (-o) output                 The directory pathname for
                                       output.
  --distanceMeasure (-dm) distanceMeasure   The classname of the
                                       DistanceMeasure. Default is
                                       SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta The convergence delta value.
                                       Default is 0.5
  --maxIter (-x) maxIter               The maximum number of
                                       iterations.
  --maxRed (-r) maxRed                 The number of reduce tasks.
                                       Defaults to 2
  --k (-k) k                           The k in kMeans. If specified,
                                       then a random selection of k
                                       Vectors will be chosen as the
                                       Centroid and written to the
                                       clusters input path.
  --overwrite (-ow)                    If present, overwrite the output
                                       directory before running job
  --help (-h)                          Print out help
  --clustering (-cl)                   If present, run clustering after
                                       the iterations have taken place

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/latentdirichletallocation.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/latentdirichletallocation.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/latentdirichletallocation.md
deleted file mode 100644
index 871cea2..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/latentdirichletallocation.md
+++ /dev/null
@@ -1,155 +0,0 @@

---
layout: default
title: Latent Dirichlet Allocation
theme:
 name: retro-mahout
---


<a name="LatentDirichletAllocationOverview"></a>
# Overview

Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
algorithm for automatically and jointly clustering words into "topics" and
documents into mixtures of topics. It has been successfully applied to
model change in scientific fields over time (Griffiths and Steyvers, 2004;
Hall, et al. 2008).

A topic model is, roughly, a hierarchical Bayesian model that associates
with each document a probability distribution over "topics", which are in
turn distributions over words. For instance, a topic in a collection of
newswire might include words about "sports", such as "baseball", "home
run", "player", and a document about steroid use in baseball might include
"sports", "drugs", and "politics". Note that the labels "sports", "drugs",
and "politics" are post-hoc labels assigned by a human, and that the
algorithm itself only associates words with probabilities. The task
of parameter estimation in these models is to learn both what the topics
are, and which documents employ them in what proportions.

Another way to view a topic model is as a generalization of a mixture model
like [Dirichlet Process Clustering](http://en.wikipedia.org/wiki/Dirichlet_process)
. Starting from a normal mixture model, in which we have a single global
mixture of several distributions, we instead say that _each_ document has
its own mixture distribution over the globally shared mixture components.
Operationally in Dirichlet Process Clustering, each document has its own
latent variable drawn from a global mixture that specifies which model it
belongs to, while in LDA each word in each document has its own parameter
drawn from a document-wide mixture.

The idea is that we use a probabilistic mixture of a number of models that
we use to explain some observed data. Each observed data point is assumed
to have come from one of the models in the mixture, but we don't know
which. The way we deal with that is to use a so-called latent parameter
which specifies which model each data point came from.
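As a toy illustration of such a latent parameter, consider a hypothetical two-component mixture in Python (the weights, means and standard deviations below are made up for the example):

```python
import random

random.seed(7)

# Hypothetical global mixture: two Gaussian components with weights 0.3/0.7.
weights = [0.3, 0.7]
means = [0.0, 5.0]
sds = [1.0, 1.0]

def sample_point():
    # The latent parameter z picks which component generated the point;
    # inference works backwards from observations to a posterior over z.
    z = 0 if random.random() < weights[0] else 1
    return z, random.gauss(means[z], sds[z])

points = [sample_point() for _ in range(1000)]
share_z1 = sum(z for z, _ in points) / len(points)  # close to weights[1]
```

In LDA the same idea applies per word rather than per document: each word's latent topic assignment is drawn from its document's own mixture over the shared topics.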

<a name="LatentDirichletAllocationCollapsedVariationalBayes"></a>
# Collapsed Variational Bayes
The CVB algorithm implemented in Mahout for LDA combines
advantages of both regular Variational Bayes and Gibbs Sampling. The
algorithm relies on modeling the dependence of parameters on latent
variables, which are in turn mutually independent. The algorithm uses two
methodologies: one marginalizes out the parameters when calculating the
joint distribution, and the other models the posterior of theta and phi
given the inputs z and x.

A common solution in the CVB algorithm is to compute each expectation term
using a simple Gaussian approximation, which is accurate and requires
little computational overhead. The specifics behind the approximation
involve computing the sum of the means and variances of the individual
Bernoulli variables.
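The moment bookkeeping behind that approximation can be sketched directly: a sum of independent Bernoulli variables with parameters p_i has mean sum(p_i) and variance sum(p_i * (1 - p_i)), so it can be approximated by a Gaussian with those moments. An illustrative Python fragment (the `probs` values are made up):

```python
# Made-up Bernoulli parameters for the individual indicator variables.
probs = [0.2, 0.7, 0.5, 0.9]

# Moments of the sum of independent Bernoulli variables.
mean = sum(probs)
var = sum(p * (1 - p) for p in probs)

# Excluding one variable's contribution (as when updating a single
# q(z(i,j))) is plain subtraction of its mean and variance.
mean_minus_first = mean - probs[0]
var_minus_first = var - probs[0] * (1 - probs[0])
```

This is why the update described below only needs subtraction: the running totals are maintained, and each variable's contribution can be removed and re-added in constant time.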

CVB with the Gaussian approximation is implemented by tracking the mean and
variance and subtracting the mean and variance of the corresponding
Bernoulli variables. The computational cost of the algorithm scales on
the order of O(K) with each update to q(z(i,j)). Also, for each
document/word pair only one copy of the variational posterior over the
latent variable is required.

<a name="LatentDirichletAllocationInvocationandUsage"></a>
# Invocation and Usage

Mahout's implementation of LDA operates on a collection of SparseVectors of
word counts. These word counts should be non-negative integers, though
things will probably work fine if you use non-negative reals. (Note
that the probabilistic model doesn't make sense if you do!) To create these
vectors, it's recommended that you follow the instructions in [Creating Vectors From Text](../basics/creating-vectors-from-text.html),
making sure to use TF and not TF-IDF as the scorer.

Invocation takes the form:


 bin/mahout cvb \
    -i <input path for document vectors> \
    -dict <path to term-dictionary file(s), glob expression supported> \
    -o <output path for topic-term distributions> \
    -dt <output path for doc-topic distributions> \
    -k <number of latent topics> \
    -nt <number of unique features defined by input document vectors> \
    -mt <path to store model state after each iteration> \
    -maxIter <max number of iterations> \
    -mipd <max number of iterations per doc for learning> \
    -a <smoothing for doc topic distributions> \
    -e <smoothing for term topic distributions> \
    -seed <random seed> \
    -tf <fraction of data to hold for testing> \
    -block <number of iterations per perplexity check, ignored unless test_set_percentage>0>


Topic smoothing should generally be about 50/K, where K is the number of
topics. The number of words in the vocabulary can be an upper bound, though
it shouldn't be too high (for memory concerns).

Choosing the number of topics is more art than science, and it's
recommended that you try several values.

After running LDA you can obtain an output of the computed topics using the
LDAPrintTopics utility:


 bin/mahout ldatopics \
    -i <input vectors directory> \
    -d <input dictionary file> \
    -w <optional number of words to print> \
    -o <optional output working directory. Default is to console> \
    -h <print out help> \
    -dt <optional dictionary type (text|sequencefile). Default is text>



<a name="LatentDirichletAllocationExample"></a>
# Example

An example is located in mahout/examples/bin/build-reuters.sh. The script
automatically downloads the Reuters-21578 corpus, builds a Lucene index and
converts the Lucene index to vectors. By uncommenting the last two lines
you can then cause it to run LDA on the vectors and finally print the
resultant topics to the console.

To adapt the example yourself, you should note that Lucene has specialized
support for Reuters, and that building your own index will require some
adaptation. The rest should hopefully not differ too much.

<a name="LatentDirichletAllocationParameterEstimation"></a>
# Parameter Estimation

We use mean field variational inference to estimate the models. Variational
inference can be thought of as a generalization of [EM](expectationmaximization.html)
for hierarchical Bayesian models. The E-Step takes the form of, for each
document, inferring the posterior probability of each topic for each word
in each document. We then take the sufficient statistics and emit them in
the form of (log) pseudo-counts for each word in each topic. The M-Step is
simply to sum these together and (log) normalize them so that we have a
distribution over the entire vocabulary of the corpus for each topic.

In implementation, the E-Step is carried out in the map phase, and the
M-Step in the reduce phase, with the final normalization happening as a
post-processing step.

<a name="LatentDirichletAllocationReferences"></a>
# References

[David M. Blei, Andrew Y. Ng, Michael I. Jordan, John Lafferty. 2003. Latent Dirichlet Allocation. JMLR.](http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf)

[Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS. ](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf)

[David Hall, Dan Jurafsky, and Christopher D. Manning. 2008. Studying the History of Ideas Using Topic Models](http://aclweb.org/anthology/D/D08/D08-1038.pdf)
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/ldacommandline.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/ldacommandline.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/ldacommandline.md
deleted file mode 100644
index 613e90b..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/ldacommandline.md
+++ /dev/null
@@ -1,83 +0,0 @@

---
layout: default
title: ldacommandline
theme:
 name: retro-mahout
---


<a name="ldacommandlineRunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a>
# Running Latent Dirichlet Allocation (algorithm) from the Command Line
[Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897),
lda has been implemented as Collapsed Variational Bayes (cvb).

Mahout's LDA can be launched from the same command line invocation whether
you are running on a single machine in standalone mode or on a larger
Hadoop cluster. The difference is determined by the $HADOOP_HOME and
$HADOOP_CONF_DIR environment variables. If both are set to an operating
Hadoop cluster on the target machine then the invocation will run the LDA
algorithm on that cluster. If either of the environment variables are
missing then the standalone Hadoop configuration will be invoked instead.



 ./bin/mahout cvb <OPTIONS>


* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
the Mahout version number. For example, when using the Mahout 0.3 release,
the job will be mahout-core-0.3.job.


<a name="ldacommandlineTestingitononesinglemachinew/ocluster"></a>
## Testing it on one single machine w/o cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job:

 ./bin/mahout cvb -i testdata <OTHER OPTIONS>


<a name="ldacommandlineRunningitonthecluster"></a>
## Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job:

 export HADOOP_HOME=<Hadoop Home Directory>
 export HADOOP_CONF_DIR=$HADOOP_HOME/conf
 ./bin/mahout cvb -i testdata <OTHER OPTIONS>

* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
to view all outputs.

<a name="ldacommandlineCommandlineoptionsfromMahoutcvbversion0.8"></a>
# Command line options from Mahout cvb version 0.8

 mahout cvb -h
  --input (-i) input                                   Path to job input directory.
  --output (-o) output                                 The directory pathname for output.
  --maxIter (-x) maxIter                               The maximum number of iterations.
  --convergenceDelta (-cd) convergenceDelta            The convergence delta value
  --overwrite (-ow)                                    If present, overwrite the output directory before running job
  --num_topics (-k) num_topics                         Number of topics to learn
  --num_terms (-nt) num_terms                          Vocabulary size
  --doc_topic_smoothing (-a) doc_topic_smoothing       Smoothing for document/topic distribution
  --term_topic_smoothing (-e) term_topic_smoothing     Smoothing for topic/term distribution
  --dictionary (-dict) dictionary                      Path to term-dictionary file(s) (glob expression supported)
  --doc_topic_output (-dt) doc_topic_output            Output path for the training doc/topic distribution
  --topic_model_temp_dir (-mt) topic_model_temp_dir    Path to intermediate model path (useful for restarting)
  --iteration_block_size (-block) iteration_block_size Number of iterations per perplexity check
  --random_seed (-seed) random_seed                    Random seed
  --test_set_fraction (-tf) test_set_fraction          Fraction of data to hold out for testing
  --num_train_threads (-ntt) num_train_threads         Number of threads per mapper to train with
  --num_update_threads (-nut) num_update_threads       Number of threads per mapper to update the model with
  --max_doc_topic_iters (-mipd) max_doc_topic_iters    Max number of iterations per doc for p(topic|doc) learning
  --num_reduce_tasks num_reduce_tasks                  Number of reducers to use during model estimation
  --backfill_perplexity                                Enable backfilling of missing perplexity values
  --help (-h)                                          Print out help
  --tempDir tempDir                                    Intermediate output directory
  --startPhase startPhase                              First phase to run
  --endPhase endPhase                                  Last phase to run

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/llrloglikelihoodratio.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/llrloglikelihoodratio.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/llrloglikelihoodratio.md
deleted file mode 100644
index d6b7e18..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/llrloglikelihoodratio.md
+++ /dev/null
@@ -1,46 +0,0 @@

---
layout: default
title: LLR - Log-likelihood Ratio
theme:
 name: retro-mahout
---


# Likelihood ratio test

_The likelihood ratio test is used to compare the fit of two models, one
of which is nested within the other._

In the context of machine learning and the Mahout project in particular,
the term LLR usually refers to a test of significance for two
binomial distributions, also known as the G-squared statistic. This is a
special case of the multinomial test and is closely related to mutual
information. The value of this statistic is not normally used in this
context as a true frequentist test of significance, since there would be
obvious and dreadful problems to do with multiple comparisons, but rather
as a heuristic score to order pairs of items, with the most interestingly
connected items having higher scores. In this usage, the LLR has proven
very useful for discriminating between pairs of features that have
interesting degrees of co-occurrence and those that do not, with usefully
small false positive and false negative rates. The LLR is typically far
more suitable in the case of small counts than many other measures such as
Pearson's correlation, Pearson's chi-squared statistic or z statistics.
The LLR as stated does not, however, make any use of rating data, which
can limit its applicability in problems such as the Netflix competition.

The actual value of the LLR is not usually very helpful other than as a way
of ordering pairs of items. As such, it is often used to determine a
sparse set of coefficients to be estimated by other means such as TF-IDF.
Since the actual estimation of these coefficients can be done in a way that
is independent of the training data such as by general corpus statistics,
and since the ordering imposed by the LLR is relatively robust to counting
fluctuation, this technique can provide very strong results in very sparse
problems where the potential number of features vastly outnumbers the
number of training examples and where features are highly interdependent.
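For concreteness, the G-squared score for a 2x2 contingency table of co-occurrence counts can be computed from row, column, and matrix entropies. The sketch below is an illustrative Python rendering of that formulation, not Mahout's actual Java code:

```python
import math

def xlogx(x):
    return x * math.log(x) if x > 0 else 0.0

def raw_entropy(*counts):
    # Unnormalized entropy: xlogx of the total minus the per-cell terms.
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # G-squared score for a 2x2 table: k11 counts co-occurrences of the two
    # items, k12/k21 occurrences of one without the other, and k22
    # occurrences of neither.
    row = raw_entropy(k11 + k12, k21 + k22)
    col = raw_entropy(k11 + k21, k12 + k22)
    mat = raw_entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

weak = llr(10, 10, 10, 10)    # perfectly independent counts
strong = llr(100, 1, 1, 100)  # strongly associated counts
```

As described above, the absolute values matter less than the ordering: `strong` scores far higher than `weak`, whose counts are exactly independent and score 0.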

See Also:

* [Blog post "surprise and coincidence"](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
* [G-test](http://en.wikipedia.org/wiki/G-test)
* [Likelihood Ratio Test](http://en.wikipedia.org/wiki/Likelihood-ratio_test)


\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/spectralclustering.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/spectralclustering.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/spectralclustering.md
deleted file mode 100644
index d0f5199..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/spectralclustering.md
+++ /dev/null
@@ -1,84 +0,0 @@

---
layout: default
title: Spectral Clustering
theme:
 name: retro-mahout
---


# Spectral Clustering Overview

Spectral clustering, as its name implies, makes use of the spectrum (or eigenvalues) of the similarity matrix of the data. It examines the _connectedness_ of the data, whereas other clustering algorithms such as kmeans use the _compactness_ to assign clusters. Consequently, in situations where kmeans performs well, spectral clustering will also perform well. Additionally, there are situations in which kmeans will underperform (e.g. concentric circles), but spectral clustering will be able to segment the underlying clusters. Spectral clustering is also very useful for image segmentation.

At its simplest, spectral clustering relies on the following four steps:

 1. Computing a similarity (or _affinity_) matrix `\(\mathbf{A}\)` from the data. This involves determining a pairwise distance function `\(f\)` that takes a pair of data points and returns a scalar.

 2. Computing a graph Laplacian `\(\mathbf{L}\)` from the affinity matrix. There are several types of graph Laplacians; which one is used often depends on the situation.

 3. Computing the eigenvectors and eigenvalues of `\(\mathbf{L}\)`. The degree of this decomposition is often modulated by `\(k\)`, or the number of clusters. Put another way, `\(k\)` eigenvectors and eigenvalues are computed.

 4. The `\(k\)` eigenvectors are used as "proxy" data for the original dataset, and fed into kmeans clustering. The resulting cluster assignments are transparently passed back to the original data.

For more theoretical background on spectral clustering, such as how affinity matrices are computed, the different types of graph Laplacians, and whether the top or bottom eigenvectors and eigenvalues are computed, please read [Ulrike von Luxburg's article in _Statistics and Computing_ from December 2007](http://link.springer.com/article/10.1007/s112220079033z). It provides an excellent description of the linear algebra operations behind spectral clustering, and imbues a thorough understanding of the types of situations in which it can be used.

# Mahout Spectral Clustering

As of Mahout 0.3, spectral clustering has been implemented to take advantage of the MapReduce framework. It uses [SSVD](http://mahout.apache.org/users/dimreduction/ssvd.html) for dimensionality reduction of the input data set, and [kmeans](http://mahout.apache.org/users/clustering/kmeansclustering.html) to perform the final clustering.

**([MAHOUT1538](https://issues.apache.org/jira/browse/MAHOUT1538) will port the existing Hadoop MapReduce implementation to Mahout DSL, allowing for one of several distinct distributed backends to conduct the computation)**

## Input

The input format for the algorithm currently takes the form of a Hadoop-backed affinity matrix in the form of text files. Each line of the text file specifies a single element of the affinity matrix: the row index `\(i\)`, the column index `\(j\)`, and the value:

`i, j, value`

The affinity matrix is symmetric, and any unspecified `\(i, j\)` pairs are assumed to be 0 for sparsity. The row and column indices are 0-indexed. Thus, only the nonzero entries of either the upper or lower triangular need be specified.

The matrix elements specified in the text files are collected into a Mahout `DistributedRowMatrix`.

**([MAHOUT1539](https://issues.apache.org/jira/browse/MAHOUT1539) will allow for the creation of the affinity matrix to occur as part of the core spectral clustering algorithm, as opposed to the current requirement that the user create this matrix themselves and provide it, rather than the original data, to the algorithm)**

## Running spectral clustering

**([MAHOUT1540](https://issues.apache.org/jira/browse/MAHOUT1540) will provide a running example of this algorithm and this section will be updated to show how to run the example and what the expected output should be; until then, this section provides a howto for simply running the algorithm on arbitrary input)**

Spectral clustering can be invoked with the following arguments.

 bin/mahout spectralkmeans \
    -i <affinity matrix directory> \
    -o <output working directory> \
    -d <number of data points> \
    -k <number of clusters AND number of top eigenvectors to use> \
    -x <maximum number of kmeans iterations>

The affinity matrix can be contained in a single text file (using the aforementioned one-line-per-entry format) or span many text files (per [MAHOUT-978](https://issues.apache.org/jira/browse/MAHOUT-978), do not prefix text files with a leading underscore '_' or period '.'). The `-d` flag is required for the algorithm to know the dimensions of the affinity matrix. `-k` is the number of top eigenvectors of the normalized graph Laplacian used in the SSVD step, and also the number of clusters given to kmeans after the SSVD step.

## Example

To provide a simple example, take the following affinity matrix, contained in a text file called `affinity.txt`:

 0, 0, 0
 0, 1, 0.8
 0, 2, 0.5
 1, 0, 0.8
 1, 1, 0
 1, 2, 0.9
 2, 0, 0.5
 2, 1, 0.9
 2, 2, 0

With this 3-by-3 matrix, `-d` would be `3`. Furthermore, since all affinity matrices are assumed to be symmetric, the entries specifying both `1, 2, 0.9` and `2, 1, 0.9` are redundant; only one of these is needed. Additionally, any entries that are 0, such as those along the diagonal, need not be specified at all. They are provided here for completeness.
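A minimal sketch of how this one-entry-per-line format could be read into a dense symmetric matrix (illustrative Python, not the Mahout loader, which collects the entries into a `DistributedRowMatrix`):

```python
def parse_affinity(lines, d):
    # Build a dense d-by-d matrix; unspecified entries stay 0.0.
    a = [[0.0] * d for _ in range(d)]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        i_s, j_s, v_s = line.split(",")
        i, j, v = int(i_s), int(j_s), float(v_s)
        a[i][j] = v
        a[j][i] = v  # symmetry: one triangular suffices
    return a

text = """\
0, 1, 0.8
0, 2, 0.5
1, 2, 0.9
"""
affinity = parse_affinity(text.splitlines(), d=3)
```

Only the upper triangular entries are supplied here, yet `affinity[2][1]` comes back as `0.9`, and the unspecified diagonal stays at `0.0`.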

In general, larger values indicate a stronger "connectedness", whereas smaller values indicate a weaker connectedness. This will vary somewhat depending on the distance function used, though a common one is the [RBF kernel](http://en.wikipedia.org/wiki/RBF_kernel) (used in the above example) which returns values in the range [0, 1], where 0 indicates completely disconnected (or completely dissimilar) and 1 is fully connected (or identical).

The call signature with this matrix could be as follows:

 bin/mahout spectralkmeans \
    -i s3://mahoutexample/input/ \
    -o s3://mahoutexample/output/ \
    -d 3 \
    -k 2 \
    -x 10

There are many other optional arguments, in particular for tweaking the SSVD process (block size, number of power iterations, etc) and the kmeans clustering step (distance measure, convergence delta, etc).
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/mapreduce/clustering/streamingkmeans.md

diff --git a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/streamingkmeans.md b/website/old_site_migration/needs_work_convenience/mapreduce/clustering/streamingkmeans.md
deleted file mode 100644
index 81248de..0000000
--- a/website/old_site_migration/needs_work_convenience/mapreduce/clustering/streamingkmeans.md
+++ /dev/null
@@ -1,174 +0,0 @@

---
layout: default
title: StreamingKMeans
theme:
 name: retro-mahout
---


# *StreamingKMeans* algorithm

The *StreamingKMeans* algorithm is a variant of Algorithm 1 from [Shindler et al][1] and consists of two steps:

 1. Streaming step
 2. BallKMeans step.

The streaming step is a randomized algorithm that makes one pass through the data and
produces as many centroids as it determines is optimal. This step can be viewed as
a preparatory dimensionality reduction. If the size of the data stream is *n* and the
expected number of clusters is *k*, the streaming step will produce roughly *k\*log(n)*
clusters that will be passed on to the BallKMeans step which will further reduce the
number of clusters down to *k*. BallKMeans is a randomized Lloyd-type algorithm that
has been studied in detail, see [Ostrovsky et al][2].

## Streaming step



### Overview

The streaming step is a derivative of the streaming
portion of Algorithm 1 in [Shindler et al][1]. The main difference between the two is that
Algorithm 1 of [Shindler et al][1] assumes
the knowledge of the size of the data stream and uses it to set a key parameter
for the algorithm. More precisely, the initial *distanceCutoff* (defined below), which is
denoted by *f* in [Shindler et al][1], is set to *1/(k(1+log(n)))*. The *distanceCutoff* influences the number of clusters that the algorithm
will produce.
In contrast, the Mahout implementation does not require knowledge of the size of the
data stream. Instead, it dynamically re-evaluates the parameters that depend on the size
of the data stream at runtime as more and more data is processed. In particular,
the parameter *numClusters* (defined below) changes its value as the data is processed.

### Parameters

* **numClusters** (int): Conceptually, *numClusters* represents the algorithm's guess at the optimal number of clusters it is shooting for. In particular, *numClusters* will increase at run time as more and more data is processed. Note that *numClusters* is not the number of clusters that the algorithm will produce. Also, *numClusters* should not be set to the final number of clusters that we expect to receive as the output of *StreamingKMeans*.

* **distanceCutoff** (double): a parameter representing the value of the distance between a point and its closest centroid after which the new point will definitely be assigned to a new cluster. *distanceCutoff* can be thought of as an estimate of the variable *f* from [Shindler et al][1]. The default initial value for *distanceCutoff* is *1.0/numClusters*, and *distanceCutoff* grows as a geometric progression with common ratio *beta* (see below).

* **beta** (double): a constant parameter that controls the growth of *distanceCutoff*. If the initial setting of *distanceCutoff* is *d0*, *distanceCutoff* grows as the geometric progression with initial term *d0* and common ratio *beta*. The default value for *beta* is 1.3.

* **clusterLogFactor** (double): a constant parameter such that *clusterLogFactor* \* *log(numProcessedPoints)* is the runtime estimate of the number of clusters to be produced by the streaming step. If the final number of clusters (that we expect *StreamingKMeans* to output) is *k*, *clusterLogFactor* can be set to *k*.

* **clusterOvershoot** (double): a constant multiplicative slack factor that slows down the collapsing of clusters. The default value is 2.
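To make the growth schedule of *distanceCutoff* concrete, here is a small hedged sketch (the helper name is ours, not Mahout API): with the defaults, the cutoff starts at *1.0/numClusters* and is multiplied by *beta* each time it grows.

```python
def distance_cutoff_after(d0, beta, num_increases):
    """distanceCutoff follows a geometric progression with common ratio beta."""
    return d0 * beta ** num_increases

# with the defaults: initial cutoff 1.0/numClusters and beta = 1.3
num_clusters = 10
d0 = 1.0 / num_clusters
print(distance_cutoff_after(d0, 1.3, 3))  # d0 * 1.3^3
```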


### Algorithm

The algorithm processes the data points one by one and makes only one pass through the data.
The first point from the data stream forms the centroid of the first cluster (this designation may change as more points are processed). Suppose there are *r* clusters at one point and a new point *p* is being processed. The new point can either be added to one of the existing *r* clusters or become a new cluster. To decide:

* let *c* be the closest cluster to point *p*
* let *d* be the distance between *c* and *p*
* if *d > distanceCutoff*, create a new cluster from *p* (*p* is too far away from the existing clusters to be part of any one of them)
* else (*d <= distanceCutoff*), create a new cluster with probability *d / distanceCutoff* (the probability of creating a new cluster increases as *d* increases)

There will be either *r* or *r+1* clusters after processing a new point.

As the number of clusters increases, it will eventually exceed the *clusterOvershoot \* numClusters* limit (*numClusters* represents a recommendation for the number of clusters that the streaming step should aim for and *clusterOvershoot* is the slack). To decrease the number of clusters, the existing clusters are treated as data points and are re-clustered (collapsed). If the number of clusters is still too high, *distanceCutoff* is increased.
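The streaming loop can be sketched in a few lines. This is a hedged, simplified Python sketch, not the actual Mahout Java implementation: the collapse step here is a greedy merge rather than a full re-clustering, and points that join an existing cluster are simply dropped instead of being averaged into the centroid.

```python
import math
import random

def streaming_cluster(points, num_clusters, distance_cutoff, beta=1.3, overshoot=2.0):
    """One-pass streaming step, heavily simplified; the real Mahout code also
    tracks cluster weights and grows numClusters at runtime."""
    centroids = [list(points[0])]          # first point seeds the first cluster
    for p in points[1:]:
        d = min(math.dist(p, c) for c in centroids)
        # far points always seed a new cluster; near points do so
        # with probability d / distance_cutoff
        if d > distance_cutoff or random.random() < d / distance_cutoff:
            centroids.append(list(p))
        # too many clusters: raise the cutoff geometrically and collapse
        # (a greedy merge stands in for a full re-clustering pass here)
        while len(centroids) > overshoot * num_clusters:
            distance_cutoff *= beta
            merged = [centroids[0]]
            for c in centroids[1:]:
                if min(math.dist(c, m) for m in merged) > distance_cutoff:
                    merged.append(c)
            centroids = merged
    return centroids
```

Because of the probabilistic rule, two runs over the same stream can produce different numbers of clusters; only the cutoff growth and the overshoot limit are deterministic.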

## BallKMeans step

### Overview
The algorithm is a Lloyd-type algorithm that takes a set of weighted vectors and returns *k* centroids; see [Ostrovsky et al][2] for details. The algorithm has two stages:

 1. Seeding
 2. Ball k-means

The seeding stage makes an initial guess of where the centroids should be; this guess is then refined by the ball k-means stage.

### Parameters

* **numClusters** (int): the number k of centroids to return. The algorithm will return exactly this number of centroids.

* **maxNumIterations** (int): After seeding, the iterative clustering procedure will be run at most *maxNumIterations* times. 1 or 2 iterations are recommended. Increasing beyond this will increase the accuracy of the result at the expense of runtime. Each successive iteration yields diminishing returns in lowering the cost.

* **trimFraction** (double): Outliers are ignored when computing the center of mass of a cluster. For any data point *x*, let *c* be its nearest centroid and let *d* be the minimum distance from *c* to another centroid. If the distance from *x* to *c* is greater than *trimFraction \* d*, then *x* is considered an outlier during that iteration of ball k-means. The default is 9/10. In [Ostrovsky et al][2], the authors use *trimFraction* = 1/3, but this does not mean that 1/3 is optimal in practice.

* **kMeansPlusPlusInit** (boolean): If true, the seeding method is k-means++. If false, the seeding method is to select points uniformly at random. The default is true.

* **correctWeights** (boolean): If *correctWeights* is true, outliers will be considered when calculating the weight of centroids. The default is true. Note that outliers are not considered when calculating the position of centroids.

* **testProbability** (double): If *testProbability* is *p* (0 < *p* < 1), the data (of size *n*) is partitioned into a test set (of size *p\*n*) and a training set (of size *(1-p)\*n*). If 0, no test set is created (the entire data set is used for both training and testing). The default is 0.1 if *numRuns* > 1. If *numRuns* = 1, then no test set should be created (since it is only used to compare the cost between different runs).

* **numRuns** (int): This is the number of runs to perform. The solution of lowest cost is returned. The default is 1 run.

### Algorithm
The algorithm can be instructed to take multiple independent runs (using the *numRuns* parameter) and the algorithm will select the best solution (i.e., the one with the lowest cost). In practice, one run is sufficient to find a good solution.

Each run operates as follows: a seeding procedure selects *k* centroids, and then ball k-means is run iteratively to refine the solution.

The seeding procedure can be set to either 'uniformly at random' or 'k-means++' using the *kMeansPlusPlusInit* boolean parameter. Seeding with k-means++ involves more computation but gives better results in practice.
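For intuition, the k-means++ seeding idea can be sketched as follows (illustrative Python under our own naming, not the Mahout implementation): each subsequent centroid is sampled with probability proportional to its squared distance from the centroids chosen so far.

```python
import math
import random

def kmeans_pp_seed(points, k):
    """Pick k seeds; far-away points are proportionally more likely to be chosen."""
    centroids = [random.choice(points)]
    while len(centroids) < k:
        # squared distance from each point to its nearest chosen centroid
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(random.choices(points, weights=weights)[0])
    return centroids
```

An already-chosen point has weight zero, so it cannot be selected twice; this is what pushes the seeds to spread out across the data.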

Each iteration of ball k-means runs as follows:

1. Clusters are formed by assigning each data point to its nearest centroid.
2. The centers of mass of the trimmed clusters (see the *trimFraction* parameter above) become the new centroids.

The data may be partitioned into a test set and a training set (see *testProbability*). The seeding procedure and ball k-means run on the training set; the cost is computed on the test set.
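One iteration of the two steps above, with trimming, might look like this in sketch form (hedged Python assuming at least two centroids; not the Mahout implementation):

```python
import math

def ball_kmeans_iteration(points, centroids, trim_fraction=0.9):
    """Assign each point to its nearest centroid, then recompute each centroid
    from the points inside its 'ball': distance to the centroid must be at most
    trim_fraction times the distance from that centroid to its nearest neighbor."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        i = dists.index(min(dists))
        d_near = min(math.dist(centroids[i], c)
                     for j, c in enumerate(centroids) if j != i)
        if dists[i] <= trim_fraction * d_near:   # drop outliers from the mean
            counts[i] += 1
            sums[i] = [s + x for s, x in zip(sums[i], p)]
    # empty clusters keep their old centroid
    return [[s / counts[i] for s in sums[i]] if counts[i] else list(centroids[i])
            for i in range(len(centroids))]
```

Calling this once per iteration, up to *maxNumIterations* times, is the refinement loop; the trimming is what makes it "ball" k-means rather than plain Lloyd iteration.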


## Usage of *StreamingKMeans*

    bin/mahout streamingkmeans
      -i <input>
      -o <output>
      -ow
      -k <k>
      -km <estimatedNumMapClusters>
      -e <estimatedDistanceCutoff>
      -mi <maxNumIterations>
      -tf <trimFraction>
      -ri
      -iw
      -testp <testProbability>
      -nbkm <numBallKMeansRuns>
      -dm <distanceMeasure>
      -sc <searcherClass>
      -np <numProjections>
      -s <searchSize>
      -rskm
      -xm <method>
      -h
      --tempDir <tempDir>
      --startPhase <startPhase>
      --endPhase <endPhase>


### Details on Job-Specific Options:

 * `--input (-i) <input>`: Path to the job input directory.
 * `--output (-o) <output>`: The directory pathname for output.
 * `--overwrite (-ow)`: If present, overwrite the output directory before running the job.
 * `--numClusters (-k) <k>`: The k in k-Means. Approximately this many clusters will be generated.
 * `--estimatedNumMapClusters (-km) <estimatedNumMapClusters>`: The estimated number of clusters to use for the Map phase of the job when running StreamingKMeans. This should be around k \* log(n), where k is the final number of clusters and n is the total number of data points to cluster.
 * `--estimatedDistanceCutoff (-e) <estimatedDistanceCutoff>`: The initial estimated distance cutoff between two points for forming new clusters. If no value is given, it is estimated from the data set.
 * `--maxNumIterations (-mi) <maxNumIterations>`: The maximum number of iterations to run for the BallKMeans algorithm used by the reducer. If no value is given, defaults to 10.
 * `--trimFraction (-tf) <trimFraction>`: The 'ball' aspect of ball k-means means that only the points closest to the centroid are actually used for updating: those whose distance to the centroid is within trimFraction \* (distance to the closest other centroid). If no value is given, defaults to 0.9.
 * `--randomInit (-ri)`: Whether to use random initialization of the seed centroids instead of k-means++. k-means++ produces better clusters but takes longer, whereas random initialization is faster but produces worse clusters, tends to fail more often, and needs multiple runs to be comparable to k-means++. If set, uses random initialization.
 * `--ignoreWeights (-iw)`: Whether to skip correcting the weights of the centroids after the clustering is done. The weights end up being wrong because of the trimFraction and possible train/test splits. In some cases, especially in a pipeline, having an accurate count of the weights is useful. If set, the final weights are not corrected.
 * `--testProbability (-testp) <testProbability>`: A double value between 0 and 1 that represents the fraction of points to be used for 'testing' different clustering runs in the final BallKMeans step. If no value is given, defaults to 0.1.
 * `--numBallKMeansRuns (-nbkm) <numBallKMeansRuns>`: Number of BallKMeans runs to use at the end to try to cluster the points. If no value is given, defaults to 4.
 * `--distanceMeasure (-dm) <distanceMeasure>`: The classname of the DistanceMeasure. Default is SquaredEuclidean.
 * `--searcherClass (-sc) <searcherClass>`: The type of searcher to be used when performing nearest neighbor searches. Defaults to ProjectionSearch.
 * `--numProjections (-np) <numProjections>`: The number of projections considered in estimating the distances between vectors. Only used when the searcher requested is either ProjectionSearch or FastProjectionSearch. If no value is given, defaults to 3.
 * `--searchSize (-s) <searchSize>`: In the more efficient searchers (everything other than BruteSearch), not all distances are calculated when determining the nearest neighbors. The number of elements whose distances from the query vector are actually computed is proportional to searchSize. If no value is given, defaults to 1.
 * `--reduceStreamingKMeans (-rskm)`: There might be too many intermediate clusters from the mapper to fit into memory, so the reducer can run another pass of StreamingKMeans to collapse them down to fewer clusters.
 * `--method (-xm) <method>`: The execution method to use: sequential or mapreduce. Default is mapreduce.
 * `--help (-h)`: Print out help.
 * `--tempDir <tempDir>`: Intermediate output directory.
 * `--startPhase <startPhase>`: First phase to run.
 * `--endPhase <endPhase>`: Last phase to run.


## References

1. [M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large Datasets][1]
2. [R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means Problem][2]


[1]: http://nips.cc/Conferences/2011/Program/event.php?ID=2989 "M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large Datasets"

[2]: http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf "R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means Problem"

---
layout: default
title: Viewing Result
theme:
  name: retro-mahout
---

* [Algorithm Viewing pages](#ViewingResultAlgorithmViewingpages)

There are various technologies available to view the output of Mahout
algorithms:

* Clusters

<a name="ViewingResultAlgorithmViewingpages"></a>
# Algorithm Viewing pages

---
layout: default
title: Viewing Results
theme:
  name: retro-mahout
---

<a name="ViewingResultsIntro"></a>
# Intro

Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
sequence files or other data structures. This page is intended to
demonstrate the ways one might inspect the outcome of these jobs.
The page is organized by algorithm.

<a name="ViewingResultsGeneralUtilities"></a>
# General Utilities

<a name="ViewingResultsSequenceFileDumper"></a>
## Sequence File Dumper


<a name="ViewingResultsClustering"></a>
# Clustering

<a name="ViewingResultsClusterDumper"></a>
## Cluster Dumper

Run the following to print out all options:

    java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --help



<a name="ViewingResultsExample"></a>
### Example

    java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper \
      --seqFileDir ./solr-clust-n2/out/clusters-2 \
      --dictionary ./solr-clust-n2/dictionary.txt \
      --substring 100 --pointsDir ./solr-clust-n2/out/points/




<a name="ViewingResultsClusterLabels(MAHOUT163)"></a>
## Cluster Labels (MAHOUT-163)

<a name="ViewingResultsClassification"></a>
# Classification

---
layout: default
title: Visualizing Sample Clusters
theme:
  name: retro-mahout
---


<a name="VisualizingSampleClustersIntroduction"></a>
# Introduction

Mahout provides examples to visualize the sample clusters that get created by
our clustering algorithms. Note that the visualization is done by Swing programs, so you have to be in a window system on the same
machine you run them on, or logged in via a remote desktop.

To visualize the clusters, execute the Java
classes under the *org.apache.mahout.clustering.display* package in the
*mahout-examples* module. The easiest way to achieve this is to [set up Mahout](users/basics/quickstart.html) in your IDE.

<a name="VisualizingSampleClustersVisualizingclusters"></a>
# Visualizing clusters

The following classes in *org.apache.mahout.clustering.display* can be run
without parameters to generate a sample data set and run the reference
clustering implementations over them:

1. **DisplayClustering** - generates 1000 samples from three symmetric
distributions. This is the same data set that is used by the following
clustering programs. It displays the points on a screen and superimposes
the model parameters that were used to generate the points. You can edit
the *generateSamples()* method to change the sample points used by these
programs.
1. **DisplayClustering** - displays initial areas of generated points
1. **DisplayCanopy** - uses Canopy clustering
1. **DisplayKMeans** - uses k-Means clustering
1. **DisplayFuzzyKMeans** - uses Fuzzy k-Means clustering
1. **DisplaySpectralKMeans** - uses the map-reduce Spectral k-Means algorithm

If you are using Eclipse, just right-click on each of the classes mentioned above and choose "Run As Java Application". To run these directly from the command line:

    cd $MAHOUT_HOME/examples
    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering

You can substitute other names above for *DisplayClustering*.


Note that some of these programs display the sample points and then superimpose all of the clusters from each iteration. The last iteration's clusters are in
bold red and the previous several are colored (orange, yellow, green, blue,
magenta) in order after which all earlier clusters are in light grey. This
helps to visualize how the clusters converge upon a solution over multiple
iterations.

---
layout: default
title: MR - Map Reduce
theme:
  name: retro-mahout
---


MapReduce is a framework for processing huge datasets on certain
kinds of distributable problems using a large number of computers (nodes),
collectively referred to as a cluster. Computational processing
can occur on data stored either in a filesystem (unstructured) or within a
database (structured).

Also written M/R.


## See Also
* [http://wiki.apache.org/hadoop/HadoopMapReduce](http://wiki.apache.org/hadoop/HadoopMapReduce)
* [http://en.wikipedia.org/wiki/MapReduce](http://en.wikipedia.org/wiki/MapReduce)
