Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: Dirichlet Process Clustering (http://cwiki.apache.org/confluence/display/MAHOUT/Dirichlet+Process+Clustering)
Change Comment:

Updated Dirichlet to reflect recent changes in implementation
Edited by Jeff Eastman:

h1. Overview
The Dirichlet Process Clustering algorithm performs Bayesian mixture modeling.
The idea is that we use a probabilistic mixture of a number of models that we use to explain
some observed data. Each observed data point is assumed to have come from one of the models
in the mixture, but we don't know which. The way we deal with that is to use a socalled
latent parameter which specifies which model each data point came from.
In addition, since this is a Bayesian clustering algorithm, we don't want to actually commit
to any single explanation, but rather to sample from the distribution of models and latent
assignments of data points to models given the observed data and the prior distributions of
model parameters. This sampling process is initialized by taking models at random from the
prior distribution for models.
Then, we iteratively assign points to the different models using the mixture probabilities
and the degree of fit between the point and each model expressed as a probability that the
point was generated by that model. After points are assigned, new parameters for each model
are sampled from the posterior distribution for the model parameters considering all of the
observed data points that were assigned to the model. Models without any data points are
also sampled, but since they have no points assigned, the new samples are effectively taken
from the prior distribution for model parameters.
The result is a number of samples that represent mixing probabilities, models and assignment
of points to models. If the total number of possible models is substantially larger than the
number that ever have points assigned to them, then this algorithm provides a (nearly) nonparametric
clustering algorithm. These samples can give us interesting information that is lacking from
a normal clustering that consists of a single assignment of points to clusters. Firstly,
by examining the number of models in each sample that actually has any points assigned to
it, we can get information about how many models (clusters) that the data support. Morevoer,
by examining how often two points are assigned to the same model, we can get an approximate
measure of how likely these points are to be explained by the same model. Such soft membership
information is difficult to come by with conventional clustering methods.
Finally, we can get an idea of the stability of how the data can be described. Typically,
aspects of the data with lots of data available wind up with stable descriptions while at
the edges, there are aspects that are phenomena that we can't really commit to a solid description,
but it is still clear that the well supported explanations are insufficient to explain these
additional aspects. One thing that can be difficult about these samples is that we can't always
assign a correlation between the models in the different samples. Probably the best way to
do this is to look for overlap in the assignments of data observations to the different models.
h2. Design of Implementation
The implementation accepts one input directory containing the data points to be clustered.
The data directory contains multiple input files of SequenceFile(key, VectorWritable). The
input data points are not modified by the implementation, allowing experimentation with initial
clustering and convergence values.
The program iterates over the input points, outputting a new directory "clustersN" containing
SequenceFile(Text, DirichletCluster) files for each iteration N. This process uses a mapper/reducer/driver
as follows:
DirichletMapper  reads the input clusters during its configure() method, then assigns and
outputs each input point to a probable cluster as defined by the model's pdf() function. Output
_key_ is: clusterId. Output _value_ is: input point.
DirichletReducer  reads the input clusters during its configure() method, then each reducer
receives clusterId:VectorWritable pairs from all mappers and accumulates them to produce a
new posterior model for each cluster which is output. Output _key_ is: clusterId. Output value
is: DirichletCluster. Reducer outputs are used as the input clusters for the next iteration.
DirichletDriver  iterates over the points and clusters until the given number of iterations
has been reached. During iterations, a new clusters directory "clustersN" is produced with
the output clusters from the previous iteration used for input to the next. A final optional
pass over the data using the DirichletClusterMapper clusters all points to an output directory
"clusteredPoints" and has no combiner or reducer steps.
h2. Running Dirichlet Process Clustering
The Dirichlet clustering algorithm may be run using a commandline invocation on DirichletDriver.main
or by making a Java call to DirichletDriver.runJob(). Both require several arguments:
# input: a file path string to a directory containing the input data set a SequenceFile(WritableComparable,
VectorWritable). The sequence file _key_ is not used.
# output: a file path string to an empty directory which is used for all output from the algorithm.
# modelFactory: the fullyqualified class name of an instance of ModelDistribution which will
be used for the clustering.
# modelPrototype: the fullyqualified class name of an instance of Vector which will be used
by the models during the clustering.
# prototypeSize: the size of the modelPrototype (could possibly be deduced from the input
data but is not currently)
# numClusters: the number of models to be used for the clustering. This should be larger than
the number of clusters which is expected in the data set.
# maxIterations: the number of iterations to run for the clustering.
# alpha_0: a double value (default is 1.0) used for creating the DirichletDistribution. Influences
the likelihood that new, empty clusters will be selected for assignment in the first iteration.
# numReducers: the number of reducers desired. This can be an integer from 1 to numClusters.
# runClustering: a boolean indicating, if true, that the clustering step is to be executed
after clusters have been determined.
# emitMostLikely: a boolean indicating, if true, that the clustering step should only emit
the most likely cluster for each clustered point.
# threshold: a double indicating, if emitMostLikely is false, the cluster probability threshold
used for emitting multiple clusters for each point. A value of 0 will emit all clusters with
their associated probabilities for each vector.
After running the algorithm, the output directory will contain:
# clustersN: directories containing SequenceFiles(Text, DirichletCluster) produced by the
algorithm for each iteration. The Text _key_ is a cluster identifier string.
# clusteredPoints: (if runClustering enabled) a directory containing SequenceFile(IntWritable,
WeightedVectorWritable). The IntWritable _key_ is the clusterId. The WeightedVectorWritable
_value_ is a bean containing a double _weight_ and a VectorWritable _vector_ where the weight
indicates the probability that the vector is a member of the cluster.
h1. Examples
The following images illustrate three different prior models applied to a set of randomlygenerated
2d data points. The points are generated using a normal distribution centered at a mean location
and with a constant standard deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/dirichlet/README.txthttp://svn.apache.org/repos/asf/lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/dirichlet/README.txt]
for details on running similar examples.
The points are generated as follows:
* 40 samples m=\[1.0, 1.0] sd=3.0
* 30 samples m=\[1.0, 0.0] sd=0.5
* 30 samples m=\[0.0, 2.0] sd=0.1
In the first image, the points are plotted and the 3sigma boundaries of their generator are
superimposed. It is, of course, impossible to tell which model actually generated each point
as there is some probability  perhaps small  that any of the models could have generated
every point.
!DirichletData.png!
In the next image, the Dirichlet Process Clusterer is run against the sample points using
a NormalModelDistribution with m=\[0.0, 0.0] sd=1.0. This distribution represents the least
amount of prior information, as its sampled models all have constant parameters. The resulting
significant models (representing > 5% of the population) are superimposed upon the sample
data. Since all prior models are identical and their pdfs are the same, the first iteration's
assignment of points to models is completely governed by the initial mixture values. Since
these are also identical, it means the first iteration assigns points to models at random.
During subsequent iterations, the models diverge from the origin but there is some overfitting
in the result.
!DirichletN.png!
The next image improves upon this situation by using a SampledNormalDistribution. In this
distribution, the prior models have means that are sampled from a normal distribution and
all have a constant sd=1. This distribution creates initial models that are centered at different
coordinates. During the first iteration, each model thus has a different pdf for each point
and the iteration assigns points to the morelikely models given this value. The result is
a very good capture of the sample data parameters.
!DirichletSN.png!
The next image uses an AsymmetricSampledNormalDistribution in which the model's standard deviation
is also represented as a 2d vector. This causes the clusters to assume elliptical shapes
in the resulting clustering. This represents an incorrect prior assumption but it is interesting
that it fits the actual sample data quite well. Had we suspected the sample points were generated
in a similar manner then this distribution would have been the most logical model.
!DirichletASN.png!
In order to explore an asymmetrical sample data distribution, the following image shows a
larger number of points generated according to the following parameters. Again, the generator's
3sigma ellipses are superimposed:
* 400 samples m=\[1.0, 1.0] sd=\[3.0, 1.0]
* 300 samples m=\[1.0, 0.0] sd=\[0.5, 1.0]
* 300 samples m=\[0.0, 2.0] sd=\[0.1, 0.5]
!2dDirichletData.png!
The following image shows the results of applying the symmetrical SampledNormalDistribution
to the asymmetricallygenerated sample data. It does a valiant effort but does not capture
a very good set of models because the prior assumption is just wrong.
!2dDirichletSN.png!
Finally, the AsymmetricSampledNormalDistribution is run against the asymmetrical sample data.
Though there is some overfitting, it does a credible job of capturing the underlying models.
I suspect the model overfitted because its prior assumption of sd=1 is too low given the
std values used to generate the sample data. Of course, this can only be suspected because
I know the initial distributions. Without this knowledge we would have to take the clustering
analysis at face value. Nonetheless, if we suspected overfitting we might run the analysis
again with a random seed for the Random generator to see what other clusterings were possible
from the data. We might like one of those even better than this.
!2dDirichletASN.png!
h1. References
McCullagh and Yang: http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf
There is also a more approachable example in [Chris Bishop's book on Machine Learning http://research.microsoft.com/enus/um/people/cmbishop/PRML/index.htm].
I think that chapter 9 is where the example of clustering using a mixture model is found.
The Neal and Blei references from the McCullagh and Yang paper are also good. Zoubin Gharamani
has some very [nice tutorials out which describe why nonparametric Bayesian approaches to
problems are very coolhttp://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorialb.pdf], there
are video versions about as well.
Change your notification preferences: http://cwiki.apache.org/confluence/users/viewnotifications.action
