I'd like to personally thank Ted Dunning for guiding me down this path,
I could not have done it alone. These words are mostly his, but I now
think I can speak them myself with some confidence. It has been a most
amazing journey. Now, on to a Hadoop implementation...
Jeff
confluence@apache.org wrote:
> Dirichlet Process Clustering (MAHOUT) created by Jeff Eastman
> http://cwiki.apache.org/confluence/display/MAHOUT/Dirichlet+Process+Clustering
>
> Content:
> 
>
> The Dirichlet Process Clustering algorithm performs Bayesian mixture modeling.
>
> The idea is that we use a probabilistic mixture of a number of models to explain some
> observed data. Each observed data point is assumed to have come from one of the models
> in the mixture, but we don't know which. We deal with that by introducing a so-called
> latent parameter which specifies which model each data point came from.
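For concreteness, a tiny Python sketch of that generative view, with two made-up Gaussian models (the models, mixing weights, and names are illustrative assumptions, not anything from Mahout):

```python
import random

random.seed(42)
models = [(0.0, 1.0), (10.0, 1.0)]   # (mean, stddev) for each model
mixing = [0.3, 0.7]                  # mixture probabilities

points, latent = [], []
for _ in range(5):
    # The latent parameter z picks which model generates this point.
    z = 0 if random.random() < mixing[0] else 1
    mu, sigma = models[z]
    points.append(random.gauss(mu, sigma))
    latent.append(z)

# In clustering we observe only `points`; the latent `z` values are
# exactly what the algorithm must infer.
print(points, latent)
```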
>
>
> In addition, since this is a Bayesian clustering algorithm, we don't want to commit to
> any single explanation, but rather to sample from the distribution of models and latent
> assignments of points to models, given the observed data and the prior distributions of
> model parameters. This sampling process is initialized by drawing models at random from
> the prior distribution for models.
>
> Then, we iteratively assign points to the different models using the mixing probabilities
> and the degree of fit between each point and each model, expressed as the probability that
> the point was generated by that model. After points are assigned, new parameters for each
> model are sampled from the posterior distribution for the model parameters given all of the
> observed data points assigned to that model. Models without any data points are also
> resampled, but since they have no points assigned, the new samples are effectively drawn
> from the prior distribution for model parameters.
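A rough Python sketch of that sampling loop, assuming unit-variance Gaussian models with a broad Gaussian prior on their means; the variable names, prior, and update rules are illustrative assumptions, not Mahout's actual implementation:

```python
import math
import random

random.seed(0)
# Observed data: two well-separated groups.
data = [random.gauss(0, 1) for _ in range(20)] + \
       [random.gauss(8, 1) for _ in range(20)]
K = 5                                            # total candidate models
means = [random.gauss(0, 5) for _ in range(K)]   # initial draws from the prior
mixing = [1.0 / K] * K

def likelihood(x, mu):
    # Gaussian density kernel with variance 1 (normalizer cancels out).
    return math.exp(-0.5 * (x - mu) ** 2)

for _ in range(10):                              # sampling iterations
    # Assign each point to a model in proportion to mixing * fit.
    assignments = []
    for x in data:
        w = [mixing[k] * likelihood(x, means[k]) for k in range(K)]
        assignments.append(random.choices(range(K), weights=w)[0])
    # Resample each model's mean from its posterior; empty models
    # effectively get a fresh draw from the prior.
    for k in range(K):
        pts = [x for x, z in zip(data, assignments) if z == k]
        if pts:
            n = len(pts)
            prec = n + 1.0 / 25                  # prior variance 25
            post_mean = sum(pts) / prec
            means[k] = random.gauss(post_mean, math.sqrt(1.0 / prec))
        else:
            means[k] = random.gauss(0, 5)        # back to the prior
    # Smoothed update of the mixing probabilities from the counts.
    counts = [assignments.count(k) for k in range(K)]
    mixing = [(c + 1) / (len(data) + K) for c in counts]
```

After a few iterations the occupied models drift toward the two data groups, while the unoccupied ones keep wandering under the prior.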
>
> The result is a number of samples that represent mixing probabilities, models, and
> assignments of points to models. If the total number of possible models is substantially
> larger than the number that ever have points assigned to them, then this algorithm provides
> a (nearly) non-parametric clustering algorithm. These samples give us interesting information
> that is lacking from a normal clustering, which consists of a single assignment of points to
> clusters. First, by examining the number of models in each sample that actually have any
> points assigned to them, we can learn how many models (clusters) the data support. Moreover,
> by examining how often two points are assigned to the same model, we get an approximate
> measure of how likely those points are to be explained by the same model. Such soft
> membership information is difficult to come by with conventional clustering methods.
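The co-assignment idea can be sketched in a few lines of Python; the assignment vectors here are made up purely for illustration:

```python
# Each sample maps point index -> model index for four points.
samples = [
    [0, 0, 1, 1],   # sample 1
    [2, 2, 1, 1],   # sample 2 (labels differ, structure agrees)
    [0, 0, 1, 1],   # sample 3
]

def co_assignment(i, j):
    """Fraction of samples in which points i and j share a model."""
    same = sum(1 for s in samples if s[i] == s[j])
    return same / len(samples)

print(co_assignment(0, 1))  # points 0 and 1 always share a model -> 1.0
print(co_assignment(0, 2))  # points 0 and 2 never do -> 0.0
```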
>
> Finally, we can get an idea of the stability of how the data can be described. Typically,
> aspects of the data with lots of supporting observations wind up with stable descriptions,
> while at the edges there are phenomena we can't really commit to a solid description of,
> even though it is clear that the well-supported explanations are insufficient to explain
> them. One difficulty with these samples is that we can't always establish a correspondence
> between the models in the different samples. Probably the best way to do this is to look
> for overlap in the assignments of data observations to the different models.
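Matching models across samples by overlap can also be sketched briefly; the two assignment vectors are invented for illustration, and the point is that model labels are arbitrary per sample, so we match by shared membership rather than by label:

```python
a = [0, 0, 0, 1, 1, 2]   # sample A: point index -> model index
b = [5, 5, 5, 3, 3, 3]   # sample B: same points, different labels

def overlap(z1, z2, m1, m2):
    """Count points assigned to model m1 in z1 and to model m2 in z2."""
    return sum(1 for x, y in zip(z1, z2) if x == m1 and y == m2)

# Model 0 in sample A and model 5 in sample B cover the same three
# points, so they likely describe the same cluster.
print(overlap(a, b, 0, 5))  # -> 3
```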
>
>
> 
> CONFLUENCE INFORMATION
> This message is automatically generated by Confluence
>
> Unsubscribe or edit your notifications preferences
> http://cwiki.apache.org/confluence/users/viewnotifications.action
>
> If you think it was sent incorrectly contact one of the administrators
> http://cwiki.apache.org/confluence/administrators.action
>
> If you want more information on Confluence, or have a bug to report see
> http://www.atlassian.com/software/confluence
>
