Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: ClusteringYourData (http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData)
Edited by Isabel Drost:

+*Mahout_0.2*+
After you've done the [QuickStart] and are familiar with the basics of Mahout, it is time
to cluster your own data.
The following pieces *may* be useful in getting started:
h1. Input
For starters, you will need your data in an appropriate Vector format (which has changed since
Mahout 0.1).
* See [Creating Vectors]
h2. Text Preparation
* See [Creating Vectors from Text]
* http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
h1. Running the Process
h2. Canopy
Background: [Canopy Clustering|canopy]
Documentation of running canopy from the command line: [canopycommandline] +*TODO*+
h2. kMeans
Background: [kmeans]
Documentation of running kMeans from the command line: [kmeanscommandline] +*TODO*+
Documentation of running fuzzy kMeans from the command line: [fuzzykmeanscommandline]
h2. Dirichlet
Background: [Dirichlet Process Clustering|dirichlet]
Documentation of running dirichlet from the command line: [dirichletcommandline]
h2. Meanshift
Background: [Mean Shift|meanshift]
Documentation of running mean shift from the command line: [meanshiftcommandline] +*TODO*+
h2. Latent Dirichlet Allocation
Background and documentation: [Latent Dirichlet Allocation|LDA]
h1. Retrieving the Output
+*TODO*+
h1. Validating the Output
From Ted Dunning's response at http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
{quote}
A principled approach to cluster evaluation is to measure how well the cluster membership
captures the structure of unseen data. A natural measure for this is to measure how much
of the entropy of the data is captured by cluster membership. For kmeans and its natural
L_2 metric, the natural cluster quality metric is the squared distance from the nearest centroid
adjusted by the log_2 of the number of clusters. This can be compared to the squared magnitude
of the original data or the squared deviation from the centroid for all of the data. The
idea is that you are changing the representation of the data by allocating some of the bits
in your original representation to represent which cluster each point is in. If those bits
aren't made up by the residue being small then your clustering is making a bad tradeoff.
In the past, I have used other more heuristic measures as well. One of the key characteristics
that I would like to see out of a clustering is a degree of stability. Thus, I look at the
fractions of points that are assigned to each cluster or the distribution of distances from
the cluster centroid. These values should be relatively stable when applied to held-out data.
For text, you can actually compute perplexity which measures how well cluster membership predicts
what words are used. This is nice because you don't have to worry about the entropy of real
valued numbers.
Manual inspection and the so-called laugh test is also important. The idea is that the results
should not be so ludicrous as to make you laugh. Unfortunately, it is pretty easy to kid yourself
into thinking your system is working using this kind of inspection. The problem is that we
are too good at seeing (making up) patterns.
{quote}
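The entropy-based check described above can be sketched numerically. The sketch below is illustrative, not part of Mahout: it compares the residual squared distance to the nearest centroid, plus log_2(k) bits per point to record cluster membership, against the squared deviation from a single global centroid on held-out data. All function names and the toy data are assumptions made for this example.

```python
import math
import random

def sq_dist(a, b):
    # Squared Euclidean (L_2) distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid_cost(points, centroids):
    # Sum of squared distances from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def clustering_cost_bits(points, centroids):
    # Residual cost plus log_2(k) bits per point to record which
    # cluster each point belongs to.
    k = len(centroids)
    return nearest_centroid_cost(points, centroids) + len(points) * math.log2(k)

def baseline_cost(points):
    # Squared deviation from the single global centroid (no clustering).
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return sum(sq_dist(p, mean) for p in points)

# Two well-separated toy clusters standing in for held-out data.
random.seed(42)
held_out = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)] + \
           [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(100)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

# A good clustering should beat the no-cluster baseline even after
# paying log_2(k) bits per point for membership.
print(clustering_cost_bits(held_out, centroids) < baseline_cost(held_out))
```

If the clustered cost does not come in below the baseline, the clustering is making the bad tradeoff the quote describes: the membership bits are not paid for by a smaller residue.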
h1. References
* [Mahout archive references|http://www.lucidimagination.com/search/p:mahout?q=clustering]
