Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Fuzzy KMeans (https://cwiki.apache.org/confluence/display/MAHOUT/Fuzzy+KMeans)
Edited by Jeff Eastman:

Fuzzy KMeans (also called Fuzzy CMeans) is an extension of [KMeanshttp://cwiki.apache.org/MAHOUT/kmeans.html],
the popular simple clustering technique. While KMeans discovers hard clusters (a point belong
to only one cluster), Fuzzy KMeans is a more statistically formalized method and discovers
soft clusters where a particular point can belong to more than one cluster with certain probability.
h4. Algorithm
Like KMeans, Fuzzy KMeans works on those objects which can be represented in ndimensional
vector space and a distance measure is defined.
The algorithm is similar to kmeans.
* Initialize k clusters
* Until converged
** Compute the probability of a point belong to a cluster for every <point,cluster>
pair
** Recompute the cluster centers using above probability membership values of points to clusters
h4. Design Implementation
The design is similar to KMeans present in Mahout. It accepts an input file containing vector
points. User can either provide the cluster centers as input or can allow canopy algorithm
to run and create initial clusters.
Similar to KMeans, the program doesn't modify the input directories. And for every iteration,
the cluster output is stored in a directory clusterN. The code has set number of reduce tasks
equal to number of map tasks. So, those many part0
\\
\* files are created in clusterN directory. The code uses driver/mapper/combiner/reducer as
follows:
FuzzyKMeansDriver  This is similar to KMeansDriver. It iterates over input points
and cluster points for specified number of iterations or until it is converged.During every
iteration i, a new clusteri directory is created which contains the modified cluster centers
obtained during FuzzyKMeans iteration. This will be feeded as input clusters in the next iteration.
Once Fuzzy KMeans is run for specified number of iterations or until it is converged, a map
task is run to output "the point and the cluster membership to each cluster" pair as final
output to a directory named "points".
FuzzyKMeansMapper  reads the input cluster during its configure() method, then
computes cluster membership probability of a point to each cluster.Cluster membership is inversely
propotional to the distance. Distance is computed using user supplied distance measure.
Output key is encoded cluster. Output values is. <probabilityValue>:<inputPoint>
FuzzyKMeansCombiner  receives all key:value pairs from the mapper and produces partial sums
of the cluster membership probability times input vectors for each cluster. Output key is:
encoded cluster. Output value is "<sum of cluster membership values in the partial sum>,
<partial sum vector summing all such points>".
FuzzyKMeansReducer  Multiple reducers receives certain keys and all values associated with
those keys. The reducer sums the values to produce a new centroid for the cluster which is
output. Output key is: encoded cluster identifier (e.g. "C14". Output value is: formatted
cluster (e.g. "C14  \[c1, c2, ..., cn, \]). The reducer encodes unconverged clusters with
a 'Cn' cluster Id and converged clusters with 'Vn' clusterId.
h2. Running Fuzzy kMeans Clustering
The Fuzzy kMeans clustering algorithm may be run using a commandline invocation on FuzzyKMeansDriver.main
or by making a Java call to FuzzyKMeansDriver.runJob(). Both require several arguments:
# input: a file path string to a directory containing the input data set a SequenceFile(WritableComparable,
VectorWritable). The sequence file _key_ is not used.
# clustersIn: a file path string to a directory containing the initial clusters, a SequenceFile(key,
SoftCluster  Cluster  Canopy). Fuzzy kMeans SoftClusters, kMeans Clusters and Canopy Canopies
may be used for the initial clusters.
# output: a file path string to an empty directory which is used for all output from the algorithm.
# measure: the fullyqualified class name of an instance of DistanceMeasure which will be
used for the clustering.
# convergence: a double value used to determine if the algorithm has converged (clusters have
not moved more than the value in the last iteration)
# maxiterations: the maximum number of iterations to run, independent of the convergence
specified
# nummappers: the number of mapper tasks to be launched.
# numreducers: the number of reducer tasks to be launched. Each reducer will process a subset
of the clusters, in the limit, one per cluster.
# m: the "fuzzyness" argument, a double > 1. For m equal to 2, this is equivalent to normalising
the coefficient linearly to make their sum 1. When m is close to 1, then the cluster center
closest to the point is given much more weight than the others, and the algorithm is similar
to kmeans.
# runClustering: a boolean indicating, if true, that the clustering step is to be executed
after clusters have been determined.
# emitMostLikely: a boolean indicating, if true, that the clustering step should only emit
the most likely cluster for each clustered point.
# threshold: a double indicating, if emitMostLikely is false, the cluster probability threshold
used for emitting multiple clusters for each point. A value of 0 will emit all clusters with
their associated probabilities for each vector.
After running the algorithm, the output directory will contain:
# clustersN: directories containing SequenceFiles(Text, SoftCluster) produced by the algorithm
for each iteration. The Text _key_ is a cluster identifier string.
# clusteredPoints: (if runClustering enabled) a directory containing SequenceFile(IntWritable,
WeightedVectorWritable). The IntWritable _key_ is the clusterId. The WeightedVectorWritable
_value_ is a bean containing a double _weight_ and a VectorWritable _vector_ where the weight
indicates the probability that the vector is a member of the cluster.
h1. Examples
The following images illustrate Fuzzy kMeans clustering applied to a set of randomlygenerated
2d data points. The points are generated using a normal distribution centered at a mean location
and with a constant standard deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/dirichlet/README.txthttp://svn.apache.org/repos/asf/lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/dirichlet/README.txt]
for details on running similar examples.
The points are generated as follows:
* 500 samples m=\[1.0, 1.0\] sd=3.0
* 300 samples m=\[1.0, 0.0\] sd=0.5
* 300 samples m=\[0.0, 2.0\] sd=0.1
In the first image, the points are plotted and the 3sigma boundaries of their generator are
superimposed.
!SampleData.png!
In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data.
As Fuzzy kMeans is an iterative algorithm, the centers of the clusters in each recent iteration
are shown using different colors. Bold red is the final clustering and previous iterations
are shown in \[orange, yellow, green, blue, violet and gray\]. Although it misses a lot of
the points and cannot capture the original, superimposed cluster centers, it does a decent
job of clustering this data.
!FuzzyKMeans.png!
The third image shows the results of running Fuzzy kMeans on a different data set (see [Dirichlet
Process Clustering] for details) which is generated using asymmetrical standard deviations.
Fuzzy kMeans does a fair job handling this data set as well.
!2dFuzzyKMeans.png!
h4. References
* [http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_cmeans_clustering]
Change your notification preferences: https://cwiki.apache.org/confluence/users/viewnotifications.action
