Fuzzy KMeans (MAHOUT) edited by Jeff Eastman
Page: http://cwiki.apache.org/confluence/display/MAHOUT/Fuzzy+KMeans
Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=95315&originalVersion=7&revisedVersion=8
Content:

Fuzzy KMeans (also called Fuzzy CMeans) is an extension of [KMeanshttp://cwiki.apache.org/MAHOUT/kmeans.html],
the popular simple clustering technique. While KMeans discovers hard clusters (a point belong
to only one cluster), Fuzzy KMeans is a more statistically formalized method and discovers
soft clusters where a particular point can belong to more than one cluster with certain probability.
h4. Algorithm
Like KMeans, Fuzzy KMeans works on those objects which can be represented in ndimensional
vector space and a distance measure is defined.
The algorithm is similar to kmeans.
* Initialize k clusters
* Until converged
** Compute the probability of a point belong to a cluster for every <point,cluster>
pair
** Recompute the cluster centers using above probability membership values of points to clusters
h4. Design Implementation
The design is similar to KMeans present in Mahout. It accepts an input file containing vector
points. User can either provide the cluster centers as input or can allow canopy algorithm
to run and create initial clusters.
Similar to KMeans, the program doesn't modify the input directories. And for every iteration,
the cluster output is stored in a directory clusterN. The code has set number of reduce tasks
equal to number of map tasks. So, those many part0
\\
\* files are created in clusterN directory. The code uses driver/mapper/combiner/reducer as
follows:
FuzzyKMeansDriver  This is similar to KMeansDriver. It iterates over input points
and cluster points for specified number of iterations or until it is converged.During every
iteration i, a new clusteri directory is created which contains the modified cluster centers
obtained during FuzzyKMeans iteration. This will be feeded as input clusters in the next iteration.
Once Fuzzy KMeans is run for specified number of iterations or until it is converged, a map
task is run to output "the point and the cluster membership to each cluster" pair as final
output to a directory named "points".
FuzzyKMeansMapper  reads the input cluster during its configure() method, then
computes cluster membership probability of a point to each cluster.Cluster membership is inversely
propotional to the distance. Distance is computed using user supplied distance measure.
Output key is encoded cluster. Output values is. <probabilityValue>:<inputPoint>
FuzzyKMeansCombiner  receives all key:value pairs from the mapper and produces partial sums
of the cluster membership probability times input vectors for each cluster. Output key is:
encoded cluster. Output value is "<sum of cluster membership values in the partial sum>,
<partial sum vector summing all such points>".
FuzzyKMeansReducer  Multiple reducers receives certain keys and all values associated with
those keys. The reducer sums the values to produce a new centroid for the cluster which is
output. Output key is: encoded cluster identifier (e.g. "C14". Output value is: formatted
cluster (e.g. "C14  \[c1, c2, ..., cn, \]). The reducer encodes unconverged clusters with
a 'Cn' cluster Id and converged clusters with 'Vn' clusterId.
h4. References
* [http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_cmeans_clustering]

CONFLUENCE INFORMATION
This message is automatically generated by Confluence
Unsubscribe or edit your notifications preferences
http://cwiki.apache.org/confluence/users/viewnotifications.action
If you think it was sent incorrectly contact one of the administrators
http://cwiki.apache.org/confluence/administrators.action
If you want more information on Confluence, or have a bug to report see
http://www.atlassian.com/software/confluence
