mahout-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject [CONF] Apache Lucene Mahout: Fuzzy K-Means (page edited)
Date Mon, 11 May 2009 07:40:00 GMT
Fuzzy K-Means (MAHOUT) edited by Jeff Eastman


Fuzzy K-Means (also called Fuzzy C-Means) is an extension of [K-Means|],
the popular simple clustering technique. While K-Means discovers hard clusters (a point belong
to only one cluster), Fuzzy K-Means is a more statistically formalized method and discovers
soft clusters where a particular point can belong to more than one cluster with certain probability.

h4. Algorithm

Like K-Means, Fuzzy K-Means works on those objects which can be represented in n-dimensional
vector space and a distance measure is defined.
The algorithm is similar to k-means.
* Initialize k clusters
* Until converged
** Compute the probability of a point belong to a cluster for every <point,cluster>
** Recompute the cluster centers using above probability membership values of points to clusters

h4. Design Implementation

The design is similar to K-Means present in Mahout. It accepts an input file containing vector
points. User can either provide the cluster centers as input or can allow canopy algorithm
to run and create initial clusters.

Similar to K-Means, the program doesn't modify the input directories. And for every iteration,
the cluster output is stored in a directory cluster-N. The code has set number of reduce tasks
equal to number of map tasks. So, those many part-0
\* files are created in clusterN directory. The code uses driver/mapper/combiner/reducer as

FuzzyKMeansDriver - This is similar to&nbsp; KMeansDriver. It iterates over input points
and cluster points for specified number of iterations or until it is converged.During every
iteration i, a new cluster-i directory is created which contains the modified cluster centers
obtained during FuzzyKMeans iteration. This will be feeded as input clusters in the next iteration.&nbsp;
Once Fuzzy KMeans is run for specified number of iterations or until it is converged, a map
task is run to output "the point and the cluster membership to each cluster" pair as final
output to a directory named "points".

FuzzyKMeansMapper - reads the input cluster during its configure() method, then&nbsp;
computes cluster membership probability of a point to each cluster.Cluster membership is inversely
propotional to the distance. Distance is computed using&nbsp; user supplied distance measure.
Output key is encoded cluster. Output values is. <probabilityValue>:<inputPoint>

FuzzyKMeansCombiner - receives all key:value pairs from the mapper and produces partial sums
of the cluster membership probability times input vectors for each cluster. Output key is:
encoded cluster. Output value is "<sum of cluster membership values in the partial sum>,
<partial sum vector summing all such points>".

FuzzyKMeansReducer - Multiple reducers receives certain keys and all values associated with
those keys. The reducer sums the values to produce a new centroid for the cluster which is
output. Output key is: encoded cluster identifier (e.g. "C14". Output value is: formatted
cluster (e.g. "C14 - \[c1, c2, ..., cn, \]). The reducer encodes unconverged clusters with
a 'Cn' cluster Id and converged clusters with 'Vn' clusterId.

h4. References&nbsp;

* []

This message is automatically generated by Confluence

Unsubscribe or edit your notifications preferences

If you think it was sent incorrectly contact one of the administrators

If you want more information on Confluence, or have a bug to report see

View raw message