mahout-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout: Mean Shift (page created)
Date Wed, 02 Apr 2008 05:23:00 GMT
Mean Shift (MAHOUT) created by Jeff Eastman
   http://cwiki.apache.org/confluence/display/MAHOUT/Mean+Shift

Content:
---------------------------------------------------------------------

h1. Mean Shift Clustering

"Mean Shift: A Robust Approach to Feature Space Analysis" (http://www.caip.rutgers.edu/riul/research/papers/pdf/mnshft.pdf)
introduces the geneology of the mean shift custering procedure which dates back to work in
pattern recognition in 1975. The paper contains a detailed derivation and several examples
of the use of mean shift for image smooting and segmentation. "Mean Shift Clustering" (http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf)
presents an overview of the algorithm with a summary of the derivation. An attractive feature
of mean shift clustering is that it does not require a-priori knowledge of the number of clusters
(as required in k-means) and it will produce arbitrarily-shaped clusters that depend upon
the topology of the data (unlike canopy).

The algorithm begins with a set of datapoints, and creates a fixed-radius window for each.
It then iterates over each window, calculating a mean shift vector which points in the direction
of the maximum increase in the local density function. Then each window is migrated to a new
position via the vector and the iteration resumes. Iterations complete when each window has
reached a local maximum in the density function and the vector becomes negligable.

h2. Reference Implementation

The implementation introduced by MAHOUT-15 uses modified Canopy Clustering canopies to represent
the mean shift windows. 
* It uses the canopy's T1 distance threshold as the radius of the window, and the canopy's
T2 threshold to decide when two canopies have converged and will thereby follow the same path.

* Each canopy contains one or more bound points which are represented using Vectors. 
* The algorithm is initialized with a canopy containing each input point. 
* During each iteration, every canopy calculates its mean shift vector by summing the canopies
within its T1 threshold. 
** The centers are weighted in proportion to their numbers of bound points (weighted pair-group
centroid)
* If any canopies are within their T2 thresholds they are merged and their respective bound
points are accumulated. 
* The iterations complete when each canopy's mean shift vector has a magnitude less than a
given termination delta. 
* Upon termination, the remaining canopies contain sets of points which are the members of
their cluster.

h2. Map/Reduce Implementation

* Each mapper receives a subset of the canopies for each iteration. It compares each canopy
with each one it has already seen and performs the T1 and T2 distance tests using an arbitrary
user-supplied DistanceMeasure. It outputs three different types of records based upon these
comparisons:
** When a canopy distance measure is less than T1, the center of each canopy and its number
of bound points are output to the combiner, keyed by each respective canopyId
** When the measure is less than T2, a merge record is output to the combiner which contains
its bound points
** If no merge record emitted, an init record is output to the combiner which contains the
canopy's complete state
 * The combiner receives records for each point and summarizes them. It moves the canopy to
its new centroid position and outputs the canopy to the reducer with a constant key
 * A single reducer coalesces all the canopies from the combiners by performing another clustering
iteration on them.
 * A driver class performs a single iteration over the input canopies and produces an output
set
 * A job class manages the iteration and determines when either a maximum number of iterations
occur or the termination criteria is reached. 
** It is possible, but not implemented, to execute the driver with no reducer. This would
cause the mapper/combiner groups to reduce the overall canopy population through merging before
sending them all to a single reducer.

---------------------------------------------------------------------
CONFLUENCE INFORMATION
This message is automatically generated by Confluence

Unsubscribe or edit your notifications preferences
   http://cwiki.apache.org/confluence/users/viewnotifications.action

If you think it was sent incorrectly contact one of the administrators
   http://cwiki.apache.org/confluence/administrators.action

If you want more information on Confluence, or have a bug to report see
   http://www.atlassian.com/software/confluence



Mime
View raw message