mahout-commits mailing list archives
From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout: Canopy Clustering (page edited)
Date Tue, 12 Feb 2008 01:29:00 GMT
Canopy Clustering (MAHOUT) edited by Jeff Eastman
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=75998&originalVersion=4&revisedVersion=5

Content:
---------------------------------------------------------------------

h1. Canopy Clustering

[Canopy Clustering|http://www.kamalnigam.com/papers/canopy-kdd00.pdf] is a very simple, fast
and surprisingly accurate method for grouping objects into clusters. All objects are represented
as points in a multidimensional feature space. The algorithm uses a fast approximate distance
metric and two distance thresholds T1 > T2 for processing. The basic algorithm begins with a
set of points, removes one at random and creates a Canopy containing that point. It then
iterates through the remainder of the point set: any point whose distance from the original
point is < T1 is added to the canopy, and any point whose distance is also < T2 is removed
from the set. In this way points that are very close to the original avoid all further
processing. The algorithm loops until the initial set is empty, accumulating a set of Canopies
that each contain one or more points. A point whose distance lies between T2 and T1 remains in
the set, so a given point may occur in more than one Canopy.
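
To make the roles of T1 and T2 concrete, here is a minimal sequential sketch of this loop in
Java. The point representation and the plain Euclidean distance are illustrative assumptions,
not Mahout's actual types:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

  /** Plain Euclidean distance; any fast approximate measure would do. */
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Runs the canopy loop; each returned canopy is the list of points assigned to it. */
  static List<List<double[]>> canopy(List<double[]> points, double t1, double t2) {
    List<double[]> remaining = new ArrayList<double[]>(points);
    List<List<double[]>> canopies = new ArrayList<List<double[]>>();
    while (!remaining.isEmpty()) {
      double[] center = remaining.remove(0);       // "remove one at random"; the head suffices here
      List<double[]> canopy = new ArrayList<double[]>();
      canopy.add(center);
      List<double[]> next = new ArrayList<double[]>();
      for (double[] p : remaining) {
        double d = distance(center, p);
        if (d < t1) {
          canopy.add(p);                           // within T1: joins this canopy
        }
        if (d >= t2) {
          next.add(p);                             // points closer than T2 drop out of further processing
        }
      }
      remaining = next;
      canopies.add(canopy);
    }
    return canopies;
  }
}
{code}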

Canopy Clustering is often used as an initial step in more rigorous clustering techniques,
such as [k-Means]. Starting from an initial canopy clustering significantly reduces the number
of the more expensive distance measurements, since points outside the initial canopies can be
ignored.

h2. Strategy for parallelization

Looking at the sample Hadoop implementation in [http://code.google.com/p/canopy-clustering/],
the processing is done in three M/R steps (a hypothetical driver chaining them is sketched
below):
# The data is massaged into a suitable input format
# Canopy generation: each mapper performs canopy clustering on the points in its input set and
outputs its canopies' centers; the reducer then clusters those canopy centers to produce the
final canopy centers
# The points are then clustered into these final canopies
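
To make the chaining concrete, here is a hypothetical driver for steps 2 and 3 using Hadoop's
old-style org.apache.hadoop.mapred API; step 1 is assumed to have already produced the "points"
input. The job names and paths are made up, and the mapper, combiner and reducer classes wired
in here are sketched in the design sections below:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CanopyDriver {
  public static void main(String[] args) throws Exception {
    // Step 2: canopy generation. Mappers emit local canopy centers via the
    // combiner; a single reducer clusters those centers into the final canopies.
    JobConf gen = new JobConf(CanopyDriver.class);
    gen.setJobName("canopy-generation");
    gen.setMapperClass(CanopyGeneration.CanopyMapper.class);
    gen.setCombinerClass(CanopyGeneration.CanopyCombiner.class);
    gen.setReducerClass(CanopyGeneration.CanopyReducer.class);
    gen.setNumReduceTasks(1);                 // all centroids must meet in one reducer
    gen.setOutputKeyClass(Text.class);
    gen.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(gen, new Path("points"));
    FileOutputFormat.setOutputPath(gen, new Path("canopies"));
    JobClient.runJob(gen);

    // Step 3: cluster every point into the final canopies from step 2.
    JobConf cluster = new JobConf(CanopyDriver.class);
    cluster.setJobName("canopy-clustering");
    cluster.setMapperClass(ClusterMapper.class);
    cluster.setOutputKeyClass(Text.class);
    cluster.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(cluster, new Path("points"));
    FileOutputFormat.setOutputPath(cluster, new Path("clusters"));
    JobClient.runJob(cluster);
  }
}
{code}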

Some ideas can be found in the [Cluster computing and MapReduce|http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html]
lecture video series \[by Google(r)\]; Canopy Clustering is discussed in [lecture #4|http://www.youtube.com/watch?v=1ZDybXl212Q].
Slides can be found [here|http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt].
Finally, here is the [Wikipedia page|http://en.wikipedia.org/wiki/Canopy_clustering_algorithm].

h2. Design of implementation

The initial implementation accepts input files containing multidimensional points (Float\[\])
represented as comma-terminated values enclosed in brackets (e.g. "\[1.5,2.5,\]"). Processing
is done in two phases: Canopy generation and Clustering.
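
A small helper for this point format might look as follows. This is a sketch for the examples
below, not Mahout's actual parsing code:

{code:java}
// Hypothetical helper for the bracketed point format (e.g. "[1.5,2.5,]").
public class PointUtils {

  /** Parses "[1.5,2.5,]" into {1.5f, 2.5f}; anything after ']' is payload and ignored here. */
  public static float[] parse(String line) {
    String body = line.substring(line.indexOf('[') + 1, line.indexOf(']'));
    String[] parts = body.split(",");          // the trailing comma yields no empty element
    float[] point = new float[parts.length];
    for (int i = 0; i < parts.length; i++) {
      point[i] = Float.parseFloat(parts[i].trim());
    }
    return point;
  }

  /** Formats a point back into the bracketed, comma-terminated form. */
  public static String format(float[] point) {
    StringBuilder sb = new StringBuilder("[");
    for (float v : point) {
      sb.append(v).append(',');
    }
    return sb.append(']').toString();
  }

  /** Plain Euclidean distance; the design allows any fast approximate measure. */
  public static double distance(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}
{code}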

h3. Canopy generation phase

During the map step, each mapper processes a subset of the total points and applies the chosen
distance measure and thresholds to generate canopies. Each point found to be within an existing
canopy is output to a combiner, keyed by that canopy's id. After the points have been sorted by
canopyId key, the combiner sees an iterator over all of the points for each canopyId key. The
combiner sums those points and normalizes the total to produce a canopy centroid, which is
output to a single reducer using a constant key ("centroid"). The reducer receives all of the
initial centroids and again applies the canopy measure and thresholds to produce a final set
of canopy centroids (i.e. it clusters the cluster centroids). The reducer output format is:
canopyId\t\[<canopy-centroid-coordinates>\].
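
Putting this together, a skeletal old-API implementation of the phase might look like the
following. It uses the hypothetical PointUtils helper above, hard-codes the thresholds and
follows the key scheme described in this section; it is a sketch of the design, not Mahout's
code:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CanopyGeneration {

  static final double T1 = 3.0;   // outer threshold (assumption: would really be configurable)
  static final double T2 = 1.5;   // inner threshold

  /** Emits each point keyed by the id of every local canopy that covers it. */
  public static class CanopyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final List<float[]> centers = new ArrayList<float[]>();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      float[] point = PointUtils.parse(value.toString());
      boolean stronglyCovered = false;
      for (int i = 0; i < centers.size(); i++) {
        double d = PointUtils.distance(centers.get(i), point);
        if (d < T1) out.collect(new Text("C" + i), value);   // point belongs to canopy i
        if (d < T2) stronglyCovered = true;                  // no new canopy needed
      }
      if (!stronglyCovered) {                                // start a new local canopy
        centers.add(point);
        out.collect(new Text("C" + (centers.size() - 1)), value);
      }
    }
  }

  /** Sums one canopy's points and emits the normalized centroid under the constant key. */
  public static class CanopyCombiner extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text canopyId, Iterator<Text> points,
                       OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      float[] sum = null;
      int n = 0;
      while (points.hasNext()) {
        float[] p = PointUtils.parse(points.next().toString());
        if (sum == null) sum = new float[p.length];
        for (int i = 0; i < p.length; i++) sum[i] += p[i];
        n++;
      }
      for (int i = 0; i < sum.length; i++) sum[i] /= n;      // normalize to the centroid
      out.collect(new Text("centroid"), new Text(PointUtils.format(sum)));
    }
  }

  /** Single reducer: re-applies the canopy pass to cluster the incoming centroids. */
  public static class CanopyReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> centroids,
                       OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      List<float[]> finals = new ArrayList<float[]>();
      while (centroids.hasNext()) {                          // keep centroids not within T2 of a kept one
        float[] c = PointUtils.parse(centroids.next().toString());
        boolean covered = false;
        for (float[] center : finals) {
          if (PointUtils.distance(center, c) < T2) { covered = true; break; }
        }
        if (!covered) finals.add(c);
      }
      for (int i = 0; i < finals.size(); i++) {              // canopyId \t [coords]
        out.collect(new Text("C" + i), new Text(PointUtils.format(finals.get(i))));
      }
    }
  }
}
{code}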

h3. Clustering phase

During the clustering phase, each mapper reads the canopy centroids produced by the first
phase. Since all mappers share the same canopy definitions, their outputs will be combined
during the shuffle so that each reducer (many are allowed here) sees all of the points assigned
to one or more canopies. The output format will then be: <canopy-definition>\t\[<member-point-coordinates>\]
<payload>. My plan is to include the canopyId, measure, thresholds and centroid in the
<canopy-definition> so that the output is self-descriptive. I also plan to treat any
information encoded in the input points after the coordinate delimiter '\]' as payload, passing
it through the clustering phase without modification.
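
Under the same assumptions as above (the PointUtils helper and hard-coded thresholds), the
clustering-phase mapper might be sketched as follows; how the canopy centroids are distributed
to the mappers is elided:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Assigns every point to the final canopies and passes any payload through. */
public class ClusterMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  static final double T1 = 3.0;                       // same assumption as the generation phase
  private final List<float[]> canopies = new ArrayList<float[]>();

  public void configure(JobConf job) {
    // Elided: load the canopy centroids produced by the first phase into 'canopies',
    // e.g. from a file shipped to every mapper. All mappers see the same canopies.
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String line = value.toString();
    float[] point = PointUtils.parse(line);
    String payload = line.substring(line.indexOf(']') + 1);  // passed through unmodified
    for (int i = 0; i < canopies.size(); i++) {
      if (PointUtils.distance(canopies.get(i), point) < T1) {
        // <canopy-definition> \t [<member-point-coordinates>] <payload>
        out.collect(new Text("C" + i + " " + PointUtils.format(canopies.get(i))),
                    new Text(PointUtils.format(point) + payload));
      }
    }
  }
}
{code}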

---------------------------------------------------------------------