mahout-user mailing list archives

From Jeff Eastman
Subject Re: Understanding Canopy/Map Reduce
Date Tue, 22 Sep 2009 21:56:34 GMT
Shashikant Kore wrote:
> Hi,
> I am unable to understand how the Canopy clustering works.
> In the Map stage, Canopy.addPointToCanopies() is called for every point
> with the list of canopies. This method adds the point to an existing
> canopy, creates a new one, or both, depending on the distance of the
> vector from the existing canopy centroids. The Map stage outputs all the
> canopy centroids (with key "centroid").
> In the Reduce phase, these centroids will again undergo the same process
> (so merges are possible) and finally the centroids will be output. But I
> see that in CanopyReducer the input values are the input vectors and
> not the centroids received from the Map stage.
> I think I am missing something here. Can you please let me know what it is?
You are correct in most of your analysis, but the vectors processed by 
the reducer are the centroid vectors produced by the mapper(s), not the 
original input vectors. Since each mapper may see only a portion of the 
dataset, the reducer runs a final canopy clustering pass over those 
centroids so that similar centroids from different mappers get coalesced.
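
To make the two passes concrete, here is a minimal Python sketch of the idea. It is not Mahout's Java code: the names add_point_to_canopies, map_phase and reduce_phase, the Euclidean distance, and the thresholds T1 = 3.0 and T2 = 1.5 are all assumptions chosen for illustration. Each "mapper" canopy-clusters its own split and emits local centroids, and the "reducer" runs the same routine over those centroids rather than over the original points.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length tuples (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Canopy:
    """A canopy keeps its founding point as center plus all points added to it."""

    def __init__(self, center):
        self.center = tuple(center)
        self.points = [tuple(center)]

    def centroid(self):
        n = len(self.points)
        return tuple(sum(coords) / n for coords in zip(*self.points))


def add_point_to_canopies(point, canopies, t1, t2):
    """Add the point to every canopy within t1; open a new canopy unless
    the point lies within t2 of some existing canopy center."""
    strongly_bound = False
    for canopy in canopies:
        d = distance(canopy.center, point)
        if d < t1:
            canopy.points.append(tuple(point))
        if d < t2:
            strongly_bound = True
    if not strongly_bound:
        canopies.append(Canopy(point))


def canopy_centroids(vectors, t1, t2):
    """Run one canopy pass over the vectors and return the canopy centroids."""
    canopies = []
    for v in vectors:
        add_point_to_canopies(v, canopies, t1, t2)
    return [c.centroid() for c in canopies]


def map_phase(split, t1, t2):
    # Each mapper sees only its own split and emits local centroids.
    return canopy_centroids(split, t1, t2)


def reduce_phase(mapper_centroids, t1, t2):
    # The reducer's input vectors are the mappers' centroids, not raw points.
    return canopy_centroids(mapper_centroids, t1, t2)


T1, T2 = 3.0, 1.5  # illustrative thresholds, T1 > T2
split_a = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)]
split_b = [(0.2, 0.2), (10.5, 10.0)]

local_centroids = map_phase(split_a, T1, T2) + map_phase(split_b, T1, T2)
final_centroids = reduce_phase(local_centroids, T1, T2)

print(len(local_centroids), "local centroids from the mappers")
print(len(final_centroids), "centroids after the reducer:", final_centroids)
```

With this toy data each mapper finds one canopy near the origin and one near (10, 10), so four local centroids reach the reducer, which coalesces them into two.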

> Note: I am using CanopyDriver utility (and not CanopyClusteringJob).
> Thanks,
> --shashi
