mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Understanding Canopy/Map Reduce
Date Tue, 22 Sep 2009 21:56:34 GMT
Shashikant Kore wrote:
> Hi,
>
> I am unable to understand how the Canopy clustering works.
>
> In Map stage, Canopy.addPointToCanopies() is called for every point
> with list of canopies. This method adds to the existing canopy or
> creates new one or both depending on the distance of the vector from
> existing canopy centroids.  Map stage outputs all the canopy centroids
> (with key "centroid").
>
> In the reduce phase, these centroids will again undergo the same process
> (so merges are possible) and finally the centroids will be output. But I
> see that in CanopyReducer the input values are the input vectors and
> not the centroids received from the Map stage.
>
> I think I am missing something here. Can you please let me know what it is?
>   
You are correct in most of your analysis, but the vectors the reducer
processes are the centroid vectors emitted by the mapper(s), not the
original input vectors. Because each mapper may see only a portion of
the dataset, the reducer runs a final canopy clustering pass over those
centroids so that similar centroids from different mappers are coalesced.
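
The two-stage flow can be sketched as below. This is a simplified
illustration in Python, not Mahout's Java code: the function names,
the dict-based canopy, the Euclidean distance and the T1/T2 values are
all assumptions made for the example. Only the overall shape (each
mapper emits canopy centroids under the key "centroid", and the reducer
clusters those centroids again) comes from the discussion above.

```python
import math

def distance(a, b):
    # Euclidean distance; an assumption, any distance measure could be used.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def add_point_to_canopies(point, canopies, t1, t2):
    # Hypothetical stand-in for the step described in the question:
    # add the point to every canopy within T1, and start a new canopy
    # unless the point lies within T2 of some existing canopy center.
    strongly_bound = False
    for canopy in canopies:
        d = distance(point, canopy["center"])
        if d < t1:
            canopy["points"].append(point)
        if d < t2:
            strongly_bound = True
    if not strongly_bound:
        canopies.append({"center": point, "points": [point]})

def centroid(points):
    # Component-wise mean of the points in one canopy.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def canopy_centroids(points, t1, t2):
    canopies = []
    for p in points:
        add_point_to_canopies(p, canopies, t1, t2)
    return [centroid(c["points"]) for c in canopies]

def map_phase(split, t1, t2):
    # Each mapper sees only its own split and emits its local centroids
    # under a single shared key, so one reducer receives all of them.
    return [("centroid", c) for c in canopy_centroids(split, t1, t2)]

def reduce_phase(values, t1, t2):
    # The reducer's input values are the mappers' centroids, not the
    # original vectors; clustering them again coalesces similar ones.
    return canopy_centroids(values, t1, t2)

def run(splits, t1, t2):
    mapped = [kv for split in splits for kv in map_phase(split, t1, t2)]
    values = [v for k, v in mapped if k == "centroid"]
    return reduce_phase(values, t1, t2)

if __name__ == "__main__":
    splits = [
        [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0)],  # seen by mapper 1
        [(0.0, 0.1), (10.0, 10.1)],              # seen by mapper 2
    ]
    for c in run(splits, t1=3.0, t2=1.5):
        print(c)
```

With these two splits each mapper finds two local canopies, so the
reducer receives four centroids and merges them into two, one near the
origin and one near (10, 10).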


> Note: I am using CanopyDriver utility (and not CanopyClusteringJob).
>
> Thanks,
>
> --shashi
>
>
>   

