mahout-user mailing list archives

From Jeff Eastman <>
Subject Re: MAHOUT-236 Cluster Evaluation Tools?
Date Fri, 09 Apr 2010 17:51:01 GMT
That's not what I get from the paper. Certainly, the cluster center is 
the first representative point. But the paper talks about subsequently 
iterating through the clustered points to find the farthest point from 
the previously-selected representative points (RPs) and then adding that 
as another representative point. After a few such iterations, a set of 
RPs is developed for each cluster that defines the extreme points 
observed within the cluster. This is especially useful for non-spherical 
clusters, such as those returned by mean shift and Dirichlet asymmetric 
models. Then, in the final stage, the RPs in each cluster are compared 
and the closest RPs are used to compute CDbw. The final calculation can 
be done in memory since the number of clusters and RPs is well-bounded 
by then.
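The per-cluster selection described above is a farthest-point heuristic. A minimal in-memory sketch (illustrative Python, not Mahout code; `farthest_point_rps` and its arguments are hypothetical names, and Euclidean distance stands in for whatever DistanceMeasure is configured):

```python
import math

def farthest_point_rps(points, center, num_rps):
    """Select representative points (RPs) for one cluster: start from
    the cluster center, then repeatedly add the clustered point that is
    farthest from all previously selected RPs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    rps = [center]
    for _ in range(num_rps - 1):
        # A point's distance to the RP set is its distance to the nearest RP;
        # the next RP is the point maximizing that distance.
        farthest = max(points, key=lambda p: min(dist(p, rp) for rp in rps))
        rps.append(farthest)
    return rps
```

Each pass over `points` here corresponds to one iteration over the full clustered-point set, which is what forces the MR formulation below.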

I get that each RP iteration takes place over all of the clustered 
points and so would require its own MR job. I imagine initializing the 
mappers and reducers with the set of clusters and their RPs. Each 
mapper then processes a subset of all clustered points, finally 
outputting the farthest point it has seen for each cluster. The reducer 
collects these candidates and selects the absolutely most distant one 
per cluster, outputting it with the clusters+RPs for the next 
iteration. This is a lot like the way Dirichlet works now, outputting 
state to be used for the next iteration over the entire point set. We 
would need to allow a DistanceMeasure to be specified for this phase.
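To make the map/reduce split concrete, here is a sketch of one RP iteration in plain Python (not Hadoop code; `mapper`, `reducer`, and the `(clusterId, point)` input shape are assumptions for illustration, again with Euclidean distance in place of a pluggable DistanceMeasure):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mapper(point_subset, rps_by_cluster):
    """Mapper: initialized with the current clusters and their RPs.
    For each cluster, emit the point in this subset that is farthest
    from that cluster's RP set, as (point, distance)."""
    out = {}
    for cluster_id, point in point_subset:
        d = min(dist(point, rp) for rp in rps_by_cluster[cluster_id])
        if cluster_id not in out or d > out[cluster_id][1]:
            out[cluster_id] = (point, d)
    return out

def reducer(mapper_outputs):
    """Reducer: across all mapper candidates, keep the absolutely most
    distant point per cluster; that point becomes the next RP."""
    best = {}
    for out in mapper_outputs:
        for cluster_id, (point, d) in out.items():
            if cluster_id not in best or d > best[cluster_id][1]:
                best[cluster_id] = (point, d)
    return {cid: point for cid, (point, d) in best.items()}
```

The reducer's output (new RPs merged into the clusters+RPs state) would be written out to seed the mappers of the next iteration, mirroring how Dirichlet carries its model state between iterations.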

Currently, only canopy and kMeans actually produce their clustered 
points. Dirichlet points could be clustered by assigning each point to 
the model with the largest pdf (or even to more than one based upon a 
user-settable pdf threshold). Fuzzy kMeans would need to make similar 
assignments. MeanShift point ids are currently retained in its cluster 
state but there is no step to build clustered points like canopy and 
kMeans do. Some work would be needed here too, as we need a uniform 
representation for clustered points.
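The pdf-based assignment for Dirichlet (and similarly Fuzzy kMeans) could look roughly like this — a hedged sketch where `assign_by_pdf` is a hypothetical name and models are stood in by plain callables returning a pdf value:

```python
def assign_by_pdf(point, models, threshold=None):
    """Assign a point to cluster(s) by model pdf. With no threshold,
    assign to the single model with the largest pdf; with a
    user-settable threshold, assign to every model whose share of the
    total pdf mass meets it, allowing multiple assignments."""
    pdfs = [m(point) for m in models]
    if threshold is None:
        return [max(range(len(models)), key=lambda i: pdfs[i])]
    total = sum(pdfs)
    return [i for i, p in enumerate(pdfs) if p / total >= threshold]
```

Whatever form this takes, emitting the result in the same clustered-point representation that canopy and kMeans use would give the evaluation job a single input format.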

Finally, I'd like to review the output file naming conventions across 
all the clustering algorithms and converge on a single nomenclature that 
is common across all jobs.

Robin Anil wrote:
> Cluster center itself is a representative point. One pass over the data will
> get us points close enough to it. Or exhaustively, we can just add it in the
> Kmeans Mapper and update a counter maybe?
> Robin
