mahout-dev mailing list archives

From "Jeff Eastman (JIRA)" <>
Subject [jira] Commented: (MAHOUT-236) Cluster Evaluation Tools
Date Tue, 20 Apr 2010 19:10:51 GMT


Jeff Eastman commented on MAHOUT-236:

I'm running into a challenge integrating Fuzzy KMeans (and Dirichlet) into this evaluator.
Currently the clustering step of fuzzyK emits the vector as the key and a FuzzyKMeansOutput
writable as the value of the sequence file. This is backwards from the [clusterId :: VectorWritable]
encoding that the patch uses for Canopy and KMeans. Also, the Fuzzy...Output bean contains
all of the clusters and the probability that the vector is a member of each, which makes it
rather large to be a key.

For CDbw to find the reference points it really needs to iterate over [clusterId :: VectorWritable]
pairs, and this raises the question of what to do with fuzzy membership. I don't know if CDbw
can be adjusted to handle fuzziness in general, but it will probably work with some points
assigned to more than one cluster. Does it make sense to apply a settable threshold to the
clustering step so that all points with cluster membership probability > threshold would
be assigned to that cluster?
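
A minimal sketch of the thresholding idea, in plain Java with no Mahout types. The class
ThresholdAssignment, the method assign, and the use of double[] in place of VectorWritable are
all hypothetical names chosen for illustration; this is not the patch's code. It assumes each
point comes with one membership probability per cluster and groups the points by cluster id,
mirroring the [clusterId :: VectorWritable] layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: turns fuzzy memberships into
// hard (clusterId -> points) assignments using a settable threshold.
final class ThresholdAssignment {

    private ThresholdAssignment() {}

    /**
     * @param points      points[i] is the i-th input vector
     * @param memberships memberships[i][c] is the probability that point i belongs to cluster c
     * @param threshold   a point is assigned to every cluster whose probability is strictly greater
     * @return map from cluster id to the points assigned to it; a point may appear under several ids
     */
    static Map<Integer, List<double[]>> assign(double[][] points,
                                               double[][] memberships,
                                               double threshold) {
        if (points.length != memberships.length) {
            throw new IllegalArgumentException("one membership row is required per point");
        }
        Map<Integer, List<double[]>> byCluster = new TreeMap<>();
        for (int i = 0; i < points.length; i++) {
            for (int c = 0; c < memberships[i].length; c++) {
                if (memberships[i][c] > threshold) {
                    // Emit the pair (clusterId = c, vector = points[i]).
                    byCluster.computeIfAbsent(c, k -> new ArrayList<>()).add(points[i]);
                }
            }
        }
        return byCluster;
    }
}
```

With this rule a point whose memberships all fall at or below the threshold is emitted under
no cluster at all, so the threshold would need to be chosen with that in mind.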

This would also work for Dirichlet. To implement it in fuzzyK I would need to change the FuzzyKMeansClusterer
and FuzzyKMeansClusterMapper to match the other clustering jobs.

Does this make sense?

> Cluster Evaluation Tools
> ------------------------
>                 Key: MAHOUT-236
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Grant Ingersoll
>         Attachments: MAHOUT-236.patch
> Per,
it would be great to have some utilities to help evaluate the effectiveness of clustering.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
