mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Judging the quality of clustering
Date Thu, 17 May 2012 21:33:13 GMT
Hi Pat,

I don't have a good answer here. Evidently, something in CDbw is broken
and you are the first to notice. When I run TestCDbwEvaluator, the values
for k-means and fuzzy-k are clearly incorrect. The values for Canopy,
MeanShift and Dirichlet are not so obviously incorrect, but I remain
suspicious. Something must have broken in the recent clustering
refactoring.

From the comment on the method CDbwEvaluator.invalidCluster (used to
enable pruning):
    * Return if the cluster is valid. Valid clusters must have more than 2
    * representative points, and at least one of them must be different than
    * the cluster center. This is because the representative points extraction
    * will duplicate the cluster center if it is empty.
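
To make that rule concrete, here is a minimal standalone sketch of the check the comment describes. It is not the actual Mahout source: the class name, the method name isValidCluster, and the use of plain double[] arrays in place of Mahout's vector types are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical stand-in for the rule in the comment,
// not CDbwEvaluator itself.
final class ClusterValiditySketch {

    // A cluster is treated as valid when it has more than two representative
    // points and at least one of them differs from the cluster center.
    public static boolean isValidCluster(double[] center, List<double[]> representativePoints) {
        if (representativePoints.size() <= 2) {
            return false;
        }
        for (double[] point : representativePoints) {
            if (!Arrays.equals(point, center)) {
                return true;
            }
        }
        // Every representative point is a copy of the center, which is what
        // extraction produces for an empty cluster.
        return false;
    }

    public static void main(String[] args) {
        double[] center = {1.0, 1.0};
        List<double[]> duplicated = Arrays.asList(
            new double[] {1.0, 1.0}, new double[] {1.0, 1.0}, new double[] {1.0, 1.0});
        List<double[]> spread = Arrays.asList(
            new double[] {1.0, 1.0}, new double[] {2.0, 0.5}, new double[] {0.0, 1.5});

        System.out.println(isValidCluster(center, duplicated)); // false: only copies of the center
        System.out.println(isValidCluster(center, spread));     // true: two points differ from the center
    }
}
```

Under this reading, a cluster whose representative points are all copies of its center would be pruned, as would any cluster with two or fewer representative points.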

Oddly enough, inspection of the test log indicates that only k-means and
fuzzy-k are not pruning clusters. Clearly some more investigation is
needed. I will take a look at it tomorrow. In the meantime, if you
develop any additional insight, please do share it with us.

Thanks,
Jeff

On 5/17/12 3:53 PM, Pat Ferrel wrote:
> I built a tool that iterates through a list of values for k on the 
> same data and spits out the CDbw and ClusterEvaluator results each time.
>
> When the evaluator or CDbw prunes a cluster, how do I interpret that? 
> They seem to throw out the same clusters on a given run. Also CDbw 
> always returns an inter-cluster density of 0?
>
> On 5/17/12 5:58 AM, Jeff Eastman wrote:
>> Yes, that is the paper I used to implement CDbw. I've tried it a few 
>> times along with the simpler ClusterEvaluator metrics I took from 
>> Mahout In Action and they look to be reasonable - see the tests - 
>> though I have no way to judge their absolute values. Anything you can 
>> contribute in this area would be most welcome. Perhaps a wiki page?
>>
>>
>> On 5/16/12 1:14 PM, Pat Ferrel wrote:
>>> The reference was in the code for 
>>> http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>>>
>>> On 5/16/12 9:56 AM, Pat Ferrel wrote:
>>>> Thanks, I've been looking at that. Is there a description of how to 
>>>> interpret those values? An academic paper maybe? The intra-cluster 
>>>> distance intuitively seems to correspond to something like 
>>>> cohesion. I don't get the intuition behind inter-cluster distances 
>>>> but Ted thinks they are the most important.
>>>>
>>>> On 5/16/12 7:32 AM, Jeff Eastman wrote:
>>>>> Mahout has a ClusterEvaluator and a CDbwEvaluator that compute 
>>>>> some quality metrics (inter-cluster distance, 
>>>>> intra-cluster-distance, ...) that you may find useful. Both 
>>>>> calculate a set of representative points from the clustering 
>>>>> output and compute the (n^2) metrics over these points rather than 
>>>>> all of the points in each cluster.
>>>>>
>>>>> On 5/15/12 4:46 PM, Pat Ferrel wrote:
>>>>>> So many questions about best k, how to choose t1 and t2, and how
>>>>>> much help dimensional reduction is would have clear answers if we
>>>>>> had a way to judge the quality of clusters.
>>>>>>
>>>>>> Various methods were discussed here for a time: 
>>>>>> http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
>>>>>>
>>>>>> Has there been any work on building a measure of quality?
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>
>

