mahout-dev mailing list archives

From "Jeff Eastman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1020) The Cluster Evaluator is returning bad results
Date Fri, 01 Jun 2012 14:51:24 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287461#comment-13287461 ]

Jeff Eastman commented on MAHOUT-1020:
--------------------------------------

It looks like new kmeansOutput and fuzzyKMeansOutput paths were introduced for the clustering
output, but the representative points computation was not updated to match, so no representative
points were produced. I've fixed those two tests, and the CDbw results now look more respectable.
Committing these changes now.
                
> The Cluster Evaluator is returning bad results
> ----------------------------------------------
>
>                 Key: MAHOUT-1020
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1020
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: Various environments and data sets. Mahout 0.6, 0.7 trunk not tested.
>            Reporter: Pat Ferrel
>            Assignee: Jeff Eastman
>             Fix For: 0.7
>
>
> Conversation between Pat Ferrel and Jeff Eastman on the user list
> Hi Pat,
> I don't have a good answer here. Evidently, something in CDbw has become broken and you
> are the first to notice. When I run TestCDbwEvaluator, the values for k-means and fuzzy-k
> are clearly incorrect. The values for Canopy, MeanShift and Dirichlet are not so obviously
> incorrect but I remain suspicious. Something must have become broken in the recent clustering
> refactoring.
> From the method CDbwEvaluator.invalidCluster comment (used to enable pruning):
>    * Return if the cluster is valid. Valid clusters must have more than 2 representative points,
>    * and at least one of them must be different than the cluster center. This is because the
>    * representative points extraction will duplicate the cluster center if it is empty.
> Oddly enough, inspection of the test log indicates that only k-means and fuzzy-k are
> not pruning clusters. Clearly some more investigation is needed. I will take a look at it
> tomorrow. In the meantime, if you develop any additional insight please do share it with us.
> Thanks,
> Jeff
> On 5/17/12 3:53 PM, Pat Ferrel wrote:
> > I built a tool that iterates through a list of values for k on the same data and
> > spits out the CDbw and ClusterEvaluator results each time.
> >
> > When the evaluator or CDbw prunes a cluster, how do I interpret that? They seem
> > to throw out the same clusters on a given run. Also CDbw always returns an inter-cluster
> > density of 0?
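
The pruning rule quoted above from the CDbwEvaluator.invalidCluster comment can be sketched roughly as follows. This is an illustration only, not Mahout's code: the class PruningRuleSketch, the method isValidCluster, and the use of plain double[] arrays with exact equality are all assumptions made for the example. The real evaluator works with Mahout's own cluster and vector types and may compare points differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the validity rule described in the quoted comment.
// None of these names come from Mahout; they are illustrative assumptions.
class PruningRuleSketch {

  /**
   * Returns true when a cluster would be kept under the quoted rule:
   * it has more than 2 representative points, and at least one of them
   * differs from the cluster center.
   *
   * Assumption: points are compared with exact element-wise equality.
   */
  static boolean isValidCluster(double[] center, List<double[]> representativePoints) {
    // Rule 1: more than 2 representative points are required.
    if (representativePoints.size() <= 2) {
      return false;
    }
    // Rule 2: at least one representative point must differ from the center.
    // If extraction only duplicated the center (the empty-cluster case in the
    // comment), every point equals the center and the cluster is pruned.
    for (double[] point : representativePoints) {
      if (!Arrays.equals(point, center)) {
        return true;
      }
    }
    return false;
  }
}
```

Under this reading, a cluster whose representative points are all copies of its center is pruned, which is what the comment says happens when the representative points extraction finds nothing for a cluster.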

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
