mahout-dev mailing list archives

From "Pat Ferrel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1045) Cluster evaluators returning bad results
Date Sun, 15 Jul 2012 22:24:34 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414795#comment-13414795 ]

Pat Ferrel commented on MAHOUT-1045:
------------------------------------

I think your conclusion about one NaN leading to average = NaN is correct. Sean and I suspected
as much above. I described several things you can get in the real world that cause trouble:
clusters with many identical points, and points with no dimensions.

Looking at 33465, this seems to be the case. I would suggest that these clusters either have
bad representative points or are not valid clusters for inclusion in the average intra-cluster
density. If they have a centroid with dimensions, they would still be useful for inter-cluster
density. But just because one cluster has a NaN intra-cluster density doesn't mean the whole
intra-cluster density average should be horked. Why not just remove them from consideration?

It seems to me this is a case of a bunch of docs with the same content forming a cluster.
This will happen all the time in real-world crawls, as I said above. But maybe an infinite
density should be considered a fringe case and removed from the average.
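
To make the suggestion concrete, here is a minimal sketch of averaging per-cluster densities while leaving out the degenerate ones. It is not Mahout code: the class DensityAverages and the methods naiveMean and finiteMean are hypothetical names used only for illustration, and it assumes the per-cluster densities have already been computed into an array.

```java
// Hypothetical helper, not part of Mahout: compares a plain mean with a mean
// that skips NaN and infinite per-cluster densities.
final class DensityAverages {

    private DensityAverages() {}

    /** Plain mean: a single NaN input makes the whole result NaN. */
    static double naiveMean(double[] densities) {
        double sum = 0.0;
        for (double d : densities) {
            sum += d;
        }
        return sum / densities.length;
    }

    /** Mean over finite values only; returns NaN if no finite value exists. */
    static double finiteMean(double[] densities) {
        double sum = 0.0;
        int count = 0;
        for (double d : densities) {
            if (Double.isNaN(d) || Double.isInfinite(d)) {
                continue; // degenerate cluster: leave it out of the average
            }
            sum += d;
            count++;
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

With densities such as {0.5, NaN, 0.7, Infinity}, naiveMean returns NaN while finiteMean averages only 0.5 and 0.7. Whether skipped clusters should also be logged or counted separately is a design choice this sketch leaves open.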
                
> Cluster evaluators returning bad results
> ----------------------------------------
>
>                 Key: MAHOUT-1045
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1045
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6, 0.7, 0.8
>         Environment: Several environments and data sets
>            Reporter: Pat Ferrel
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1045.patch, first-time-density-nan.txt
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is almost
> always NaN. The CDbw inter-cluster density is almost always 0. I have also seen several cases
> where CDbw fails to return any results but have not tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff Eastman.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
