mahout-dev mailing list archives

From "Pat Ferrel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1045) Cluster evaluators returning bad results
Date Mon, 16 Jul 2012 16:31:34 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415359#comment-13415359
] 

Pat Ferrel commented on MAHOUT-1045:
------------------------------------

The evaluator code does no error checking and so assumes all input is valid. My style would
be to put checks for edge conditions in the evaluators, for example making sure a denominator
is never 0. This might hide some deeper problems, though.
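
As a minimal sketch of the kind of edge-condition check meant here: the class, the method name
and the formula (mean pairwise distance normalized by the spread between the largest and smallest
distance) are assumptions for illustration only, not the actual ClusterEvaluator code. The point
is that the caller gets an explicit "undefined" instead of a silent NaN.

```java
import java.util.OptionalDouble;

// Hypothetical helper, not part of Mahout.
final class GuardedDensity {
    private GuardedDensity() {}

    /**
     * Returns (mean - min) / (max - min) over all pairwise distances,
     * or an empty result when that ratio is undefined.
     */
    static OptionalDouble normalizedMeanDistance(double[][] points) {
        int n = points.length;
        if (n < 2) {
            return OptionalDouble.empty(); // no pairs to measure
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        int count = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double d = euclidean(points[i], points[j]);
                min = Math.min(min, d);
                max = Math.max(max, d);
                sum += d;
                count++;
            }
        }
        double range = max - min;
        if (!(range > 0.0)) {
            // Zero (all distances equal) or NaN: the ratio would be 0/0 or NaN.
            return OptionalDouble.empty();
        }
        return OptionalDouble.of((sum / count - min) / range);
    }

    // Assumes both vectors have the same length.
    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            s += diff * diff;
        }
        return Math.sqrt(s);
    }
}
```

Returning an empty result rather than substituting 0 keeps the deeper problem visible to the
caller, which can then decide whether to log, skip or fail.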

I assume what you are saying about the same doc name means the same item was chosen five times?
I strongly suspect there will be cases where an identical weighted vector has n different
names, so you can't get away with checking for uniqueness of representative points alone;
you will still have the problem of a singularity (borrowing a physics term) cluster.
The clustering algorithm may even accidentally make the centroid the same as the rest of the
points, and I suspect that would cause different problems. I think these cases are all fairly
likely to come up in large crawls.
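
A rough sketch of detecting such a singularity by vector value rather than by name. The class,
the method and the epsilon tolerance are hypothetical and only illustrate the idea; the names
are deliberately ignored because identical vectors may carry different names.

```java
import java.util.Map;

// Hypothetical helper, not part of Mahout.
final class SingularityCheck {
    private SingularityCheck() {}

    /**
     * True when every vector lies within epsilon (per component) of the
     * first one, regardless of the names the vectors carry. A cluster with
     * zero or one point is also reported as degenerate.
     */
    static boolean isSingularity(Map<String, double[]> namedPoints, double epsilon) {
        double[] first = null;
        for (double[] v : namedPoints.values()) {
            if (first == null) {
                first = v;
                continue;
            }
            if (maxAbsDiff(first, v) > epsilon) {
                return false; // at least two genuinely different vectors
            }
        }
        return true;
    }

    // Largest per-component difference; vectors of unequal length never match.
    static double maxAbsDiff(double[] a, double[] b) {
        if (a.length != b.length) {
            return Double.POSITIVE_INFINITY;
        }
        double max = 0.0;
        for (int k = 0; k < a.length; k++) {
            max = Math.max(max, Math.abs(a[k] - b[k]));
        }
        return max;
    }
}
```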

Not sure what else the pruning process is for, but in this case I'd toss the cluster from the
intra-cluster evaluation but not necessarily from the inter-cluster density eval (though it might
break some math there too). Which leans us towards scrapping the pruning for evaluation, because
it removes the cluster from both calculations and maybe others too?
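
One way to toss a cluster from a single evaluation without pruning it everywhere is to skip
undefined per-cluster values when averaging and leave the cluster list itself alone. This is
only a sketch under that assumption; the class and method names are made up for illustration.

```java
import java.util.OptionalDouble;

// Hypothetical helper, not part of Mahout.
final class DensityAverage {
    private DensityAverage() {}

    /**
     * Averages the per-cluster values, ignoring NaN and infinite entries.
     * Returns empty when no cluster produced a usable value.
     */
    static OptionalDouble meanSkippingUndefined(double[] perClusterValues) {
        double sum = 0.0;
        int count = 0;
        for (double v : perClusterValues) {
            if (Double.isNaN(v) || Double.isInfinite(v)) {
                continue; // skipped for this metric only
            }
            sum += v;
            count++;
        }
        return count == 0 ? OptionalDouble.empty() : OptionalDouble.of(sum / count);
    }
}
```

The skipped clusters stay available to any other evaluation that can still use them.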

If pruning is supposed to catch all undesirable conditions for all evaluations, that seems like
a lot of coupling with the evaluation algorithms, and therefore fragile with respect to changes
in algorithm and data conditions.

So I guess I agree with your last statement.
                
> Cluster evaluators returning bad results
> ----------------------------------------
>
>                 Key: MAHOUT-1045
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1045
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6, 0.7, 0.8
>         Environment: Several environments and data sets
>            Reporter: Pat Ferrel
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1045.patch, first-time-density-nan.txt
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is almost
> always NaN. The CDbw inter-cluster density is almost always 0. I have also seen several cases
> where CDbw fails to return any results but have not tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff Eastman.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
