mahout-dev mailing list archives

From "Ted Dunning (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
Date Mon, 20 May 2013 06:29:16 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661807#comment-13661807 ]

Ted Dunning commented on MAHOUT-1214:
-------------------------------------

The accuracy should be quite good if you use a single power step.

You can experiment with the algorithm using its R version [1].

See also Nathan Halko's dissertation and the arxiv paper on the subject [2].

The original JIRA issues [3,4] should be helpful as well.  Attached to these
issues is a description [5] of an early version of the algorithm that was
implemented.  Dmitriy developed alternatives for some of the steps and
implemented a power step to improve accuracy [6].

[1] https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html

[2] http://arxiv.org/abs/0909.4061

[3] https://issues.apache.org/jira/browse/MAHOUT-792

[4] https://issues.apache.org/jira/browse/MAHOUT-797

[5] https://issues.apache.org/jira/secure/attachment/12491074/sd-2.pdf

[6] https://issues.apache.org/jira/secure/attachment/12493978/MAHOUT-797.pdf
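
To make the effect of a power step concrete, here is a minimal sketch of randomized
range finding with optional power iterations, in the style described in the arxiv
paper [2]. It is not Mahout's SSVD code: the helper names (matmul, orthonormalize,
range_basis, residual) and the dense list-of-rows representation are assumptions
made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a dense matrix as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthonormalize(y):
    """Modified Gram-Schmidt on the columns of y.

    Columns that are numerically dependent on earlier ones are dropped.
    """
    basis = []
    for col in transpose(y):
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:
            basis.append([a / norm for a in v])
    return transpose(basis)


def range_basis(a, k, power_steps=1, seed=0):
    """Approximate an orthonormal basis Q for the range of a (m x n).

    Starts from Y = A * Omega with a Gaussian n x k matrix Omega, then
    applies `power_steps` rounds of Q <- orth(A * orth(A^T * Q)).
    """
    rng = random.Random(seed)
    n = len(a[0])
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    q = orthonormalize(matmul(a, omega))
    a_t = transpose(a)
    for _ in range(power_steps):
        q = orthonormalize(matmul(a_t, q))  # n x k
        q = orthonormalize(matmul(a, q))    # m x k
    return q


def residual(a, q):
    """Frobenius norm of A - Q Q^T A, i.e. the part of A outside span(Q)."""
    proj = matmul(q, matmul(transpose(q), a))
    return sum((x - y) ** 2
               for row_a, row_p in zip(a, proj)
               for x, y in zip(row_a, row_p)) ** 0.5
```

Comparing residual(a, range_basis(a, k, power_steps=0)) with the same call using
power_steps=1 on your own matrix is one way to see how much a single power step
helps; the gain depends on how quickly the singular values decay.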

                
> Improve the accuracy of the Spectral KMeans Method
> --------------------------------------------------
>
>                 Key: MAHOUT-1214
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: Mahout 0.7
>            Reporter: Yiqun Hu
>              Labels: clustering, improvement
>
> The current implementation of the spectral KMeans algorithm (Ng et al., NIPS 2002)
> in version 0.7 has two serious issues. Because of them, it fails even on a trivial
> dataset. We have implemented fixes for both issues and hope to contribute them back
> to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of the
> eigenvectors, which is necessary to obtain correct clustering results when K>1.
> We have an idea and an implementation that select eigenvectors based on
> cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and a bad
> initialization will sometimes generate a wrong clustering result. In this case, the
> K selected eigenvectors actually provide a better way to initialize the cluster
> centroids, because each selected eigenvector is a relaxed indicator of the memberships
> of one cluster. For every selected eigenvector, we use the data point whose eigenvector
> component achieves the maximum absolute value as the initial centroid.
> We have already verified our improvement on a synthetic dataset: the improved version
> gets the optimal clustering result, while the current 0.7 version obtains the wrong
> result.
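
The two ideas in the issue description can be sketched as follows. This is only an
illustration of the description above, not the reporter's patch: the function names,
the greedy selection order, and the 0.1 cosine threshold are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def select_orthogonal(eigenvectors, k, max_abs_cos=0.1):
    """Greedily keep up to k vectors that are nearly orthogonal to all kept ones.

    max_abs_cos is a hypothetical tolerance on |cos(angle)|.
    """
    kept = []
    for vec in eigenvectors:
        if all(abs(cosine(vec, other)) <= max_abs_cos for other in kept):
            kept.append(vec)
        if len(kept) == k:
            break
    return kept


def seed_indices(eigenvectors):
    """For each eigenvector, the index of the data point whose component
    has the largest absolute value; that point seeds one cluster."""
    return [max(range(len(vec)), key=lambda i: abs(vec[i]))
            for vec in eigenvectors]


# Toy candidates over four data points: the second nearly duplicates the first.
candidates = [
    [0.9, 0.8, 0.0, 0.0],
    [0.9, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.7, 0.9],
]
selected = select_orthogonal(candidates, k=2)
seeds = seed_indices(selected)
```

In this toy case the near-duplicate second vector is rejected, and the seeds are
data points 0 and 3, one from each apparent cluster.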

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
