mahout-dev mailing list archives

From "Yiqun Hu (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
Date Tue, 18 Jun 2013 03:41:20 GMT


Yiqun Hu commented on MAHOUT-1214:

In our case, it is actually 3. Anyway, before raising the ticket, I will generate a test case
to verify this first. Since you say that size() should be the cardinality, I think we will double-check
the way the eigenvectors are read to see whether we made a mistake there that causes
size() to equal 3 instead of 6 in our case.

On Tue, Jun 18, 2013 at 11:27 AM, Robin Anil (JIRA) <>

> Improve the accuracy of the Spectral KMeans Method
> --------------------------------------------------
>                 Key: MAHOUT-1214
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: Mahout 0.7
>            Reporter: Yiqun Hu
>            Assignee: Robin Anil
>              Labels: clustering, improvement
>             Fix For: 0.8
>         Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., NIPS 2002)
> in version 0.7 has two serious issues. These two implementation errors make it fail even
> on a trivial dataset. We have implemented a solution that resolves both issues
> and hope to contribute it back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of the eigenvectors,
> which is necessary to obtain correct clustering results when K>1. We have
> an idea and an implementation that select eigenvectors based on cosAngle/orthogonality.
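
To illustrate the kind of check Issue 1 describes, here is a minimal sketch in plain Java. It is not the code from MAHOUT-1214.patch and does not use the Mahout API. The class name OrthogonalEigenSelector, the greedy selection order, and the maxAbsCos threshold are all assumptions made for illustration: a candidate eigenvector is kept only if the absolute cosine of its angle to every eigenvector already kept stays below the threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: greedily keeps eigenvectors
// that are (nearly) orthogonal to the ones already selected.
final class OrthogonalEigenSelector {
    private OrthogonalEigenSelector() {}

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Cosine of the angle between a and b; NaN if either vector is all zeros.
    static double cosAngle(double[] a, double[] b) {
        double denom = Math.sqrt(dot(a, a) * dot(b, b));
        return denom == 0.0 ? Double.NaN : dot(a, b) / denom;
    }

    // Returns the indices of up to k candidates, scanned in the given order,
    // whose |cosAngle| to every previously chosen candidate is <= maxAbsCos.
    static List<Integer> selectOrthogonal(double[][] candidates, int k, double maxAbsCos) {
        List<Integer> chosen = new ArrayList<>();
        for (int i = 0; i < candidates.length && chosen.size() < k; i++) {
            if (dot(candidates[i], candidates[i]) == 0.0) {
                continue; // a zero vector has no direction, so skip it
            }
            boolean nearOrthogonal = true;
            for (int j : chosen) {
                if (Math.abs(cosAngle(candidates[i], candidates[j])) > maxAbsCos) {
                    nearOrthogonal = false;
                    break;
                }
            }
            if (nearOrthogonal) {
                chosen.add(i);
            }
        }
        return chosen;
    }
}
```

The candidates are assumed to arrive already ordered by preference (for example by eigenvalue), so the greedy scan keeps the best-ranked vector from each group of nearly parallel ones.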
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and a bad
> initialization will sometimes generate a wrong clustering result. In this case, the K selected eigenvectors
> actually provide a better way to initialize the cluster centroids, because each selected eigenvector
> is a relaxed indicator of the memberships of one cluster. For every selected eigenvector,
> we use the data point whose eigenvector component achieves the maximum absolute value as the initial centroid.
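
The seeding idea in Issue 2 can be sketched as follows. Again this is plain Java under stated assumptions, not the patch itself: EigenSeedInit is a hypothetical name, eigenvectors[j][i] is assumed to hold the component of the j-th selected eigenvector for data point i, and points[i] is assumed to be the coordinates of data point i in whatever space KMeans runs on.

```java
// Hypothetical helper, not part of Mahout: picks one initial centroid per
// selected eigenvector, namely the data point with the largest |component|.
final class EigenSeedInit {
    private EigenSeedInit() {}

    // Index of the entry with the largest absolute value (first one on ties).
    static int argMaxAbs(double[] v) {
        int best = 0;
        for (int i = 1; i < v.length; i++) {
            if (Math.abs(v[i]) > Math.abs(v[best])) {
                best = i;
            }
        }
        return best;
    }

    // One centroid per eigenvector; each centroid is a copy of the chosen point.
    static double[][] initialCentroids(double[][] points, double[][] eigenvectors) {
        double[][] centroids = new double[eigenvectors.length][];
        for (int j = 0; j < eigenvectors.length; j++) {
            centroids[j] = points[argMaxAbs(eigenvectors[j])].clone();
        }
        return centroids;
    }
}
```

This sketch does not guard against two eigenvectors selecting the same data point, which would produce duplicate centroids; a real implementation would need to decide how to handle that case.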
> We have already verified our improvement on a synthetic dataset, and it shows that the improved
> version gets the optimal clustering result while the current 0.7 version obtains the wrong
