mahout-user mailing list archives

From sharath jagannath
Subject Re: Clustering with KMeans
Date Wed, 09 Feb 2011 06:29:02 GMT
Yeah, that is actually the case. But both sets came from the same dataset.
My dataset is rather small: I had around 1200 data points. I clustered
1190 of them in the first run
and used the remaining 10 as the test data.

I used the same set of vectorizers and drivers for both of them.
The only thing I did not do in the test phase was create canopies from the
new data points.
I used the canopies that were created in the training phase and guessed it
would work. Since the

thus the reason why they have different sizes. But I assumed the algorithm
works that way.
Correct me if I am wrong.

Things work fine if I create the canopies from training data + test data.
But I really do not want to do that unless it is the right way.
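
To make the test-phase setup concrete, here is a minimal Python sketch of assigning held-out points to centroids that came out of the training run, without building new canopies. It is plain Python, not Mahout code: the function names, the Euclidean distance, and the toy vectors are all assumptions made for illustration. The size check only shows that training and test vectors need the same number of dimensions for a distance to be computed; it does not claim to explain the actual failure.

```python
import math

def euclidean(a, b):
    # Vectors must share one dimensionality, otherwise no distance is defined.
    if len(a) != len(b):
        raise ValueError(f"vector sizes differ: {len(a)} vs {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_nearest(point, centroids):
    """Return (index, distance) of the closest training centroid."""
    distances = [euclidean(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

# Hypothetical centroids from the training run and two held-out points.
train_centroids = [[0.0, 0.0], [10.0, 10.0]]
test_points = [[1.0, 0.5], [9.0, 11.0]]

assignments = [assign_to_nearest(p, train_centroids)[0] for p in test_points]
```

If the test vectors were produced with a different dictionary than the training vectors, a check like the one in `euclidean` would be the first thing to fail.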

I read about the following online clustering method in Mahout in Action and
was trying to build such a system before I start doing anything else:

1. Cluster 1 million articles as above and save the cluster centroids for
all clusters

2. Periodically, for each new article, use canopy clustering to assign it to
the cluster whose centroid is closest based on a very small distance
threshold. This ensures that articles on topics that occurred previously are
associated with that topic cluster and shown instantly on the website. These
documents are removed from the new document list.

3. The leftover documents, which are not associated with any old cluster,
form new canopies. These canopies represent new topics that appeared in the
news that have little or no match with any articles that we have from the

4. Use the new canopy centroids and cluster the articles that are not
associated with any of the old clusters and add these temporary cluster
centroids to our centroid list.

5. Less frequently, execute the full batch clustering to re-cluster the
entire set of documents. While doing so, it is useful to keep all previous
cluster centroids as input to the algorithm so that clustering achieves
faster convergence.
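
Steps 2 to 4 above can be sketched roughly as follows. This is a plain Python illustration and not Mahout's canopy or k-means implementation: the helper names, the thresholds, and the toy 2-D "articles" are hypothetical, and the canopy step is simplified to a single greedy threshold (a full canopy pass uses two thresholds, T1 and T2).

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def assign_or_hold(new_points, centroids, threshold):
    """Step 2: attach a point to its closest old centroid if it is near enough."""
    assigned, leftover = {}, []
    for i, p in enumerate(new_points):
        best = min(range(len(centroids)), key=lambda k: distance(p, centroids[k]))
        if distance(p, centroids[best]) <= threshold:
            assigned[i] = best      # known topic: index of the old cluster
        else:
            leftover.append(p)      # possibly a new topic
    return assigned, leftover

def simple_canopies(points, t2):
    """Steps 3-4, simplified: greedy single-threshold canopies over the leftovers."""
    remaining = list(points)
    centroids = []
    while remaining:
        seed = remaining[0]
        members = [p for p in remaining if distance(seed, p) <= t2]
        centroids.append(mean(members))
        remaining = [p for p in remaining if distance(seed, p) > t2]
    return centroids

# Hypothetical saved centroids (step 1) and a batch of new articles.
old_centroids = [[0.0, 0.0], [10.0, 10.0]]
new_articles = [[0.2, 0.1], [50.0, 50.0], [50.5, 49.5], [9.9, 10.2]]

assigned, leftover = assign_or_hold(new_articles, old_centroids, threshold=1.0)
new_centroids = simple_canopies(leftover, t2=2.0)
centroid_list = old_centroids + new_centroids  # temporary centroids added (step 4)
```

Step 5, the periodic full re-clustering seeded with the accumulated centroids, is left out here because it is just the batch job run again with `centroid_list` as the starting point.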

I have not made much progress though :D. I would have liked to see this
working by now.

