mahout-user mailing list archives

From Grant Ingersoll
Subject Re: question about clustering
Date Thu, 06 Oct 2011 18:54:25 GMT

On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:

> Hi,
> I have used Mahout to run k-means clustering on my tf-idf vectors. I ran it
> from the mahout command line, and it seems to complete successfully:
> $MAHOUT_HOME/bin/mahout kmeans  -i ./tfidf-vectors -c ./initialclusters -o
> ./kmeans-clusters  -cd 1.0 -k 3 -x 1000
> Two cluster directories were generated (cluster-1 and cluster-2). When I
> run clusterdump on each of them, the top terms of the clusters look the
> same. Any idea why?

The top terms are exactly that: the top terms, not all of the terms.  My guess is that
the centroids don't change much between the two iterations, so the top terms come out the same.

> Also, how can I see which documents have been assigned to each cluster?
> Right now, I can see the number of documents assigned but not the complete
> list.

Add the --clustering flag to your kmeans command.  By default, K-Means just calculates the
centroids.  If you want to know which documents belong to each cluster, the --clustering flag
does that.

> Most importantly, for production purposes, I assume it makes sense for
> kmeans to always run on Hadoop to generate the clustering file. But how do I
> consume the results during serving? Ideally, the doc id or query would be
> passed to the server, and the server would return the top documents from
> the same cluster, ranked by score. How do I do this in code?
> Any good examples?

Presumably, you have to load the centroids and/or the clustering results at serving time and
see which cluster the new item belongs to.
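
A minimal sketch of that lookup, in plain Python rather than Mahout's own API. It assumes the centroids, the per-cluster membership lists, and the document vectors have already been exported from the Hadoop output into in-memory dicts. The names (centroids, members, doc_vectors, top_documents) and the toy data are hypothetical, and cosine distance is an assumption: use whatever distance measure the clustering run used.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as {term_index: weight}."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_cluster(vector, centroids):
    """Return the id of the centroid closest to the given vector."""
    return min(centroids, key=lambda cid: cosine_distance(vector, centroids[cid]))


def top_documents(query_vec, centroids, members, doc_vectors, n=10):
    """Assign the query to a cluster, then rank that cluster's documents by distance."""
    cid = nearest_cluster(query_vec, centroids)
    ranked = sorted(
        members[cid],
        key=lambda doc_id: cosine_distance(query_vec, doc_vectors[doc_id]),
    )
    return cid, ranked[:n]


# Toy data standing in for the exported clustering output (hypothetical values).
centroids = {
    "c0": {0: 1.0, 1: 0.2},
    "c1": {2: 1.0, 3: 0.5},
}
members = {
    "c0": ["doc-a", "doc-b"],
    "c1": ["doc-c"],
}
doc_vectors = {
    "doc-a": {0: 0.9, 1: 0.1},
    "doc-b": {0: 0.3, 1: 0.9},
    "doc-c": {2: 0.8, 3: 0.6},
}

query = {0: 1.0}  # a query (or an existing document) as a tf-idf vector
cluster_id, results = top_documents(query, centroids, members, doc_vectors)
print(cluster_id, results)
```

With the toy data, the query lands in cluster c0 and doc-a ranks ahead of doc-b. For a lookup by doc id, pass that document's stored vector as the query; the membership lists are what the --clustering run provides.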

> Thanks a lot,
> Weide

Grant Ingersoll
Lucene Eurocon 2011:
