mahout-user mailing list archives

From Ted Dunning <>
Subject Re: k-Means questions
Date Thu, 25 Jun 2009 23:00:46 GMT
On Thu, Jun 25, 2009 at 3:49 PM, Grant Ingersoll <> wrote:

> Do people have recommendations for starting clusters (seeds) for k-Means?
> The synthetic control example uses Canopy and I often see Random selection
> mentioned, but I'm wondering what's considered to be best practice for
> obtaining good overall results.

Just picking a random data element for each centroid should work well.
Random assignment (giving every point a random cluster label and averaging)
works much less well, because all of the centroids end up very close to the
mean of the entire data set.  Having the initial centroids separated usually
helps.  Assigning just a few (2-5) elements to each centroid can also work.
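To make the contrast concrete, here is a minimal NumPy sketch of the two seeding strategies described above (function names and the NumPy dependency are my own, not Mahout's):

```python
import numpy as np

def random_element_seeds(data, k, rng=None):
    """Pick k distinct data points as initial centroids
    (the 'random data element' approach)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(data), size=k, replace=False)
    return data[idx]

def random_assignment_seeds(data, k, rng=None):
    """Give every point a random cluster label and average each
    cluster; all k centroids land near the global mean."""
    rng = np.random.default_rng(rng)
    labels = rng.integers(0, k, size=len(data))
    return np.array([data[labels == j].mean(axis=0) for j in range(k)])
```

With well-separated data, the element-picked seeds spread out across the data while the assignment-averaged seeds all sit on top of the global mean, which is why the first approach tends to work better.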

> Also, how best to take the Random approach?  On a small data set, I can
> easily crank out a program to loop through and randomly select vectors, but
> it seems like in an HDFS environment, you'd need a M/R job just to do that
> initial selection of random documents.

Not a big deal.  You don't need to do much even if the data is in non-random
order.   Picking seeds from a short prefix of the data isn't a big problem.
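If you do want a uniform sample in a single pass without knowing the data size up front, reservoir sampling is the standard trick.  This is a sketch of my own (Algorithm R), not something Mahout provides:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length in one sequential pass."""
    rnd = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a reservoir slot with probability k/(i+1).
            j = rnd.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each mapper could sample its split this way and a single reducer could resample the combined candidates, so no separate M/R job over the full data is strictly necessary.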

Back in my parallel computation days (a _long_ time ago) on big old iron, I
> seem to recall there being work on parallel/distributed RNG, is that useful
> here or is that overkill?  Does Hadoop offer tools for this?

The simplest answer is to seed your PRNG with a hash formed from the job
step name and the input split details.
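One concrete reading of that suggestion, sketched in Python (the function name and the particular split fields hashed are illustrative, not a Hadoop API):

```python
import hashlib
import random

def split_rng(job_step_name, split_path, split_offset):
    """Derive a deterministic per-split seed by hashing the job step
    name together with the input split's identity, so each mapper
    gets an independent but reproducible random stream."""
    key = f"{job_step_name}:{split_path}:{split_offset}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)
```

Because the seed depends only on the step name and split identity, a re-run (or a speculative duplicate task) of the same split produces the same random choices, while different splits get uncorrelated streams.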

>  Also, is it just me, or does the KMeansDriver need to take in "k" or is
> this just assumed from the number of initial input clusters?

There should be the same number of initial clusters as there are final
clusters, so k is implied by the number of initial input clusters rather
than passed separately.

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)
