mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <>
Subject Re: Memory Issue with KMeans clustering
Date Mon, 07 Feb 2011 19:12:32 GMT
On Mon, Feb 7, 2011 at 10:43 AM, Marc Hadfield <>wrote:

> In the case outlined below, does that mean each node of a hadoop cluster
> would need to have the centroid information fully in memory for k-means, or
> is this spread over the cluster in some way?

Yes.  Every node needs every centroid in memory.

> if each node has to have the centroid information fully in memory, are
> there any other data structures which need to be fully in memory in each
> node, and if so, what are they proportional to (again, specifically for
> k-means)?  i.e. is anything memory resident related to the number of
> documents?

No.  Just centroids.  Of course, if you have sparse centroids, then the
number of non-zero elements will increase roughly with the log of hte number
of documents, but if you have space for the dense version of the centroid,
then nothing should scale with the number of documents.

> If the centroid information (dependent on the number of features and
> clusters) needs to be fully in memory in all hadoop nodes, but not anything
> related to the number of documents, then the k-means algorithm would be
> scalable in the number of documents (just add more hadoop nodes to increase
> document throughput), but *not* scalable in the number of clusters /
> features since the algorithm requires a full copy of this information in
> each node.  is this accurate?


Scalability in the number of features can be achieved by using a hashed

Scalability in the number of centroids can be achieved by changing the code
a bit so that the centroid sets are spread across several nodes.  That would
require come cleverness in the input format so that each split is sent to
several nodes.  An alternative would be to add an extra map-reduce step
where the first reducer is whether the partial classification is done.

My guess is that scaling the number of centroids isn't a great idea beyond a
moderate size because k-means will break down.  Better to do hierarchical
clustering to get very fine distinctions.  That should be doable in a much
more scalable way.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message