mahout-user mailing list archives

From Marc Hadfield <hadfield.m...@gmail.com>
Subject Re: Memory Issue with KMeans clustering
Date Mon, 07 Feb 2011 19:18:24 GMT
Great, thanks for the info!




On Mon, Feb 7, 2011 at 2:12 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

>
>
> On Mon, Feb 7, 2011 at 10:43 AM, Marc Hadfield <hadfield.marc@gmail.com>wrote:
>
>>
>>
>> In the case outlined below, does that mean each node of a hadoop cluster
>> would need to have the centroid information fully in memory for k-means, or
>> is this spread over the cluster in some way?
>>
>
> Yes.  Every node needs every centroid in memory.
>
>
>>
>> if each node has to have the centroid information fully in memory, are
>> there any other data structures which need to be fully in memory in each
>> node, and if so, what are they proportional to (again, specifically for
>> k-means)?  i.e. is anything memory resident related to the number of
>> documents?
>>
>
> No.  Just centroids.  Of course, if you have sparse centroids, then the
> number of non-zero elements will increase roughly with the log of the number
> of documents, but if you have space for the dense version of the centroids,
> then nothing should scale with the number of documents.
>
>
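For a rough sense of what "every node needs every centroid" costs, here is a back-of-the-envelope sketch; the cluster count, feature count, and 8-byte doubles are illustrative assumptions, not Mahout specifics:

```python
# Rough per-node memory estimate for dense k-means centroids.
# k, d, and bytes_per_value are made-up illustrative numbers.
def centroid_memory_bytes(k, d, bytes_per_value=8):
    """Memory needed to hold k dense centroids of d features each."""
    return k * d * bytes_per_value

k = 1000         # hypothetical number of clusters
d = 100_000      # hypothetical number of features
mb = centroid_memory_bytes(k, d) / (1024 ** 2)
print(f"~{mb:.0f} MB per node just for the centroids")  # → ~763 MB
```

At that (hypothetical) scale the centroids alone approach a gigabyte of heap on every mapper, which is why the feature and cluster counts, not the document count, bound the per-node memory.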
>>
>> If the centroid information (dependent on the number of features and
>> clusters) needs to be fully in memory in all hadoop nodes, but not anything
>> related to the number of documents, then the k-means algorithm would be
>> scalable in the number of documents (just add more hadoop nodes to increase
>> document throughput), but *not* scalable in the number of clusters /
>> features since the algorithm requires a full copy of this information in
>> each node.  is this accurate?
>>
>
> Yes.
>
> Scalability in the number of features can be achieved by using a hashed
> encoding.
>
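The hashed encoding Ted mentions can be sketched generically like this; this is the standard hashing trick, not Mahout's actual encoder, and the dimension and hash function are assumptions:

```python
import hashlib

# Generic hashing-trick sketch: fold an unbounded feature space into a
# fixed-size vector, so centroid size stops growing with the vocabulary.
# The dimension (2**18) and md5 hash are illustrative choices.
def hashed_vector(tokens, dim=2**18):
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0  # collisions simply add; totals are preserved
    return vec

v = hashed_vector(["apache", "mahout", "kmeans", "apache"])
print(len(v), sum(v))  # → 262144 4.0
```

The vector length is fixed up front, so centroid memory is bounded no matter how many distinct features the corpus produces.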
> Scalability in the number of centroids can be achieved by changing the code
> a bit so that the centroid sets are spread across several nodes.  That would
> require some cleverness in the input format so that each split is sent to
> several nodes.  An alternative would be to add an extra map-reduce step
> in which the first reduce performs the partial classification.
>
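The two-stage idea can be sketched in plain Python; each "shard" stands in for the centroid subset one node holds, and the merge stands in for the extra reduce that combines the partial classifications (all names here are hypothetical, not Mahout code):

```python
# Sketch: each node holds a shard of the centroids and emits its local
# best (distance, centroid_id); a second reduce-like step takes the
# global minimum over the partial results.
def nearest_in_shard(point, shard):
    """shard: list of (centroid_id, centroid) pairs held by one node."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min((dist2(point, c), cid) for cid, c in shard)

def global_nearest(point, shards):
    # The extra "reduce": merge the per-shard partial classifications.
    return min(nearest_in_shard(point, s) for s in shards)[1]

shards = [[(0, [0.0, 0.0]), (1, [5.0, 5.0])],
          [(2, [1.0, 1.0]), (3, [9.0, 9.0])]]
print(global_nearest([1.2, 0.9], shards))  # → 2
```

Because `min` over (distance, id) pairs is associative, the merge step produces the same assignment as a single node holding all centroids would.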
> My guess is that scaling the number of centroids isn't a great idea beyond
> a moderate size because k-means will break down.  Better to do hierarchical
> clustering to get very fine distinctions.  That should be doable in a much
> more scalable way.
>
