From: Marc Hadfield
Reply-To: marc@alitora.com
To: Ted Dunning
Cc: user@mahout.apache.org
Date: Mon, 7 Feb 2011 14:18:24 -0500
Subject: Re: Memory Issue with KMeans clustering

Great, thanks for the info!

On Mon, Feb 7, 2011 at 2:12 PM, Ted Dunning wrote:

> On Mon, Feb 7, 2011 at 10:43 AM, Marc Hadfield wrote:
>
>> In the case outlined below, does that mean each node of a Hadoop cluster
>> would need to have the centroid information fully in memory for k-means,
>> or is this spread over the cluster in some way?
>
> Yes. Every node needs every centroid in memory.
>
>> If each node has to have the centroid information fully in memory, are
>> there any other data structures that need to be fully in memory on each
>> node, and if so, what are they proportional to (again, specifically for
>> k-means)? I.e., is anything memory-resident related to the number of
>> documents?
>
> No. Just centroids. Of course, if you have sparse centroids, then the
> number of non-zero elements will increase roughly with the log of the
> number of documents, but if you have space for the dense version of the
> centroids, then nothing should scale with the number of documents.
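To put rough numbers on the "every node needs every centroid in memory" point, here is a back-of-the-envelope sizing sketch. This is not Mahout code, and the cluster count and vocabulary size are made-up illustrative values:

```java
// Sizing sketch (illustrative, not Mahout code): dense centroids are
// k x d doubles, and every Hadoop node holds a full copy of all of them,
// regardless of how many documents are being clustered.
public class CentroidMemoryEstimate {
    static long denseCentroidBytes(long k, long d) {
        return k * d * Double.BYTES; // 8 bytes per vector component
    }

    public static void main(String[] args) {
        // e.g. 1,000 clusters over a 100,000-term vocabulary (hypothetical)
        long bytes = denseCentroidBytes(1_000, 100_000);
        System.out.println(bytes / (1024 * 1024) + " MB per node"); // prints "762 MB per node"
    }
}
```

The footprint depends only on k and the feature dimension d, never on the document count, which is why document throughput scales by adding nodes.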
>> If the centroid information (dependent on the number of features and
>> clusters) needs to be fully in memory on all Hadoop nodes, but nothing
>> related to the number of documents does, then the k-means algorithm would
>> be scalable in the number of documents (just add more Hadoop nodes to
>> increase document throughput), but *not* scalable in the number of
>> clusters / features, since the algorithm requires a full copy of this
>> information on each node. Is this accurate?
>
> Yes.
>
> Scalability in the number of features can be achieved by using a hashed
> encoding.
>
> Scalability in the number of centroids can be achieved by changing the
> code a bit so that the centroid sets are spread across several nodes.
> That would require some cleverness in the input format so that each split
> is sent to several nodes. An alternative would be to add an extra
> map-reduce step where the first reduce is where the partial classification
> is done.
>
> My guess is that scaling the number of centroids isn't a great idea
> beyond a moderate size, because k-means will break down. Better to do
> hierarchical clustering to get very fine distinctions. That should be
> doable in a much more scalable way.
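The "hashed encoding" suggestion refers to the hashing trick: hash feature names into a fixed number of buckets, so vectors (and therefore centroids) have a bounded width no matter how many distinct features the corpus contains. A minimal sketch of the idea, as my own illustration rather than Mahout's actual encoder classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hashing-trick sketch (illustrative only, not Mahout's encoders): each
// feature name is hashed into one of NUM_BUCKETS slots, fixing the vector
// width up front. Distinct features may collide in a bucket; with enough
// buckets the distortion is usually tolerable.
public class HashedEncoder {
    static final int NUM_BUCKETS = 1 << 20; // ~1M slots, chosen up front

    static int bucket(String feature) {
        // Math.floorMod keeps the index non-negative for negative hash codes
        return Math.floorMod(feature.hashCode(), NUM_BUCKETS);
    }

    // Encode a bag of features as a sparse map from bucket index to count
    static Map<Integer, Double> encode(Iterable<String> features) {
        Map<Integer, Double> v = new HashMap<>();
        for (String f : features) {
            v.merge(bucket(f), 1.0, Double::sum);
        }
        return v;
    }
}
```

With this encoding the centroid dimension is NUM_BUCKETS rather than the vocabulary size, so feature growth no longer grows per-node memory.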