From: Marc Hadfield
Reply-To: marc@alitora.com
To: Ted Dunning
Cc: user@mahout.apache.org
Date: Mon, 7 Feb 2011 14:18:24 -0500
Subject: Re: Memory Issue with KMeans clustering

Great, thanks for the info!

On Mon, Feb 7, 2011 at 2:12 PM, Ted Dunning wrote:

> On Mon, Feb 7, 2011 at 10:43 AM, Marc Hadfield wrote:
>
>> In the case outlined below, does that mean each node of a Hadoop cluster
>> would need to have the centroid information fully in memory for k-means,
>> or is this spread over the cluster in some way?
>
> Yes. Every node needs every centroid in memory.
>
>> If each node has to have the centroid information fully in memory, are
>> there any other data structures that need to be fully in memory on each
>> node, and if so, what are they proportional to (again, specifically for
>> k-means)? I.e., is anything memory-resident related to the number of
>> documents?
>
> No. Just centroids. Of course, if you have sparse centroids, then the
> number of non-zero elements will increase roughly with the log of the
> number of documents, but if you have space for the dense version of the
> centroids, then nothing should scale with the number of documents.
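To put rough numbers on the "every node needs every centroid in memory" point, here is a back-of-the-envelope sizing sketch. This is not Mahout code, and the cluster count and vocabulary size are made-up illustrative values:

```java
// Sizing sketch (illustrative, not Mahout code): dense centroids are
// k x d doubles, and every Hadoop node holds a full copy of all of them,
// regardless of how many documents are being clustered.
public class CentroidMemoryEstimate {
    static long denseCentroidBytes(long k, long d) {
        return k * d * Double.BYTES; // 8 bytes per vector component
    }

    public static void main(String[] args) {
        // e.g. 1,000 clusters over a 100,000-term vocabulary (hypothetical)
        long bytes = denseCentroidBytes(1_000, 100_000);
        System.out.println(bytes / (1024 * 1024) + " MB per node"); // prints "762 MB per node"
    }
}
```

The footprint depends only on k and the feature dimension d, never on the document count, which is why document throughput scales by adding nodes.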
>> If the centroid information (dependent on the number of features and
>> clusters) needs to be fully in memory on all Hadoop nodes, but nothing
>> related to the number of documents does, then the k-means algorithm would
>> be scalable in the number of documents (just add more Hadoop nodes to
>> increase document throughput), but *not* scalable in the number of
>> clusters / features, since the algorithm requires a full copy of this
>> information on each node. Is this accurate?
>
> Yes.
>
> Scalability in the number of features can be achieved by using a hashed
> encoding.
>
> Scalability in the number of centroids can be achieved by changing the
> code a bit so that the centroid sets are spread across several nodes.
> That would require some cleverness in the input format so that each split
> is sent to several nodes. An alternative would be to add an extra
> map-reduce step where the first reduce is where the partial classification
> is done.
>
> My guess is that scaling the number of centroids isn't a great idea
> beyond a moderate size, because k-means will break down. Better to do
> hierarchical clustering to get very fine distinctions. That should be
> doable in a much more scalable way.
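The "hashed encoding" suggestion refers to the hashing trick: hash feature names into a fixed number of buckets, so vectors (and therefore centroids) have a bounded width no matter how many distinct features the corpus contains. A minimal sketch of the idea, as my own illustration rather than Mahout's actual encoder classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hashing-trick sketch (illustrative only, not Mahout's encoders): each
// feature name is hashed into one of NUM_BUCKETS slots, fixing the vector
// width up front. Distinct features may collide in a bucket; with enough
// buckets the distortion is usually tolerable.
public class HashedEncoder {
    static final int NUM_BUCKETS = 1 << 20; // ~1M slots, chosen up front

    static int bucket(String feature) {
        // Math.floorMod keeps the index non-negative for negative hash codes
        return Math.floorMod(feature.hashCode(), NUM_BUCKETS);
    }

    // Encode a bag of features as a sparse map from bucket index to count
    static Map<Integer, Double> encode(Iterable<String> features) {
        Map<Integer, Double> v = new HashMap<>();
        for (String f : features) {
            v.merge(bucket(f), 1.0, Double::sum);
        }
        return v;
    }
}
```

With this encoding the centroid dimension is NUM_BUCKETS rather than the vocabulary size, so feature growth no longer grows per-node memory.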