Return-Path: X-Original-To: apmail-mahout-dev-archive@www.apache.org Delivered-To: apmail-mahout-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E3CF5F48F for ; Fri, 29 Mar 2013 02:30:50 +0000 (UTC) Received: (qmail 41358 invoked by uid 500); 29 Mar 2013 02:30:49 -0000 Delivered-To: apmail-mahout-dev-archive@mahout.apache.org Received: (qmail 41289 invoked by uid 500); 29 Mar 2013 02:30:49 -0000 Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mahout.apache.org Delivered-To: mailing list dev@mahout.apache.org Received: (qmail 41266 invoked by uid 99); 29 Mar 2013 02:30:48 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Mar 2013 02:30:48 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of ted.dunning@gmail.com designates 209.85.210.182 as permitted sender) Received: from [209.85.210.182] (HELO mail-ia0-f182.google.com) (209.85.210.182) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Mar 2013 02:30:44 +0000 Received: by mail-ia0-f182.google.com with SMTP id u8so128515iag.27 for ; Thu, 28 Mar 2013 19:30:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:mime-version:in-reply-to:references:from:date:message-id :subject:to:content-type; bh=MnTshpoFOX3CvKIRZcdiF1wXVUDgnVKXjrUhptgmDMc=; b=Jes/dNoSA7Qo4DxLqdx5Gsixuj8My2edX7a7TAMt88pO+uX+nkw39AChcWWcYLLcMK 85sbiB+FT/ojnGyk6vVdOmDIXI6UbiRAuRQhNJuKyYQvp+E4liAVf0WK/jYLuV29RVfE OmI5wTnab7yC57NE1a9mHiBVBiyLQWIOSCeiGkksY2LxQ3AsaHTPgnUd81rV09/Zn8VS kUHxFFFaHeELqCVaPZoytpHqoBYgpVqbUU3jAsxkqSexlu4EVlGNMJACGkwcLQmt9kjx IZy9dXJdlulnHPGQqdpBNPo7ObmrDSldqJUq/YYTleosXJQjvGAs09aTFSjot+0QOSsb VHqA== X-Received: by 10.50.57.166 with SMTP id j6mr634417igq.21.1364524223781; Thu, 28 Mar 2013 19:30:23 -0700 (PDT) MIME-Version: 1.0 Received: by 10.64.162.130 with HTTP; Thu, 28 Mar 2013 19:29:53 -0700 (PDT) In-Reply-To: References: From: Ted Dunning Date: Fri, 29 Mar 2013 03:29:53 +0100 Message-ID: Subject: Re: streaming k-means vs minibatch k-means To: dev@mahout.apache.org Content-Type: multipart/alternative; boundary=14dae934036fa5635f04d9070d70 X-Virus-Checked: Checked by ClamAV on apache.org --14dae934036fa5635f04d9070d70 Content-Type: text/plain; charset=UTF-8 I think (casually, informally) that the guarantees are preserved by simple ordering arguments. The argument goes that the threshold on partitions will grow less than if the partitions were handled sequentially. Thus the partial sketches should have at least as much fidelity than if the same segment were before other segments. Secondly, the total number of sketch clusters \kappa in M sketches across N points is \kappa > k log N for the sequential case and \kappa > M ( k log N/M ) = M k log N - M k log M But M << N so log N > log M. M >= 2 and M < sqrt(N) then \kappa > M ( k log N/M ) = M k log N - M k log M >= M/2 k log N > k log N if M = 1, the bound becomes equality (whew) This is a pretty reasonable constraint (i.e use less than 1000 mappers on 1e6 points and less than 10,000 mappers on 1e9 points). Thus the map-reduce sketch is commonly more detailed in both senses for common cases (smaller threshold, more sketch centroids). This should result in as good or better theoretic results. On Thu, Mar 28, 2013 at 10:30 PM, Andy Twigg wrote: > Dan/Ted: > > I like that you are implementing streaming k-means. > > Are there any results comparing it to mini batch k-means ([1] and the > paper cited therein) ? > > In the distributed implementation, you independently compute a > O(k)-means clustering on each partition, then combine them into a > final k-means. Are there any guarantees/results about the accuracy of > this? Clearly this sort of design also favours a storm/spark > implementation - have you considered that? > > -Andy > > > > [1] > http://scikit-learn.org/dev/modules/generated/sklearn.cluster.MiniBatchKMeans.html > --14dae934036fa5635f04d9070d70--