Subject: Re: Streaming kmeans question
From: Ted Dunning
Date: Mon, 28 Jul 2014 13:45:34 -0600
To: "user@mahout.apache.org"

I am traveling and it is difficult to get a real internet connection.

Here is an answer to one of your questions.

For very high-dimensional data, some kind of dimensionality reduction is usually important. The streaming k-means code does this by approximating the nearest centroid using a random projection.

Note that the output of the streaming step is *not* a set of initial centroids. Instead, it is a large number of centroids which are clustered as a surrogate for the original data. These centroids are much less numerous than the original data, so the final ball k-means can run in memory. This is very different from the canopy approach.

There is a known issue with the map-reduce version of the streaming k-means program that causes the number of centroids output by the parallel part of the algorithm to be too large.

> On Jul 28, 2014, at 3:08, Bojan Kostić wrote:
>
> Also, as I see it, this streaming kmeans is for large sets of data. Does
> "large" mean a large number of points and not dimensions? And what to do
> when the data have many dimensions? Like more than 1000000 dimensions.
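As a rough illustration of the two ideas in the reply above (approximating the nearest centroid with a random projection, and having the streaming step emit many weighted centroids that stand in for the data), here is a minimal sketch in plain Python. It is not Mahout's implementation. The names `approx_nearest`, `streaming_sketch`, `cutoff`, and `search_size` are hypothetical, and the simplifications (a single projection direction, a fixed distance cutoff) are assumptions made to keep the example short.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_unit_vector(dim, rng):
    # Random direction used to project points onto a line.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def approx_nearest(point, centroids, direction, search_size=3):
    """Approximate nearest-centroid search (hypothetical helper).

    Centroids are ranked by how close their 1-D projection is to the
    point's projection; the true distance is computed only for the
    best `search_size` candidates.
    """
    p = dot(point, direction)
    candidates = sorted(
        range(len(centroids)),
        key=lambda i: abs(dot(centroids[i][0], direction) - p),
    )[:search_size]
    best = min(candidates, key=lambda i: sq_dist(point, centroids[i][0]))
    return best, sq_dist(point, centroids[best][0])


def streaming_sketch(points, dim, cutoff, seed=0):
    """One pass over the data, returning a list of (mean, weight) pairs.

    Assumption: a fixed `cutoff` on squared distance. A point far from
    its (approximately) nearest centroid is likely to start a new
    centroid; a nearby point is merged into that centroid.
    """
    rng = random.Random(seed)
    direction = random_unit_vector(dim, rng)
    centroids = []
    for x in points:
        if not centroids:
            centroids.append((list(x), 1.0))
            continue
        i, d2 = approx_nearest(x, centroids, direction)
        if rng.random() < min(1.0, d2 / cutoff):
            centroids.append((list(x), 1.0))
        else:
            mean, w = centroids[i]
            # Weighted running mean keeps the centroid representative.
            merged = [(m * w + xi) / (w + 1.0) for m, xi in zip(mean, x)]
            centroids[i] = (merged, w + 1.0)
    return centroids
```

The weighted centroids returned by `streaming_sketch` are what a final in-memory clustering pass (ball k-means in the reply above) would consume in place of the original points. The weights always sum to the number of input points, so no data is discarded, only summarized.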