Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 86FC2200C57 for ; Fri, 31 Mar 2017 20:05:01 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 85A5C160B80; Fri, 31 Mar 2017 18:05:01 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 8F482160B7C for ; Fri, 31 Mar 2017 20:05:00 +0200 (CEST) Received: (qmail 62826 invoked by uid 500); 31 Mar 2017 18:04:59 -0000 Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mahout.apache.org Delivered-To: mailing list dev@mahout.apache.org Received: (qmail 62813 invoked by uid 99); 31 Mar 2017 18:04:59 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 31 Mar 2017 18:04:59 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id D33EFC151A for ; Fri, 31 Mar 2017 18:04:58 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.629 X-Spam-Level: ** X-Spam-Status: No, score=2.629 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id AQvBc5ArVz6P for ; Fri, 31 Mar 2017 18:04:56 +0000 (UTC) Received: from mail-vk0-f49.google.com (mail-vk0-f49.google.com [209.85.213.49]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 4B3955F58E for ; Fri, 31 Mar 2017 18:04:56 +0000 (UTC) Received: by mail-vk0-f49.google.com with SMTP id d188so99425190vka.0 for ; Fri, 31 Mar 2017 11:04:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=Or5tCu8VXJ6Qdvl0Io0AXsUdeo9jiXNcIJUUNIc5fOs=; b=SGVB3G7H+C9nLvKwGgpeqZGAEuHF/S3Il3h5yNtRNUh2NMOt4yyQrhO7vS2xj6TZLp FpIYhVxAR3kT/0Fm5whLqk32DaH4VyHcK94iTM6rIu09pCE/BBVq+DDOQ6rYUlA+H707 Jv4nBlkJ6hOEy0KALchxi3ss5IJ9zRr2JOElbPe0Db/OAuJ31C1yeyk+36imUFqDbuxu R+bcJ2Q4AqvCP3GNhAEEdz7zXsljkZ5RchDjJ0kgaBGY45g6Px0Lo1qpLqGiFS25b3tN 7CJqJr+L+5AyQ8ZhvzT8pn5WUX81sWsfR6uxuhid/Hf1Wcy25CAs/m4wVeXXtOjpulZ+ Tj7A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=Or5tCu8VXJ6Qdvl0Io0AXsUdeo9jiXNcIJUUNIc5fOs=; b=OIf5yzuQXnwGGja3uWdwkcUrkuYUN24I8jqd33bUx3y8/5+/Q0ryd6cdx4I42N2v7o pQ3Ew/LrXH93fqvM84/XFwz55QzAfrdfr/iRby/mnRurRJKNzAd+J8mY9rpT1kOnoJDK PXW3lUVscSNcN99Ju989Xl/zPtvCk8lCRsOzzIfkXZqcrcsI8Oe0nLdTYwagtYQ7FPS8 +z/92Jomb4ExibBi8TtwfMDySygJj02LdlzCMeYWf02V1QHao42CdrofbH3lp+1Lg88f +vNUPxwa6RppmPWMGqFbfmDy+Gl67DIKS2vWKrZpq3XWjO8OqnJNGTLOs6IBBlIoMrdc Tjjg== X-Gm-Message-State: AFeK/H3aY+6e5n3aPHVYtRwVf7N4sutsXcKIQeU8Xww1xLupCNxqzpddonOeF5ux3Jy9PwExLuYWt1aamjx2Gg== X-Received: by 10.159.35.81 with SMTP id 75mr1805194uae.27.1490983495605; Fri, 31 Mar 2017 11:04:55 -0700 (PDT) MIME-Version: 1.0 Received: by 10.176.24.76 with HTTP; Fri, 31 Mar 2017 11:04:55 -0700 (PDT) In-Reply-To: References: From: Dmitriy Lyubimov Date: Fri, 31 Mar 2017 11:04:55 -0700 Message-ID: Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara" To: "dev@mahout.apache.org" Content-Type: multipart/alternative; boundary=001a113ac9ea9f261e054c0aa5ef archived-at: Fri, 31 Mar 2017 18:05:01 -0000 --001a113ac9ea9f261e054c0aa5ef Content-Type: text/plain; charset=UTF-8 ps1 this assumes row-wise construction of A based on training set of m n-dimensional points. ps2 since we are doing multiple passes over A it may make sense to make sure it is committed to spark cache (by using checkpoint api), if spark is used On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov wrote: > here is the outline. For details of APIs, please refer to samsara manual > [2], i will not be be repeating it. > > Assume your training data input is m x n matrix A. For simplicity let's > assume it's a DRM with int row keys, i.e., DrmLike[Int]. > > Initialization: > > First, classic k-means starts by selecting initial clusters, by sampling > them out. You can do that by using sampling api [1], thus forming a k x n > in-memory matrix C (current centroids). C is therefore of Mahout's Matrix > type. > > You the proceed by alternating between cluster assignments and > recompupting centroid matrix C till convergence based on some test or > simply limited by epoch count budget, your choice. > > Cluster assignments: here, we go over current generation of A and > recompute centroid indexes for each row in A. Once we recompute index, we > put it into the row key . You can do that by assigning centroid indices to > keys of A using operator mapblock() (details in [2], [3], [4]). You also > need to broadcast C in order to be able to access it in efficient manner > inside mapblock() closure. Examples of that are plenty given in [2]. > Essentially, in mapblock, you'd reform the row keys to reflect cluster > index in C. while going over A, you'd have a "nearest neighbor" problem to > solve for the row of A and centroids C. This is the bulk of computation > really, and there are a few tricks there that can speed this step up in > both exact and approximate manner, but you can start with a naive search. > > Centroid recomputation: > once you assigned centroids to the keys of marix A, you'd want to do an > aggregating transpose of A to compute essentially average of row A grouped > by the centroid key. The trick is to do a computation of (1|A)' which will > results in a matrix of the shape (Counts/sums of cluster rows). This is the > part i find difficult to explain without a latex graphics. > > In Samsara, construction of (1|A)' corresponds to DRM expression > > (1 cbind A).t (again, see [2]). > > So when you compute, say, > > B = (1 | A)', > > then B is (n+1) x k, so each column contains a vector corresponding to a > cluster 1..k. In such column, the first element would be # of points in the > cluster, and the rest of it would correspond to sum of all points. So in > order to arrive to an updated matrix C, we need to collect B into memory, > and slice out counters (first row) from the rest of it. > > So, to compute C: > > C <- B (2:,:) each row divided by B(1,:) > > (watch out for empty clusters with 0 elements, this will cause lack of > convergence and NaNs in the newly computed C). > > This operation obviously uses subblocking and row-wise iteration over B, > for which i am again making reference to [2]. > > > [1] https://github.com/apache/mahout/blob/master/math-scala/ > src/main/scala/org/apache/mahout/math/drm/package.scala#L149 > > [2], Sasmara manual, a bit dated but viable, http://apache.github. > io/mahout/doc/ScalaSparkBindings.html > > [3] scaladoc, again, dated but largely viable for the purpose of this > exercise: > http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm > > [4] mapblock etc. http://apache.github.io/mahout/0.10.1/docs/mahout- > math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps > > On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT < > h2016170@pilani.bits-pilani.ac.in> wrote: > >> @Dmitriycan you please again tell me the approach to move ahead. >> >> >> Thanks >> Parth Khatwani >> >> >> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT < >> h2016170@pilani.bits-pilani.ac.in> wrote: >> >> > yes i am unable to figure out the way ahead. >> > Like how to create the augmented matrix A := (0|D) which you have >> > mentioned. >> > >> > >> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov >> > wrote: >> > >> >> was my reply for your post on @user has been a bit confusing? >> >> >> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT < >> >> h2016170@pilani.bits-pilani.ac.in> wrote: >> >> >> >> > Sir, >> >> > I am trying to write the kmeans clustering algorithm using Mahout >> >> Samsara >> >> > but i am bit confused >> >> > about how to leverage Distributed Row Matrix for the same. Can >> anybody >> >> help >> >> > me with same. >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > Thanks >> >> > Parth Khatwani >> >> > >> >> >> > >> > >> > > --001a113ac9ea9f261e054c0aa5ef--