mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Date Fri, 31 Mar 2017 18:04:55 GMT
ps1 this assumes row-wise construction of A based on training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A it may make sense to make
sure it is committed to spark cache (by using checkpoint api), if spark is
used

On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:

> here is the outline. For details of APIs, please refer to samsara manual
> [2], i will not be be repeating it.
>
> Assume your training data input is m x n matrix A. For simplicity let's
> assume it's a DRM with int row keys, i.e., DrmLike[Int].
>
> Initialization:
>
> First, classic k-means starts by selecting initial clusters, by sampling
> them out. You can do that by using sampling api [1], thus forming a k x n
> in-memory matrix C (current centroids). C is therefore of Mahout's Matrix
> type.
>
> You the proceed by alternating between cluster assignments and
> recompupting centroid matrix C till convergence based on some test or
> simply limited by epoch count budget, your choice.
>
> Cluster assignments: here, we go over current generation of A and
> recompute centroid indexes for each row in A. Once we recompute index, we
> put it into the row key . You can do that by assigning centroid indices to
> keys of A using operator mapblock() (details in [2], [3], [4]). You also
> need to broadcast C in order to be able to access it in efficient manner
> inside mapblock() closure. Examples of that are plenty given in [2].
> Essentially, in mapblock, you'd reform the row keys to reflect cluster
> index in C. while going over A, you'd have a "nearest neighbor" problem to
> solve for the row of A and centroids C. This is the bulk of computation
> really, and there are a few tricks there that can speed this step up in
> both exact and approximate manner, but you can start with a naive search.
>
> Centroid recomputation:
> once you assigned centroids to the keys of marix A, you'd want to do an
> aggregating transpose of A to compute essentially average of row A grouped
> by the centroid key. The trick is to do a computation of (1|A)' which will
> results in a matrix of the shape (Counts/sums of cluster rows). This is the
> part i find difficult to explain without a latex graphics.
>
> In Samsara, construction of (1|A)' corresponds to DRM expression
>
> (1 cbind A).t (again, see [2]).
>
> So when you compute, say,
>
> B = (1 | A)',
>
> then B is (n+1) x k, so each column contains a vector corresponding to a
> cluster 1..k. In such column, the first element would be # of points in the
> cluster, and the rest of it would correspond to sum of all points. So in
> order to arrive to an updated matrix C, we need to collect B into memory,
> and slice out counters (first row) from the rest of it.
>
> So, to compute C:
>
> C <- B (2:,:) each row divided by B(1,:)
>
> (watch out for empty clusters with 0 elements, this will cause lack of
> convergence and NaNs in the newly computed C).
>
> This operation obviously uses subblocking and row-wise iteration over B,
> for which i am again making reference to [2].
>
>
> [1] https://github.com/apache/mahout/blob/master/math-scala/
> src/main/scala/org/apache/mahout/math/drm/package.scala#L149
>
> [2], Sasmara manual, a bit dated but viable, http://apache.github.
> io/mahout/doc/ScalaSparkBindings.html
>
> [3] scaladoc, again, dated but largely viable for the purpose of this
> exercise:
> http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
>
> [4] mapblock etc. http://apache.github.io/mahout/0.10.1/docs/mahout-
> math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
>
> On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
> h2016170@pilani.bits-pilani.ac.in> wrote:
>
>> @Dmitriycan you please again tell me the approach to move ahead.
>>
>>
>> Thanks
>> Parth Khatwani
>>
>>
>> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
>> h2016170@pilani.bits-pilani.ac.in> wrote:
>>
>> > yes i am unable to figure out the way ahead.
>> > Like how to create the augmented matrix A := (0|D) which you have
>> > mentioned.
>> >
>> >
>> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > wrote:
>> >
>> >> was my reply for your post on @user has been a bit confusing?
>> >>
>> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
>> >> h2016170@pilani.bits-pilani.ac.in> wrote:
>> >>
>> >> > Sir,
>> >> > I am trying to write the kmeans clustering algorithm using Mahout
>> >> Samsara
>> >> > but i am bit confused
>> >> > about how to leverage Distributed Row Matrix for the same. Can
>> anybody
>> >> help
>> >> > me with same.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > Thanks
>> >> > Parth Khatwani
>> >> >
>> >>
>> >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message