Mailing-List: contact issues-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.apache.org
Date: Tue, 2 Jun 2015 02:12:17 +0000 (UTC)
From: "Sachin Goel (JIRA)" <jira@apache.org>
To: issues@flink.apache.org
Message-ID: <JIRA.12782878.1426688335000.89205.1433211137294@Atlassian.JIRA>
In-Reply-To: <JIRA.12782878.1426688335000@Atlassian.JIRA>
References: <JIRA.12782878.1426688335000@Atlassian.JIRA>
 <JIRA.12782878.1426688335720@arcas>
Subject: [jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to
 machine learning library
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568392#comment-14568392 ] 

Sachin Goel commented on FLINK-1731:
------------------------------------

I'm creating a separate issue for initialization schemes. It would cover the random, kmeans++ and kmeans|| initialization methods. Since any initialization is itself a solution to the kmeans problem, each scheme would also be an instance of Predictor. Users can access the learned centroids via instance.centroids and pass them to the KMeans algorithm that has already been implemented.
There is another possible approach, which spares the user from figuring out how to pass the initial centroids to KMeans. We can add a parameter that specifies which initialization scheme to use. The KMeans algorithm would then call the appropriate initialization scheme in its fit function and use the centroids that scheme finds as its initial centroids.
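
To make the two options concrete, here is a rough sketch in plain Python rather than against the flink-ml API. Everything in it is an assumption for illustration: the names random_init, kmeans_pp_init, INIT_SCHEMES, and the init / initial_centroids parameters are hypothetical, and kmeans|| is left out for brevity. It only shows how KMeans could either accept centroids produced by a separate initialization step or pick a scheme by name inside fit.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_init(points, k, rng):
    """Random scheme: pick k distinct input points uniformly at random."""
    return [tuple(p) for p in rng.sample(points, k)]


def kmeans_pp_init(points, k, rng):
    """kmeans++ seeding: draw each further centroid with probability
    proportional to its squared distance to the nearest chosen centroid."""
    centroids = [tuple(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(_sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            centroids.append(tuple(rng.choice(points)))
        else:
            centroids.append(tuple(rng.choices(points, weights=weights, k=1)[0]))
    return centroids


# Hypothetical registry mapping a parameter value to an initialization scheme.
INIT_SCHEMES = {"random": random_init, "kmeans++": kmeans_pp_init}


class KMeans:
    """Lloyd's iterations with a pluggable initialization (illustrative only)."""

    def __init__(self, k, init="random", initial_centroids=None,
                 max_iter=20, seed=0):
        self.k = k
        self.init = init
        self.initial_centroids = initial_centroids
        self.max_iter = max_iter
        self.seed = seed
        self.centroids = None

    def fit(self, points):
        rng = random.Random(self.seed)
        if self.initial_centroids is not None:
            # Option 1: the user passes centroids obtained elsewhere.
            centroids = [tuple(c) for c in self.initial_centroids]
        else:
            # Option 2: fit calls the scheme selected by the parameter.
            centroids = INIT_SCHEMES[self.init](points, self.k, rng)
        for _ in range(self.max_iter):
            clusters = [[] for _ in centroids]
            for p in points:
                nearest = min(range(len(centroids)),
                              key=lambda i: _sq_dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Recompute each centroid as the mean of its cluster; keep the
            # old centroid if the cluster is empty.
            new_centroids = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
                for cl, c in zip(clusters, centroids)
            ]
            if new_centroids == centroids:
                break
            centroids = new_centroids
        self.centroids = centroids
        return self


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Option 1: run the initialization separately and hand its centroids to KMeans.
seeds = kmeans_pp_init(points, 2, random.Random(42))
model_a = KMeans(k=2, initial_centroids=seeds).fit(points)

# Option 2: let KMeans choose the scheme through a parameter.
model_b = KMeans(k=2, init="kmeans++").fit(points)
```

With the second option the user never handles the initial centroids, while the first keeps each initialization usable on its own. Both could coexist, as in the sketch, if an explicit set of centroids takes precedence over the parameter.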

> Add kMeans clustering algorithm to machine learning library
> -----------------------------------------------------------
>
>                 Key: FLINK-1731
>                 URL: https://issues.apache.org/jira/browse/FLINK-1731
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Peter Schrott
>              Labels: ML
>
> The Flink repository already contains a kMeans implementation but it is not yet ported to the machine learning library. I assume that only the used data types have to be adapted and then it can be more or less directly moved to flink-ml.
> The kMeans++ [1] and the kMeans|| [2] algorithms constitute a better implementation because they improve the initial seeding phase to achieve near-optimal clustering. It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)