Date: Tue, 2 Jun 2015 02:12:17 +0000 (UTC)
From: "Sachin Goel (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

    [ https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568392#comment-14568392 ]

Sachin Goel commented on FLINK-1731:
------------------------------------

I'm creating a separate issue for initialization schemes. This would address the Random, kmeans++ and kmeans|| initialization methods. Since any initialization is itself a solution to the kmeans problem, each scheme would also be an instance of Predictor. Users can access the learned centroids via instance.centroids and pass them to the KMeans algorithm that has already been implemented.
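
To make the first option concrete, here is a minimal sketch in plain Python rather than flink-ml code. The function names (random_init, kmeans_pp_init, kmeans) are hypothetical and chosen only for illustration; kmeans|| is left out. It assumes the input holds at least k distinct points. Each initializer returns centroids that the caller passes to the iterative algorithm by hand.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def random_init(points, k, rng):
    """Random scheme: pick k distinct input points uniformly at random."""
    return rng.sample(points, k)

def kmeans_pp_init(points, k, rng):
    """kmeans++ seeding: draw each further centroid with probability
    proportional to its squared distance from the nearest centroid so far.
    Assumes at least k distinct points, otherwise all weights become zero."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def kmeans(points, initial_centroids, iterations=10):
    """Lloyd's iterations starting from centroids supplied by the caller."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeans_pp_init(points, 2, random.Random(0))  # step 1: initialization
centroids = kmeans(points, seeds)                    # step 2: user passes seeds on
```

With this design the user is responsible for wiring the two steps together, which is the burden the alternative approach would remove.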

There is another possible approach, one that takes the burden of passing the initial centroids to KMeans off the user. We can add a parameter that specifies which initialization scheme to use. The KMeans algorithm would then call the appropriate initialization scheme in its fit function and use the centroids it finds as its initial centroids.

> Add kMeans clustering algorithm to machine learning library
> -----------------------------------------------------------
>
>                 Key: FLINK-1731
>                 URL: https://issues.apache.org/jira/browse/FLINK-1731
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Peter Schrott
>              Labels: ML
>
> The Flink repository already contains a kMeans implementation, but it has not yet been ported to the machine learning library. I assume that only the data types used have to be adapted, and then it can be moved more or less directly to flink-ml.
> The kMeans++ [1] and kMeans|| [2] algorithms constitute a better implementation because they improve the initial seeding phase to achieve near-optimal clustering. It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
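
The second option from the comment, a parameter that selects the initialization scheme inside fit, could look roughly like the sketch below. This is plain Python for illustration only, not the flink-ml API: the class KMeans, its init parameter, the INIT_SCHEMES registry, and the scheme names "random" and "kmeans++" are all assumptions, and kmeans|| is not implemented here. It assumes the input holds at least k distinct points.

```python
import random

def _sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _random_seeds(points, k, rng):
    """Random scheme: k distinct input points chosen uniformly."""
    return rng.sample(points, k)

def _kmeans_pp_seeds(points, k, rng):
    """kmeans++ scheme: weight candidates by squared distance to nearest seed."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(_sq_dist(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

# Hypothetical registry mapping the parameter value to a seeding function.
INIT_SCHEMES = {"random": _random_seeds, "kmeans++": _kmeans_pp_seeds}

class KMeans:
    def __init__(self, k, init="kmeans++", iterations=10, seed=0):
        if init not in INIT_SCHEMES:
            raise ValueError(f"unknown initialization scheme: {init!r}")
        self.k, self.init, self.iterations = k, init, iterations
        self.rng = random.Random(seed)
        self.centroids = None

    def fit(self, points):
        # fit itself calls the scheme selected by the `init` parameter.
        centroids = INIT_SCHEMES[self.init](points, self.k, self.rng)
        for _ in range(self.iterations):
            # Assign every point to its nearest centroid.
            clusters = [[] for _ in centroids]
            for p in points:
                i = min(range(self.k), key=lambda j: _sq_dist(p, centroids[j]))
                clusters[i].append(p)
            # Move each centroid to its cluster mean; keep it if the cluster is empty.
            centroids = [
                tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
        self.centroids = centroids
        return self

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
model = KMeans(k=2, init="random").fit(points)
```

Here the user only names a scheme; the seeding call is hidden inside fit, at the cost of KMeans needing to know about every available scheme.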