flink-issues mailing list archives

From "Sachin Goel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library
Date Tue, 02 Jun 2015 02:12:17 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568392#comment-14568392 ]

Sachin Goel commented on FLINK-1731:

I'm creating a separate issue for initialization schemes. It would cover the Random, kmeans++
and kmeans|| initialization methods. Since the result of any initialization is itself a solution
to the kmeans problem, each scheme would also be an instance of Predictor. The user can access the
learned centroids via instance.centroids and pass them to the KMeans algorithm, which has already
been implemented.
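
As a rough illustration of this first option, here is a minimal sketch in plain Python. It is
not the flink-ml Scala API: CentroidModel, RandomInit, KMeansPlusPlusInit and lloyd are
hypothetical names made up for the example, kmeans|| is left out, and the kmeans++ seeding
assumes the data contains at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidModel:
    """Hypothetical stand-in for a Predictor: anything with centroids can predict."""

    centroids = None

    def predict(self, point):
        # Index of the centroid nearest to the point.
        return min(range(len(self.centroids)),
                   key=lambda i: sq_dist(point, self.centroids[i]))


class RandomInit(CentroidModel):
    def __init__(self, k, seed=0):
        self.k = k
        self.rng = random.Random(seed)

    def fit(self, points):
        # k distinct input points, chosen uniformly at random.
        self.centroids = self.rng.sample(points, self.k)
        return self


class KMeansPlusPlusInit(CentroidModel):
    def __init__(self, k, seed=0):
        self.k = k
        self.rng = random.Random(seed)

    def fit(self, points):
        # Assumption: points holds at least k distinct values.
        centroids = [self.rng.choice(points)]
        while len(centroids) < self.k:
            # Weight each point by its squared distance to the nearest chosen centroid.
            weights = [min(sq_dist(p, c) for c in centroids) for p in points]
            centroids.append(self.rng.choices(points, weights=weights, k=1)[0])
        self.centroids = centroids
        return self


def lloyd(points, initial_centroids, iterations=10):
    """Plain Lloyd iterations starting from user-supplied centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids
```

The user would then wire the two steps together by hand, for example
lloyd(points, KMeansPlusPlusInit(k=2).fit(points).centroids).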

There is another possible approach, which spares the user from figuring out how to pass
the initial centroids to KMeans. We can add a parameter that specifies which initialization
scheme to use. The KMeans algorithm would then call the appropriate initialization
scheme in its fit function and use the centroids that scheme returns
as its initial centroids.
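
A minimal sketch of this second option, again in plain Python rather than the flink-ml Scala
API. The init parameter, the INIT_SCHEMES table and the helper functions are hypothetical
names chosen for the example. kmeans|| is omitted, and the kmeans++ seeding assumes at least
k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_init(points, k, rng):
    # k distinct input points, chosen uniformly at random.
    return rng.sample(points, k)


def kmeans_pp_init(points, k, rng):
    # Assumption: points holds at least k distinct values.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid.
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical lookup table from parameter value to initialization scheme.
INIT_SCHEMES = {"random": random_init, "kmeans++": kmeans_pp_init}


class KMeans:
    def __init__(self, k, init="random", iterations=10, seed=0):
        if init not in INIT_SCHEMES:
            raise ValueError("unknown initialization scheme: %r" % (init,))
        self.k = k
        self.init = init
        self.iterations = iterations
        self.seed = seed
        self.centroids = None

    def fit(self, points):
        rng = random.Random(self.seed)
        # fit picks the initialization scheme itself; the user passes no centroids.
        centroids = [tuple(c) for c in INIT_SCHEMES[self.init](points, self.k, rng)]
        for _ in range(self.iterations):
            clusters = [[] for _ in centroids]
            for p in points:
                nearest = min(range(len(centroids)),
                              key=lambda i: sq_dist(p, centroids[i]))
                clusters[nearest].append(p)
            # An empty cluster keeps its previous centroid.
            centroids = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
                for cl, c in zip(clusters, centroids)
            ]
        self.centroids = centroids
        return self
```

With this shape the user only writes KMeans(k=2, init="kmeans++").fit(points), and adding a
new scheme means adding one entry to the table.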

> Add kMeans clustering algorithm to machine learning library
> -----------------------------------------------------------
>                 Key: FLINK-1731
>                 URL: https://issues.apache.org/jira/browse/FLINK-1731
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Peter Schrott
>              Labels: ML
> The Flink repository already contains a kMeans implementation but it is not yet ported
> to the machine learning library. I assume that only the used data types have to be adapted
> and then it can be more or less directly moved to flink-ml.
> The kMeans++ [1] and the kMeans|| [2] algorithm constitute a better implementation because
> they improve the initial seeding phase to achieve near-optimal clustering. It might be worthwhile
> to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

This message was sent by Atlassian JIRA
