flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrian Bartnik <bart...@campus.tu-berlin.de>
Subject Bisecting K-Means - Working with intermediate results as DataSets
Date Sun, 28 Aug 2016 21:15:55 GMT
Hi,

I am working on implementing a variant of the k-means algorithm, namely 
Bisecting K-means [1].

The basic premise is to run the original k-means algorithm on 
increasingly smaller subsets of the original input data.
In each step of the outer loop, it splits the current cluster in 2 new 
smaller clusters and delete the corresponding parent cluster.

I am currently using a modified version of the existing k-means 
implementation from the Flink examples.

Pseudocode:

while currentClusterNumber < finalClusterNumber
     currentCluster = Pick current largest cluster
     for i = 1 to innerIterations
         Pick 2 random starting centroids
         Run k-means on currentCluster with centroids
         Store output and compute similarity of temporary result
     Pick the one innerIteration result with highest similarity from 
temporary results
     Replace currentCluster with the two smaller subsets

It all comes down to nested iterations, which are not supported by Flink 
at the moment.

Does anyone has experiences or workarounds to avoid this issue?

Best,
Adrian

----

[1] A Comparison of Document Clustering Techniques - 
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.9225

Mime
View raw message