commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gilles (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MATH-1465) Increase the initial sampling speed of KMeansPlusPlusClusterer for large k
Date Fri, 10 Aug 2018 18:40:00 GMT

    [ https://issues.apache.org/jira/browse/MATH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576710#comment-16576710
] 

Gilles commented on MATH-1465:
------------------------------

If you intend to implement this, don't forget to use the [development version of the code
(git "master" branch)|https://git1-us-west.apache.org/repos/asf?p=commons-math.git;a=tree].

> Increase the initial sampling speed of KMeansPlusPlusClusterer for large k
> --------------------------------------------------------------------------
>
>                 Key: MATH-1465
>                 URL: https://issues.apache.org/jira/browse/MATH-1465
>             Project: Commons Math
>          Issue Type: Improvement
>    Affects Versions: 3.6.1
>            Reporter: yu zhang
>            Priority: Minor
>              Labels: performance
>   Original Estimate: 1,440h
>  Remaining Estimate: 1,440h
>
> As the major difference of KMeans++ to classic Kmeans, an initial distance square sampling
process is executed, the function is named "chooseInitialCenters" in the current implementation.
The time complexity is O(dkn), where d is the dimension, k is the cluster number and n is
the number of points.
> In paper "_Ackermann, Lammersen, Märtens, Raupach, Sohler, Swierkot. StreamKM++: A Clustering
Algorithm for Data Streams. In Proceedings of the 12th Workshop on Algorithm Engineering and
Experiments (ALENEX 2010)_", the authors introduced a data structure named "coreset tree"
which can reduce the complexity of the initial sampling to O(dn log k). This is useful in
the scenario when value of k is large, say 1000, then the run speed of the algorithm will
be much faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message