spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (JIRA)" <>
Subject [jira] [Commented] (SPARK-17389) KMeans speedup with better choice of k-means|| init steps = 2
Date Sun, 04 Sep 2016 11:51:20 GMT


Apache Spark commented on SPARK-17389:

User 'srowen' has created a pull request for this issue:

> KMeans speedup with better choice of k-means|| init steps = 2
> -------------------------------------------------------------
>                 Key: SPARK-17389
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.0.0
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
> As reported in
KMeans can be surprisingly slow, and it's easy to see that most of the time spent is in kmeans||
initialization. For example, in this simple example...
> {code}
> import org.apache.spark.mllib.random.RandomRDDs
> import org.apache.spark.mllib.clustering.KMeans
> val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
> data.count()
> new KMeans().setK(1000).setMaxIterations(5).run(data)
> {code}
> Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init takes about
as long as 24 iterations, which is a typical run, meaning half the time is just in picking
cluster centers. This seems excessive.
> There are two ways to speed this up significantly. First, the implementation has an old
"runs" parameter that is always 1 now. It used to allow multiple clusterings to be computed
at once. The code can be simplified significantly now that runs=1 always. This is already
covered by SPARK-11560, but just a simple refactoring results in about a 13% init speedup,
from 5:54 to 5:09 in this example. That's not what this change is about though.
> By default, k-means|| makes 5 passes over the data. The original paper at
actually shows that 2 is plenty, certainly when l=2k as is the case in our implementation.
(See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but it's not suggesting
5 is an optimal value.) Defaulting to 2 brings it down to 1:41 -- much improved over 5:54.
> Lastly, small thing, but the code will perform a local k-means++ step to reduce the number
of centers to k even if there are already only <= k centers. This can be short-circuited.
However this is really the topic of SPARK-3261 because this can cause fewer than k clusters
to be returned where that would actually be correct, too.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message