spark-issues mailing list archives

From "Yu Ishikawa (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans
Date Thu, 30 Oct 2014 15:02:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190166#comment-14190166
] 

Yu Ishikawa edited comment on SPARK-2429 at 10/30/14 3:02 PM:
--------------------------------------------------------------

I compared the training and predicting elapsed times of the hierarchical clustering with those
of k-means.
In theory, the computational complexity of hierarchical clustering assignment is smaller
than that of k-means.
However, not only the training time but also the predicting time of the hierarchical clustering
is slower than that of k-means.

I used the program at the URL below for this experiment.
https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/37488e306d583d0e1743bff432165e8c1bf4465e/src/main/scala/CompareWithKMeansApp.scala
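The complexity claim above can be illustrated with a small sketch (plain Python here for brevity; this is not the attached Scala program, and the binary-tree shape is an assumption). Flat k-means assignment computes a distance to every one of the k centers, O(k*d) per point, while descending a binary cluster tree compares only the two child centers at each of ~log2(k) levels, O(d*log k) per point:

```python
def dist2(a, b):
    # squared Euclidean distance between two points given as tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_assign(point, centers):
    # k-means assignment: scan all k centers, return the index of the closest
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))

class Node:
    """A node of the binary cluster tree; leaves carry a cluster id."""
    def __init__(self, center, left=None, right=None, leaf_id=None):
        self.center, self.left, self.right, self.leaf_id = center, left, right, leaf_id

def tree_assign(point, node):
    # hierarchical assignment: descend toward the closer child center,
    # touching only ~2*log2(k) centers instead of all k
    while node.leaf_id is None:
        node = node.left if dist2(point, node.left.center) <= dist2(point, node.right.center) else node.right
    return node.leaf_id
```

For well-separated centers both functions pick the same cluster, but tree_assign performs logarithmically fewer distance computations, which is why the theoretical assignment cost is lower even though the measured times above say otherwise.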

h3. Spark Cluster Specification

I ran it on EC2 with the following specification.

- Master Instance Type: r3.large
- Slave Instance Type: r3.8xlarge
-- Cores: 32
-- Memory: 244GB
- # of Slaves: 5
-- Total Cores: 160
-- Total Memory: 1220GB

h3. The Performance Result

{noformat}
{"maxCores" : "160", "numClusters" : "50", "dimension" : "500", "rows" : "1000000", "numPartitions"
: "160"}
KMeans Training Elapsed Time: 28.179 [sec]
KMeans Predicting Elapsed Time: 0.011 [sec]
Hierarchical Training Elapsed Time: 46.539 [sec]
Hierarchical Predicting Elapsed Time: 0.3076923076923077 [sec]

{"maxCores" : "160", "numClusters" : "50", "dimension" : "500", "rows" : "5000000", "numPartitions"
: "160"}
KMeans Training Elapsed Time: 55.187 [sec]
KMeans Predicting Elapsed Time: 0.008 [sec]
Hierarchical Training Elapsed Time: 210.238 [sec]
Hierarchical Predicting Elapsed Time: 0.3906093906093906 [sec]
{noformat}





> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary.
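The first suggested approach, top-down recursive application of KMeans, can be sketched briefly (a hypothetical plain-Python illustration, often called bisecting or divisive clustering; this is not the MLlib implementation, and the helper names are made up):

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points given as tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans2(points, iters=10, seed=0):
    """Lloyd's algorithm with k=2: split one cluster into two groups."""
    centers = random.Random(seed).sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1].append(p)
        # recompute each center as the mean of its group (keep old center if empty)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

def divisive(points, max_clusters):
    """Repeatedly bisect the largest cluster until max_clusters is reached."""
    clusters = [list(points)]
    while len(clusters) < max_clusters:
        clusters.sort(key=len, reverse=True)
        biggest = clusters.pop(0)
        if len(biggest) < 2:          # cannot split a singleton further
            clusters.append(biggest)
            break
        clusters.extend(g for g in kmeans2(biggest) if g)
    return clusters
```

Each bisection also yields one level of the cluster tree, which is what makes the fast tree-based assignment possible.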



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
