spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
Date Mon, 21 Nov 2016 14:19:58 GMT


Apache Spark reassigned SPARK-18356:

    Assignee: Apache Spark

> Issue + Resolution: Kmeans Spark Performances (ML package)
> ----------------------------------------------------------
>                 Key: SPARK-18356
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: zakaria hili
>            Assignee: Apache Spark
>            Priority: Minor
>              Labels: easyfix
> Hello,
> I'm new to Spark, but I think I have found a small problem that can affect the performance of Spark KMeans.
> Before explaining the problem, I want to describe the warning that I ran into.
> I tried to use Spark KMeans with DataFrames to cluster my data:
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k<=max_cluster) and (wssse > seuilStop):
>     kmeans = KMeans().setK(k)
>     model =
>     wssse = model.computeCost(df_Part)
>     k=k+1
> But when I run the code, I receive this warning:
> WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
> I searched the Spark source code for the origin of this warning and found that two classes are involved:
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
> Even when my DataFrame is cached, the fit method converts it internally into an RDD, and that RDD is not cached.
> Dataframe -> rdd -> run Training Kmeans Algo(rdd)
> -> The ml package class converts the DataFrame into an RDD and then calls the KMeans algorithm.
> -> The mllib package class implements the KMeans algorithm; it checks whether the RDD is cached and logs the warning if it is not.
> So the solution is to cache the RDD before running the KMeans algorithm.
> All we need is to add two lines:
> cache the RDD right after the DataFrame conversion, then unpersist it after training.
> I hope that I was clear.
> If you think that I was wrong, please let me know.
> Sincerely,
> Zakaria HILI
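
The fix proposed in the quoted report (cache the intermediate RDD just before training, then release it afterwards) can be sketched roughly as below. This is only an illustration in Python, not the actual Spark patch, which would live in the Scala sources named in the report. The names train_with_cached_input and train are hypothetical; the sketch assumes an object that exposes is_cached, persist() and unpersist() the way a PySpark RDD does.

```python
def train_with_cached_input(rdd, train):
    """Run train(rdd), caching rdd for the duration if it is not cached yet.

    `rdd` is assumed to behave like a PySpark RDD (is_cached, persist,
    unpersist); `train` is any hypothetical callable that consumes the RDD
    and returns a model.
    """
    # Only take ownership of persistence when the caller has not cached the RDD.
    handle_persistence = not rdd.is_cached
    if handle_persistence:
        rdd.persist()
    try:
        return train(rdd)
    finally:
        # Release the cache only if this helper created it, even when
        # training raises an exception.
        if handle_persistence:
            rdd.unpersist()
```

Checking is_cached first avoids unpersisting data that the caller cached on purpose, which matters when the same input is reused across several values of k as in the loop from the report.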

This message was sent by Atlassian JIRA
