spark-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24030) SparkSQL percentile_approx function is too slow for over 1,060,000 records.
Date Mon, 23 Apr 2018 06:36:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447626#comment-16447626 ]

Takeshi Yamamuro commented on SPARK-24030:
------------------------------------------

I quickly tried this on master and v2.3.0, but I couldn't reproduce it:
{code:java}
./bin/spark-shell --master=local[1] --conf spark.driver.memory=4g --conf spark.sql.shuffle.partitions=1 -v

scala> :paste
def timer[R](f: => R): Unit = {
  val count = 5
  val iters = (0 until count).map { i =>
    val t0 = System.nanoTime()
    f
    val t1 = System.nanoTime()
    val elapsed = t1 - t0 + 0.0
    println(s"#$i: ${elapsed / 1000000000.0}")
    elapsed
  }
  println("Avg. Elapsed Time: " + ((iters.sum / count) / 1000000000.0) + "s")
}

scala> timer { spark.range(1060000).selectExpr("percentile_approx(id, 0.5)").collect }
#0: 4.405557999
#1: 0.40483767
#2: 0.407931124
#3: 0.424493487
#4: 0.386281957
Avg. Elapsed Time: 1.2058204474s

scala> timer { spark.range(1040000).selectExpr("percentile_approx(id, 0.5)").collect }
#0: 4.560478621
#1: 0.387799115
#2: 0.38196225
#3: 0.377551809
#4: 0.390596532
Avg. Elapsed Time: 1.2196776654s
{code}
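
Note that in both runs above the first iteration takes about 4.4-4.6 s while the remaining four take roughly 0.4 s, so the printed average is dominated by warm-up. A minimal sketch of a timer that reports warm-up and steady-state runs separately could look like the following. `WarmupTimer`, `Timing`, `warmup`, and `runs` are hypothetical names introduced here for illustration; they are not part of the original comment or of Spark.

```scala
// Hypothetical helper (not from the original comment): times a block several
// times and keeps the warm-up iterations apart from the measured ones.
object WarmupTimer {
  final case class Timing(warmupSecs: Seq[Double], steadySecs: Seq[Double]) {
    // Mean over the steady-state runs only; NaN if there are none.
    def steadyAvgSecs: Double =
      if (steadySecs.isEmpty) Double.NaN else steadySecs.sum / steadySecs.size
  }

  // Evaluates `f` (warmup + runs) times and records each wall-clock
  // duration in seconds. The first `warmup` durations are reported separately.
  def time[R](warmup: Int, runs: Int)(f: => R): Timing = {
    require(warmup >= 0 && runs > 0, "warmup must be >= 0 and runs > 0")
    val secs = (0 until warmup + runs).map { _ =>
      val t0 = System.nanoTime()
      f
      (System.nanoTime() - t0) / 1e9
    }
    val (w, s) = secs.splitAt(warmup)
    Timing(w, s)
  }
}

// Possible use in spark-shell (assumes the usual `spark` session is in scope):
//   val t = WarmupTimer.time(warmup = 1, runs = 4) {
//     spark.range(1060000).selectExpr("percentile_approx(id, 0.5)").collect()
//   }
//   println(s"warm-up: ${t.warmupSecs}, steady-state avg: ${t.steadyAvgSecs}s")
```

Dropping the first run this way is only one convention for handling JIT and session start-up cost; it does not change the conclusion above that the 1,060,000-row and 1,040,000-row cases take about the same time.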

> SparkSQL percentile_approx function is too slow for over 1,060,000 records.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-24030
>                 URL: https://issues.apache.org/jira/browse/SPARK-24030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>         Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and a local laptop.
>            Reporter: Seok-Joon,Yun
>            Priority: Major
>         Attachments: screenshot_2018-04-20 23.15.02.png
>
>
> I used the percentile_approx function on over 1,060,000 records and it is too slow: it takes about 90 minutes. So I tried it on 1,040,000 records, and it takes about 10 seconds.
> I tested reading the data from both JDBC and Parquet. The timings are the same.
> I wonder whether the function is not designed to run on multiple workers.
> I looked at Ganglia and the Spark history. The job ran on one worker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
