spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends
Date Mon, 24 Jul 2017 11:30:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16098227#comment-16098227
] 

Hyukjin Kwon commented on SPARK-21177:
--------------------------------------

Wait ... is the code itself a self-contained reproducer? without any hive step?

> df.saveAsTable slows down linearly, with number of appends
> ----------------------------------------------------------
>
>                 Key: SPARK-21177
>                 URL: https://issues.apache.org/jira/browse/SPARK-21177
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>       /_/
>          
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
>     val start = System.nanoTime()
>     f()
>     val end = System.nanoTime()
>     val timetaken = end - start
>     import scala.concurrent.duration._
>     println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>      |      |      |      |      |      |      | printTimeTaken: (str: String, f: ()
=> Unit)Unit
> scala> 
> for(i <- 1 to 100000) {printTimeTaken("time to append to hive:", () => { Seq(1,
2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> ....
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message