spark-issues mailing list archives

From "Placek (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-9141) DataFrame recomputed instead of using cached parent.
Date Sun, 31 Jan 2016 22:46:39 GMT

    [ https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125517#comment-15125517 ]

Placek edited comment on SPARK-9141 at 1/31/16 10:46 PM:
---------------------------------------------------------

It grows linearly. I got the errors when the string wasn't printed out. I did not see any
of those problems when working with RDDs.

PS. Would having exponential lineage (while not printing it out as a string) cause errors
in Spark SQL?


was (Author: zephod):
It grows linearly. I got the errors when the string wasn't printed out. I did not see any
of those problems when working with RDDs.

> DataFrame recomputed instead of using cached parent.
> ----------------------------------------------------
>
>                 Key: SPARK-9141
>                 URL: https://issues.apache.org/jira/browse/SPARK-9141
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0, 1.4.1
>            Reporter: Nick Pritchard
>            Assignee: Michael Armbrust
>            Priority: Blocker
>              Labels: cache, dataframe
>             Fix For: 1.5.0
>
>
> As I understand it, DataFrame.cache() is supposed to work the same as RDD.cache(), so that
repeated operations on it will use the cached results and not recompute the entire lineage.
However, it seems that some DataFrame operations (e.g. withColumn) change the underlying RDD
lineage so that cache doesn't work as expected.
> Below is a Scala example that demonstrates this. First, I define two UDFs that use
println so that it is easy to see when they are being called. Next, I create a simple data
frame with one row and two columns. Next, I add a column, cache it, and call count() to force
the computation. Lastly, I add another column, cache it, and call count().
> I would have expected the last statement to only compute the last column, since everything
else was cached. However, because withColumn() changes the lineage, the whole data frame is
recomputed.
> {code}
>     // Imports needed outside spark-shell (sc and sqlContext are assumed to exist)
>     import org.apache.spark.sql.functions.udf
>     import sqlContext.implicits._
>     // Example UDFs that print a message each time they are called
>     val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
>     val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }
>     // Initial dataset: one row, two columns
>     val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value")
>     // Add a column by applying the twice UDF
>     val df2 = df1.withColumn("twice", twice($"value"))
>     df2.cache()
>     df2.count() // prints Computed: twice(1)
>     // Add a column by applying the triple UDF
>     val df3 = df2.withColumn("triple", triple($"value"))
>     df3.cache()
>     df3.count() // prints Computed: twice(1)\nComputed: triple(1)
> {code}
> I found a workaround, which helped me understand what was going on behind the scenes,
but doesn't seem like an ideal solution. Basically, I convert to an RDD and then back to a DataFrame,
which seems to freeze the lineage. The code below shows the workaround for creating the second
data frame so cache will work as expected.
> {code}
>     val df2 = {
>       val tmp = df1.withColumn("twice", twice($"value"))
>       sqlContext.createDataFrame(tmp.rdd, tmp.schema)
>     }
> {code}
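
For anyone who wants to try the quoted workaround outside spark-shell, here is a minimal
sketch of it as a standalone program. It assumes a Spark 1.x classpath and a local master.
The object name FreezeLineageSketch and the helper freezeLineage are hypothetical names
chosen for illustration, not Spark APIs. The helper only wraps the createDataFrame(rdd, schema)
call from the description above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.udf

object FreezeLineageSketch {
  // Hypothetical helper (not a Spark API): rebuild a DataFrame from its
  // RDD[Row] and schema, as in the workaround from the issue description.
  def freezeLineage(sqlContext: SQLContext, df: DataFrame): DataFrame =
    sqlContext.createDataFrame(df.rdd, df.schema)

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-threaded master is good enough for a demo.
    val conf = new SparkConf().setMaster("local[1]").setAppName("freeze-lineage-sketch")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // UDFs that print when called, as in the original report.
    val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
    val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }

    val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value")

    // Apply the workaround before caching the intermediate result.
    val df2 = freezeLineage(sqlContext, df1.withColumn("twice", twice($"value")))
    df2.cache()
    df2.count()

    val df3 = df2.withColumn("triple", triple($"value"))
    // Print the physical plan. Whether it shows a scan of the cached df2
    // depends on the Spark version, so no output is claimed here.
    df3.explain()
    df3.count()

    sc.stop()
  }
}
```

Comparing the println output and the explain() plan with and without the freezeLineage call
is one way to check whether the cached parent is actually reused on a given Spark version.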




