spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maryann Xue (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-24613) Cache with UDF could not be matched with subsequent dependent caches
Date Wed, 20 Jun 2018 23:17:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Maryann Xue updated SPARK-24613:
--------------------------------
    Priority: Minor  (was: Major)

> Cache with UDF could not be matched with subsequent dependent caches
> --------------------------------------------------------------------
>
>                 Key: SPARK-24613
>                 URL: https://issues.apache.org/jira/browse/SPARK-24613
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maryann Xue
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> When caching a query, we generate its execution plan from the query's logical plan. However,
the logical plan we get from the Dataset has already been analyzed, and when we try the get
the execution plan, this already analyzed logical plan will be analyzed again in the new QueryExecution
object, and unfortunately some rules have side effects if applied multiple times, which in
this case, is the {{HandleNullInputsForUDF}} rule. The re-analyzed plan now has an extra null-check
and can't be matched against the same plan. The following test would fail since {{df2}}'s
execution plan inside the CacheManager does not depend on {{df1}}.
> {code:java}
> test("cache UDF result correctly 2") {
>   val expensiveUDF = udf({x: Int => Thread.sleep(10000); x})
>   val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
>   val df2 = df.agg(sum(df("b")))
>   df.cache()
>   df.count()
>   df2.cache()
>   // udf has been evaluated during caching, and thus should not be re-evaluated here
>   failAfter(5 seconds) {
>     df2.collect()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message