spark-issues mailing list archives

From "Ohad Raviv (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-18748) UDF multiple evaluations causes very poor performance
Date Tue, 06 Dec 2016 19:42:58 GMT
Ohad Raviv created SPARK-18748:
----------------------------------

             Summary: UDF multiple evaluations causes very poor performance
                 Key: SPARK-18748
                 URL: https://issues.apache.org/jira/browse/SPARK-18748
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Ohad Raviv


We have a use case that involves a relatively expensive UDF. The problem is that
instead of being calculated once, it gets calculated over and over again.
For example:
{quote}
def veryExpensiveCalc(str: String) = { println("blahblah1"); "nothing" }
hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _)
hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is not null and c<>''").show
{quote}
with the output:
{quote}
blahblah1
blahblah1
blahblah1
+-------+
|      c|
+-------+
|nothing|
+-------+
{quote}
You can see that each reference to column "c" triggers another println, i.e. another
evaluation of the UDF. This causes very poor performance in our real use case.
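
The three printed lines can be reproduced without Spark. The following is only a plain-Scala sketch under an assumption this report does not confirm: that the planner replaces the alias "c" with the UDF call at every reference (two in the WHERE clause, one in the SELECT list). The names `calls`, `inlined`, `evaluatedOnce` and `countCalls` are hypothetical helpers for illustration, and a counter stands in for the println.

```scala
// Plain-Scala sketch (no Spark needed): counts how often an expensive
// function runs when a column alias is inlined versus computed once.
object UdfEvaluationSketch {
  // Invocation counter standing in for the println in the report.
  var calls = 0

  def veryExpensiveCalc(str: String): String = { calls += 1; "nothing" }

  // ASSUMED plan shape after the alias `c` is substituted at each reference:
  // two references in the WHERE clause plus one in the SELECT list.
  def inlined(input: String): Option[String] =
    if (veryExpensiveCalc(input) != null && veryExpensiveCalc(input) != "")
      Some(veryExpensiveCalc(input))
    else None

  // Desired shape: evaluate once, then filter on the stored value.
  def evaluatedOnce(input: String): Option[String] = {
    val c = veryExpensiveCalc(input)
    if (c != null && c != "") Some(c) else None
  }

  // Resets the counter, runs `body`, and returns the number of calls made.
  def countCalls(body: => Any): Int = { calls = 0; body; calls }

  def main(args: Array[String]): Unit = {
    val inlinedCalls = countCalls(inlined("a"))
    val onceCalls = countCalls(evaluatedOnce("a"))
    println("inlined: " + inlinedCalls + " call(s)")
    println("evaluated once: " + onceCalls + " call(s)")
  }
}
```

Running `main` reports 3 calls for the inlined form and 1 for the evaluated-once form, which matches the three "blahblah1" lines above.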
This has also come up on Stack Overflow:
http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns
http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/

Two workarounds are suggested there, both problematic:
1. Call cache() after the first evaluation, e.g.
{quote}
hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not null and c<>''").show
{quote}
While this works, we can't do it in our case because the table is too big to cache.

2. Convert to an RDD and back:
{quote}
val df = hiveContext.sql("select veryExpensiveCalc('a') as c")
hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and c<>''").show
{quote}
This works, but then we lose some optimizations such as predicate pushdown,
and it's very ugly.

Any ideas on how we can make the UDF get calculated just once in a reasonable way?
