spark-issues mailing list archives

From "Hagai Attias (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'
Date Tue, 01 Jan 2019 09:30:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731548#comment-16731548
] 

Hagai Attias commented on SPARK-26510:
--------------------------------------

Is there a released version containing this fix?

> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26510
>                 URL: https://issues.apache.org/jira/browse/SPARK-26510
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0
>            Reporter: Hagai Attias
>            Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching a DataFrame
> and saving it as a temp table. In 1.6, the following code executed {{printUDF}} once. The
> equivalent code in 2.3 (or even the same code as-is) executes it twice.
>  
> {code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid}
> val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = sql.createDataFrame(rdd, schema)
> df1.registerTempTable("data_table")
> sql.udf.register("printUDF", (x: Int) => {
>   print(x)
>   x
> })
> val df2 = sql.sql("select printUDF(num) result from data_table").cache()
> df2.collect() // execute cache
> df2.registerTempTable("cached_table")
> val df3 = sql.sql("select result from cached_table")
> df3.collect()
> {code}
> {code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid}
> val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = session.createDataFrame(rdd, schema)
> df1.createOrReplaceTempView("data_table")
> session.udf.register("printUDF", (x: Int) => {
>   print(x)
>   x
> })
> val df2 = session.sql("select printUDF(num) result from data_table").cache()
> df2.collect() // execute cache
> df2.createOrReplaceTempView("cached_table")
> val df3 = session.sql("select result from cached_table")
> df3.collect()
> {code}
>  
> 1.6 prints {{123}} while 2.3 prints {{123123}}, thus evaluating the DataFrame twice.
> I managed to mitigate this by skipping the temporary table and selecting directly from the
> cached DataFrame, but I was wondering whether this is expected behavior or a known issue.
>  
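For reference, the workaround mentioned in the description could be sketched roughly as follows. This is a sketch, not code from the issue: it assumes the same {{session}}, {{data_table}} view and {{printUDF}} registration as the 2.3 snippet above, and replaces the second temp view with a {{select}} on the cached DataFrame itself.

{code:java|title=Workaround sketch (hypothetical)|borderStyle=solid}
val df2 = session.sql("select printUDF(num) result from data_table").cache()
df2.collect() // materialize the cache; printUDF runs here

// Instead of df2.createOrReplaceTempView("cached_table") + session.sql(...),
// select from the cached DataFrame directly so its cached plan is reused:
val df3 = df2.select("result")
df3.collect() // served from the cache rather than re-evaluating printUDF
{code}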



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

