spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Xiao Li (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-25368) Incorrect predicate pushdown returns wrong result
Date Sun, 09 Sep 2018 16:11:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Xiao Li reassigned SPARK-25368:
-------------------------------

    Assignee: Yuming Wang

> Incorrect predicate pushdown returns wrong result
> -------------------------------------------------
>
>                 Key: SPARK-25368
>                 URL: https://issues.apache.org/jira/browse/SPARK-25368
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.3.1, 2.3.2
>            Reporter: Lev Katzav
>            Assignee: Yuming Wang
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.3.2, 2.4.0
>
>         Attachments: plan.txt
>
>
> there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5)
> the following code recreates the problem
>  (it's a bit convoluted examples, I tried to simplify it as much as possible from my
code)
> {code:java}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import spark.implicits._
> case class Data(a: Option[Int],b: String,c: Option[String],d: String)
> val df1 = spark.createDataFrame(Seq(
>    Data(Some(1), "1", None, "1"),
>    Data(None, "2", Some("2"), "2")
> ))
> val df2 = df1
> .where( $"a".isNotNull)
> .withColumn("e", lit(null).cast("string"))
> val columns = df2.columns.map(c => col(c))
> val df3 = df1
> .select(
>   $"c",
>   $"b" as "e"
>   )
>   .withColumn("a", lit(null).cast("int"))
>   .withColumn("b", lit(null).cast("string"))
>   .withColumn("d", lit(null).cast("string"))
>   .select(columns :_*)
> val df4 =
>   df2.union(df3)
>   .withColumn("e", last(col("e"), ignoreNulls = true).over(Window.partitionBy($"c").orderBy($"d")))
>   .filter($"a".isNotNull)
> df4.show
> {code}
>  
> notice that the last statement in for df4 is to filter rows where a is null
> in spark 2.2.1, the above code prints:
> {code:java}
> +---+---+----+---+---+ 
> | a| b| c| d| e|
>  +---+---+----+---+---+ 
> | 1| 1|null| 1| 1| 
> +---+---+----+---+---+
> {code}
> in spark 2.3.x, it prints: 
> {code:java}
> +----+----+----+----+---+ 
> | a| b| c| d| e| 
> +----+----+----+----+---+ 
> |null|null|null|null| 1| 
> | 1| 1|null| 1| 1| 
> |null|null| 2|null| 2|
>  +----+----+----+----+---+
> {code}
>  the column a still contains null values
>  
> attached are the plans.
> in the parsed logical plan, the filter for isnotnull('a), is on top,
>  but in the optimized logical plan, it is pushed down



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message