spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nattavut Sutyanyong (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
Date Sat, 06 Aug 2016 16:20:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400090#comment-15400090
] 

Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:20 PM:
---------------------------------------------------------------------

{{noformat}}
scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true)
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter exists#21
   :  +- 'SubqueryAlias exists#21
   :     +- 'GlobalLimit 1
   :        +- 'LocalLimit 1
   :           +- 'Project [unresolvedalias(1, None)]
   :              +- 'Filter ('t1.c1 = 't2.c2)
   :                 +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t1`

== Analyzed Logical Plan ==
c1: int
Project [c1#17]
+- Filter predicate-subquery#21 [(c1#17 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)]   <== This correlated predicate
is incorrectly moved above the LIMIT
   :     +- GlobalLimit 1
   :        +- LocalLimit 1
   :           +- Project [1 AS 1#26, c2#10]
   :              +- SubqueryAlias t2
   :                 +- Project [value#8 AS c2#10]
   :                    +- LocalRelation [value#8]
   +- SubqueryAlias t1
      +- Project [value#15 AS c1#17]
         +- LocalRelation [value#15]
{{noformat}}
By rewriting the correlated predicate in the subquery in Analysis phase from below the LIMIT
1 operation to above it causing the scan of the subquery table to return only 1 row. The correct
semantic is the LIMIT 1 must be applied on the subquery for each input value from the parent
table.


was (Author: nsyca):
scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true)
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter exists#21
   :  +- 'SubqueryAlias exists#21
   :     +- 'GlobalLimit 1
   :        +- 'LocalLimit 1
   :           +- 'Project [unresolvedalias(1, None)]
   :              +- 'Filter ('t1.c1 = 't2.c2)
   :                 +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t1`

== Analyzed Logical Plan ==
c1: int
Project [c1#17]
+- Filter predicate-subquery#21 [(c1#17 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)]   <== This correlated predicate
is incorrectly moved above the LIMIT
   :     +- GlobalLimit 1
   :        +- LocalLimit 1
   :           +- Project [1 AS 1#26, c2#10]
   :              +- SubqueryAlias t2
   :                 +- Project [value#8 AS c2#10]
   :                    +- LocalRelation [value#8]
   +- SubqueryAlias t1
      +- Project [value#15 AS c1#17]
         +- LocalRelation [value#15]

By rewriting the correlated predicate in the subquery in Analysis phase from below the LIMIT
1 operation to above it causing the scan of the subquery table to return only 1 row. The correct
semantic is the LIMIT 1 must be applied on the subquery for each input value from the parent
table.

> Correlated subqueries containing non-deterministic operators return incorrect results
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-16804
>                 URL: https://issues.apache.org/jira/browse/SPARK-16804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule ResolveSubquery
in the Analysis phase moves correlated predicates to a join predicates and neglect the semantic
of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").show
> +---+                                                                           
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message