spark-issues mailing list archives

From "Herman van Hovell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
Date Tue, 23 Aug 2016 23:24:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433849#comment-15433849 ]

Herman van Hovell commented on SPARK-17099:
-------------------------------------------

A small update: disregard my previous diagnosis. This is caused by a bug in the {{EliminateOuterJoin}}
rule, which converts the right outer join into an inner join, as the following optimizer log shows:
{noformat}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===

Before:
Project [sum(coalesce(int_col_5, int_col_2))#34L, (coalesce(int_col_5, int_col_2) * 2)#32]
+- Filter (isnotnull(sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L) && (sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L > cast((coalesce(int_col_5#4, int_col_2#13)#38 * 2) as bigint)))
   +- Aggregate [greatest(coalesce(int_col_5#14, 109), coalesce(int_col_5#4, -449)), coalesce(int_col_5#4, int_col_2#13)], [sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint)) AS sum(coalesce(int_col_5, int_col_2))#34L, (coalesce(int_col_5#4, int_col_2#13) * 2) AS (coalesce(int_col_5, int_col_2) * 2)#32, sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint)) AS sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L, coalesce(int_col_5#4, int_col_2#13) AS coalesce(int_col_5#4, int_col_2#13)#38]
      +- Filter isnotnull(coalesce(int_col_5#4, int_col_2#13))
         +- Join RightOuter, (int_col_2#13 = int_col_5#4)
            :- Project [value#2 AS int_col_5#4]
            :  +- SerializeFromObject [input[0, int, true] AS value#2]
            :     +- ExternalRDD [obj#1]
            +- Project [_1#10 AS int_col_2#13, _2#11 AS int_col_5#14]
               +- SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true], top level non-flat input object)._1 AS _1#10, assertnotnull(input[0, scala.Tuple2, true], top level non-flat input object)._2 AS _2#11]
                  +- ExternalRDD [obj#9]

After (identical plan, except that the join node has become):
         +- Join Inner, (int_col_2#13 = int_col_5#4)
{noformat}
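To see why the conversion is unsound: a filter only licenses turning an outer join into an inner join if it rejects rows whose non-preserved side is null. The pushed-down filter here is {{isnotnull(coalesce(int_col_5#4, int_col_2#13))}}, and because {{coalesce}} falls through to the preserved (right) side, the predicate can still hold when the left side is entirely null. A minimal sketch in plain Python (an illustration of the predicate semantics, not Catalyst code):

```python
def coalesce(*args):
    """First non-None argument, mimicking SQL COALESCE (None plays the role of NULL)."""
    return next((a for a in args if a is not None), None)

def is_not_null(x):
    return x is not None

# A row produced by the right outer join where the left (t1) side did not
# match: t1.int_col_5 is NULL, int_col_2 comes from the preserved t2 side.
left_int_col_5 = None
right_int_col_2 = -769

# The filter still evaluates to True, so it does NOT reject left-null rows;
# converting the right outer join to an inner join therefore drops rows
# that the filter would have kept.
print(is_not_null(coalesce(left_int_col_5, right_int_col_2)))  # True
```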

PR https://github.com/apache/spark/pull/14661 fixes this.
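The four-row result that Postgres returns can be reproduced by simulating the query's right outer join, grouping, and HAVING filter over plain Python collections (a sketch of the intended SQL semantics, not Spark code):

```python
t1 = [-234, 145, 367, 975, 298]                       # t1.int_col_5
t2 = [(-769, -244), (-800, -409), (940, 86),
      (-507, 304), (-367, 158)]                       # (int_col_2, int_col_5)

def coalesce(*args):
    """First non-None argument, mimicking SQL COALESCE."""
    return next((a for a in args if a is not None), None)

# RIGHT JOIN t2 ON t2.int_col_2 = t1.int_col_5: every t2 row is preserved;
# no t1 value matches any t2.int_col_2, so the t1 side is always NULL.
joined = []
for c2, c5_t2 in t2:
    matches = [v for v in t1 if v == c2]
    for m in matches:
        joined.append((m, c2, c5_t2))
    if not matches:
        joined.append((None, c2, c5_t2))

# GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
#          COALESCE(t1.int_col_5, t2.int_col_2)
groups = {}
for c5_t1, c2, c5_t2 in joined:
    key = (max(coalesce(c5_t2, 109), coalesce(c5_t1, -449)),
           coalesce(c5_t1, c2))
    groups.setdefault(key, []).append(coalesce(c5_t1, c2))

# HAVING SUM(COALESCE(...)) > COALESCE(...) * 2
result = [(sum(vals), key[1] * 2)
          for key, vals in groups.items()
          if sum(vals) > key[1] * 2]
print(len(result))  # 4 rows survive the HAVING filter, matching Postgres
```

Of the five single-row groups, only the group with sum 940 fails {{940 > 1880}}; the other four pass, so an empty result from Spark is wrong.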

> Incorrect result when HAVING clause is added to group by query
> --------------------------------------------------------------
>
>                 Key: SPARK-17099
>                 URL: https://issues.apache.org/jira/browse/SPARK-17099
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> Random query generation uncovered the following query, which returns incorrect results when
> run on Spark SQL. This isn't the original query produced by the generator; I performed a bit
> of minimization to make it more understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
>     (-769, -244),
>     (-800, -409),
>     (940, 86),
>     (-507, 304),
>     (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>      ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>          COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. However,
> if I omit the {{HAVING}} clause I get the following five rows, which shows that the groups
> are being incorrectly filtered out when the {{HAVING}} clause is applied:
> {code}
> +--------------------------------------+---------------------------------------+--+
> | sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
> +--------------------------------------+---------------------------------------+--+
> | -507                                 | -1014                                 |
> | 940                                  | 1880                                  |
> | -769                                 | -1538                                 |
> | -367                                 | -734                                  |
> | -800                                 | -1600                                 |
> +--------------------------------------+---------------------------------------+--+
> {code}
> Based on this, the output after adding the {{HAVING}} clause should contain four rows, not zero.
> I'm not sure how to shrink this further in a straightforward way, so I'm opening this bug to
> get help with further triage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
