spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION
Date Tue, 11 Dec 2018 12:08:01 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716963#comment-16716963
] 

ASF GitHub Bot commented on SPARK-26078:
----------------------------------------

AmplabJenkins commented on issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on
IN subqueries
URL: https://github.com/apache/spark/pull/23057#issuecomment-446179922
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> WHERE .. IN fails to filter rows when used in combination with UNION
> --------------------------------------------------------------------
>
>                 Key: SPARK-26078
>                 URL: https://issues.apache.org/jira/browse/SPARK-26078
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.0
>            Reporter: Arttu Voutilainen
>            Priority: Blocker
>              Labels: correctness
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN correctly,
when used in combination with UNION, but instead returns also rows that do not fulfill the
condition. Swapping the order of the datasets in the UNION makes the problem go away. Repro
below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
>     SELECT id,num,source FROM
>     (
>         SELECT id, num, 'a' as source FROM a
>         UNION ALL
>         SELECT id, num, 'b' as source FROM b
>     ) AS c
>     WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
>     SELECT id,num,source FROM
>     (
>         SELECT id, num, 'b' as source FROM b
>         UNION ALL
>         SELECT id, num, 'a' as source FROM a
>     ) AS c
>     WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", that should not be
there as it  
> {code:java}
> >>> bug.show()
> +---+---+------+
> | id|num|source|
> +---+---+------+
> |  a|  2|     a|
> |  a|  2|     b|
> |  b|  1|     b|
> +---+---+------+
> >>> no_bug.show()
> +---+---+------+
> | id|num|source|
> +---+---+------+
> |  a|  2|     b|
> |  a|  2|     a|
> +---+---+------+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> :     :- LogicalRDD [id#0, num#1L], false
> :     +- Project [id#4]
> :        +- Filter (isnotnull(num#5L) && (num#5L = 2))
> :           +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>    :- Project [id#4, num#5L, b AS source#137]
>    :  +- LogicalRDD [id#4, num#5L], false
>    +- Project [id#4 AS id#4#172]
>       +- Filter (isnotnull(num#5L) && (num#5L = 2))
>          +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition seems wrong,
and I believe it causes the LeftSemi to return true for all rows in the left-hand-side table,
thus failing to filter as the WHERE .. IN should. Compare with the non-buggy version, where
both LeftSemi joins have distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> :     :- LogicalRDD [id#4, num#5L], false
> :     +- Project [id#4 AS id#4#173]
> :        +- Filter (isnotnull(num#5L) && (num#5L = 2))
> :           +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>    +- Join LeftSemi, (id#0 = id#4#173)
>       :- LogicalRDD [id#0, num#1L], false
>       +- Project [id#4 AS id#4#173]
>          +- Filter (isnotnull(num#5L) && (num#5L = 2))
>             +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message