spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benyi Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
Date Thu, 12 Oct 2017 17:03:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202278#comment-16202278
] 

Benyi Wang commented on SPARK-22211:
------------------------------------

I think my suggestion solution is correct.

|| Case || Left join key || Right join key || Full outer join ||
| 1 | Y | N | {{(left(\*), null)}} |
| 2 | Y | Y | {{(left(\*), right(\*))}} |
| 3 | N | Y | {{(null, right(\*))}} |
| 4 | N | N | Not applied |

If LimitPushDown pushes limit to the left side, whatever a limit value is and how big of left
side table, you will always select some rows, in other words, the join keys are always exists,
and only case 1 and 2 will happen, so it is actually a Left-join instead. It is equivalent
to right-join when pushing down to  the right side.

The only problem of this method is: case 3 has no chance to be shown while pushing down to
the left side, and case 1 for the right side. I would say this is not a big issue because
we just want to see some samples of the join result, but the benefit is huge. If we want to
see left-only or right-only, we might add where clause.  

> LimitPushDown optimization for FullOuterJoin generates wrong results
> --------------------------------------------------------------------
>
>                 Key: SPARK-22211
>                 URL: https://issues.apache.org/jira/browse/SPARK-22211
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>            Reporter: Benyi Wang
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may generate
a wrong result:
> Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 is selected,
but at right side we have 100K rows including 999, the result will be
> - one row is (999, 999)
> - the rest rows are (null, xxx)
> Once you call show(), the row (999,999) has only 1/100000th chance to be selected by
CollectLimit.
> The actual optimization might be, 
> - push down limit
> - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/6888856075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 100000).toDF("id")
> val dr = shuffle(1 to 100000).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>    :- *Sort [id#10 ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(id#10, 200)
>    :     +- *LocalLimit 1
>    :        +- LocalTableScan [id#10]
>    +- *Sort [id#16 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(id#16, 200)
>          +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> +----+---+
> |id  |id |
> +----+---+
> |null|148|
> +----+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message