spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zhenhua Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-20010) Sort information is lost after sort merge join
Date Sat, 18 Mar 2017 06:14:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Zhenhua Wang updated SPARK-20010:
---------------------------------
    Description: 
After sort merge join for inner join, now we only keep left key ordering. However, after inner
join, right key has the same value and order as left key. So if we need another smj on right
key, we will unnecessarily add a sort which causes additional cost.

As a more complicated example, A join B on A.key = B.key join C on B.key = C.key join D on
A.key = D.key. We will unnecessarily add a sort on B.key when join \{A, B\} and C, and add
a sort on A.key when join \{A, B, C\} and D.

To fix this, we need to propagate all sorted information (equivalent expressions) from bottom
up.

  was:
After sort merge join for inner join, now we only keep left key ordering. However, after inner
join, right key has the same value and order as left key. So if we need another smj on right
key, we will unnecessarily add a sort which causes additional cost.

As a more complicated example, A join B on A.key = B.key join C on B.key = C.key join D on
A.key = D.key. We will unnecessarily add a sort on B.key when join {A, B} and C, and add a
sort on A.key when join {A, B, C} and D.

To fix this, we need to propagate all sorted information (equivalent expressions) from bottom
up.


> Sort information is lost after sort merge join
> ----------------------------------------------
>
>                 Key: SPARK-20010
>                 URL: https://issues.apache.org/jira/browse/SPARK-20010
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Zhenhua Wang
>
> After sort merge join for inner join, now we only keep left key ordering. However, after
inner join, right key has the same value and order as left key. So if we need another smj
on right key, we will unnecessarily add a sort which causes additional cost.
> As a more complicated example, A join B on A.key = B.key join C on B.key = C.key join
D on A.key = D.key. We will unnecessarily add a sort on B.key when join \{A, B\} and C, and
add a sort on A.key when join \{A, B, C\} and D.
> To fix this, we need to propagate all sorted information (equivalent expressions) from
bottom up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message