spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24020) Sort-merge join inner range optimization
Date Thu, 19 Apr 2018 21:18:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444814#comment-16444814
] 

Apache Spark commented on SPARK-24020:
--------------------------------------

User 'zecevicp' has created a pull request for this issue:
https://github.com/apache/spark/pull/21109

> Sort-merge join inner range optimization
> ----------------------------------------
>
>                 Key: SPARK-24020
>                 URL: https://issues.apache.org/jira/browse/SPARK-24020
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Petar Zecevic
>            Priority: Major
>
> The problem we are solving is the case where you have two big tables partitioned by X
column, but also sorted by Y column (within partitions) and you need to calculate an expensive
function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows
that have the same X values and calculate the function's value on all of them. If the two
tables have a large number of rows per X, this can result in a huge number of calculations.
> We hereby propose an optimization that would allow you to reduce the number of matching
rows per X using a range condition on Y columns of the two tables. Something like:
> ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
> The way SMJ is currently implemented, these extra conditions have no influence on the
number of rows (per X) being checked because these extra conditions are put in the same block
with the function being calculated.
> Here we propose a change to the sort-merge join so that, when these extra conditions
are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This
queue would then used as a moving window across the values from the right relation as the
left row changes. You could call this a combination of an equi-join and a theta join (we call
it "sort-merge inner range join").
> Potential use-cases for this are joins based on spatial or temporal distance calculations.
> The optimization should be triggered automatically when an equi-join expression is present
AND lower and upper range conditions on a secondary column are specified. If the tables aren't
sorted by both columns, appropriate sorts should be added.
> To limit the impact of this change we also propose adding a new parameter (tentatively
named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the
optimization entirely.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message