spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition
Date Tue, 18 Apr 2017 05:24:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972142#comment-15972142
] 

Takeshi Yamamuro commented on SPARK-20313:
------------------------------------------

What's the issue that you'd like to point out? I think the description is ambiguous. 

> Possible lack of join optimization when partitions are in the join condition
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-20313
>                 URL: https://issues.apache.org/jira/browse/SPARK-20313
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer
>    Affects Versions: 2.1.0
>            Reporter: Albert Meltzer
>
> Given two tables T1 and T2, partitioned on column part1, the following have vastly different
execution performance:
> // initial, slow
> {noformat}
> val df1 = // load data from T1
>   .filter(functions.col("part1").between("val1", "val2")
> val df2 = // load data from T2
>   .filter(functions.col("part1").between("val1", "val2")
> val df3 = df1.join(df2, Seq("part1", "col1"))
> {noformat}
> // manually optimized, considerably faster
> {noformat}
> val df1 = // load data from T1
> val df2 = // load data from T2
> val part1values = Seq(...) // a collection of values between val1 and val2
> val df3 = part1values
>   .map(part1value => {
>     val df1filtered = df1.filter(functions.col("part1") === part1value)
>     val df2filtered = df2.filter(functions.col("part1") === part1value)
>     df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
>   })
>   .reduce(_ union _)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message