spark-issues mailing list archives

From "Zhan Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
Date Sat, 18 Mar 2017 04:43:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931054#comment-15931054 ]

Zhan Zhang edited comment on SPARK-20006 at 3/18/17 4:42 AM:
-------------------------------------------------------------

The default ShuffledHashJoin threshold can fall back to the broadcast one, but a separate
configuration would give us the opportunity to optimize the join dramatically. It would be great
if the CBO could automatically find the best strategy, but perhaps I am missing something:
currently the CBO does not collect the right statistics, especially for partitioned tables.
I have opened a JIRA for that issue as well: https://issues.apache.org/jira/browse/SPARK-19890
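
To make the tuning question concrete, here is a minimal Scala sketch of how the build-side size
is bounded today and how a separate shuffled-hash-join threshold might look. The key
spark.sql.shuffledHashJoin.threshold is hypothetical (it is what this issue proposes), while
spark.sql.autoBroadcastJoinThreshold and spark.sql.join.preferSortMergeJoin do exist in Spark 2.1.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("join-threshold-sketch")
      .master("local[*]")
      // Existing knob: gates BOTH broadcast hash join and (indirectly) shuffled hash join.
      .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)   // 10 MB
      // Existing knob: must be false before a shuffled hash join is even considered.
      .config("spark.sql.join.preferSortMergeJoin", "false")
      // Hypothetical knob proposed here: an independent build-side limit for ShuffledHashJoin.
      .config("spark.sql.shuffledHashJoin.threshold", 512L * 1024 * 1024)  // 512 MB (not a real config)
      .getOrCreate()

    // With a single shared threshold, lowering it to avoid large broadcasts also shrinks
    // the size window in which a shuffled hash join can be picked.
    val left  = spark.range(0, 1000000).toDF("key")
    val right = spark.range(0, 1000).toDF("key")
    left.join(right, "key").explain()   // inspect which join strategy the planner chose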


> Separate threshold for broadcast and shuffled hash join
> -------------------------------------------------------
>
>                 Key: SPARK-20006
>                 URL: https://issues.apache.org/jira/browse/SPARK-20006
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Zhan Zhang
>            Priority: Minor
>
> Currently both canBroadcast and canBuildLocalHashMap use the same configuration: AUTO_BROADCASTJOIN_THRESHOLD.
>
> But the memory model may be different. For broadcast, the hash map is currently always built
> on heap. For ShuffledHashJoin, the hash map may be built on heap (longHash) or off heap (other
> maps, if off-heap is enabled). Sharing the same configuration makes it hard to tune (how to
> allocate memory on heap/off heap). I propose to use separate configurations. Please comment on
> whether it is reasonable.
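
For reference, a simplified Scala sketch of the two planner checks described above; the names and
shapes are illustrative of Spark 2.1's join selection, not copied verbatim from the source. The
point is that both predicates read the same autoBroadcastJoinThreshold value, only scaled
differently.

    // Simplified sketch; field and method names are illustrative, not exact Spark internals.
    case class Stats(sizeInBytes: BigInt)
    case class Conf(autoBroadcastJoinThreshold: Long, numShufflePartitions: Int)

    // Build side small enough to ship to every executor -> BroadcastHashJoin.
    def canBroadcast(stats: Stats, conf: Conf): Boolean =
      stats.sizeInBytes >= 0 && stats.sizeInBytes <= conf.autoBroadcastJoinThreshold

    // Per-partition build side small enough to hash locally -> ShuffledHashJoin.
    // Note it reuses the SAME threshold, only multiplied by the shuffle partition count.
    def canBuildLocalHashMap(stats: Stats, conf: Conf): Boolean =
      stats.sizeInBytes < BigInt(conf.autoBroadcastJoinThreshold) * conf.numShufflePartitions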



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

