Date: Sat, 18 Mar 2017 04:43:41 +0000 (UTC)
From: "Zhan Zhang (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Comment Edited] (SPARK-20006) Separate threshold for broadcast and shuffled hash join

    [ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931054#comment-15931054 ]

Zhan Zhang edited comment on SPARK-20006 at 3/18/17 4:42 AM:
-------------------------------------------------------------

The default ShuffledHashJoin threshold can fall back to the broadcast one. A separate configuration does give us opportunities to optimize the join dramatically. It would be great if CBO could automatically find the best strategy, but I may be missing something. Currently the CBO does not collect the right statistics, especially for partitioned tables. I have opened a JIRA for that issue as well.
https://issues.apache.org/jira/browse/SPARK-19890


was (Author: zhzhan):
The default ShuffledHashJoin threshold can fall back to the broadcast one. A separate configuration does give us opportunities to optimize the join dramatically. It would be great if CBO could automatically find the best strategy, but I may be missing something. Currently the CBO does not collect the right statistics, especially for partitioned tables.
https://issues.apache.org/jira/browse/SPARK-19890

> Separate threshold for broadcast and shuffled hash join
> -------------------------------------------------------
>
>                 Key: SPARK-20006
>                 URL: https://issues.apache.org/jira/browse/SPARK-20006
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Zhan Zhang
>            Priority: Minor
>
> Currently both canBroadcast and canBuildLocalHashMap use the same configuration: AUTO_BROADCASTJOIN_THRESHOLD.
> But the memory model may be different. For broadcast, the hash map is currently always built on heap. For shuffledHashJoin, the hash map may be built on heap (longHash) or off heap (other maps, if off-heap is enabled). Sharing one configuration makes it hard to tune (how to allocate memory on heap vs. off heap). I propose using separate configurations. Please comment on whether this is reasonable.
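
The proposal in this thread (two thresholds, with the shuffled hash join one falling back to the broadcast one when unset) can be sketched as a toy model. This is not Spark's implementation: the class `JoinThresholds`, its field names, and the byte values are hypothetical and chosen only for illustration, and both checks are simplified to a plain size comparison.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024


@dataclass(frozen=True)
class JoinThresholds:
    """Hypothetical settings; these names are not Spark configuration keys."""
    broadcast_bytes: int
    # None means "not set": fall back to the broadcast threshold.
    shuffled_hash_bytes: Optional[int] = None

    def effective_shuffled_hash_bytes(self) -> int:
        if self.shuffled_hash_bytes is None:
            return self.broadcast_bytes
        return self.shuffled_hash_bytes


def can_broadcast(size_in_bytes: int, t: JoinThresholds) -> bool:
    # Simplified stand-in for the broadcast check: compare the estimated
    # size against the broadcast threshold only.
    return size_in_bytes <= t.broadcast_bytes


def can_build_local_hash_map(size_in_bytes: int, t: JoinThresholds) -> bool:
    # Simplified stand-in for the shuffled hash join check: uses its own
    # threshold when one is set, otherwise the broadcast threshold.
    return size_in_bytes <= t.effective_shuffled_hash_bytes()


# Illustrative values only.
shared = JoinThresholds(broadcast_bytes=10 * MB)
separate = JoinThresholds(broadcast_bytes=10 * MB, shuffled_hash_bytes=64 * MB)
```

With `shared`, a 32 MB relation fails both checks. With `separate`, it still cannot be broadcast but qualifies for a local hash map, so the two limits can be tuned independently against on-heap and off-heap memory.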