Date: Sun, 25 Jan 2015 15:33:34 +0000 (UTC)
From: "Xuefu Zhang (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291145#comment-14291145 ]

Xuefu Zhang commented on SPARK-3621:
------------------------------------

I'm not sure if I agree that this is "not a problem". To broadcast is to make a certain dataset available to all nodes in the cluster.
Existing broadcast functionality is limited to broadcasting data held in the driver, while this improvement requests that datasets which already exist in the cluster be broadcast to all nodes, without first shipping them from the cluster to the driver and then back out to all nodes in the cluster. Improvement is never a problem if we are not open to it. If for some reason this cannot be done, we need to understand the reason.

> Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3621
>                 URL: https://issues.apache.org/jira/browse/SPARK-3621
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Xuefu Zhang
>
> In some cases, such as Hive's way of doing map-side join, it would be beneficial to allow a client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires that all RDD data be collected to the driver and that the variable be shipped to the cluster after being made. It would perform better if the driver just broadcast the RDDs and used the corresponding data in jobs (such as building hashmaps at executors).
> Tez has a broadcast edge which can ship data from the previous stage to the next stage, which doesn't require driver-side processing.
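
For readers unfamiliar with the pattern under discussion, here is a toy sketch in plain Python. It is not Spark code and uses no Spark API; every function and variable name is made up for illustration. It models the two steps the issue describes for the existing approach: the small dataset's partitions are first gathered in one place (standing in for the driver) and turned into a hash map, and that map is then handed to each partition of the large dataset for a map-side join.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def collect_to_driver(partitions: List[List[Row]]) -> Dict[int, str]:
    """Hypothetical stand-in for step 1: gather every partition of the
    small dataset in one place and build a lookup table there."""
    table: Dict[int, str] = {}
    for part in partitions:
        for key, value in part:
            table[key] = value
    return table


def map_side_join(
    big_partition: Iterable[Row], table: Dict[int, str]
) -> Iterator[Tuple[int, str, str]]:
    """Hypothetical stand-in for step 2: probe the shared table for each
    row of one large partition (inner join, no shuffle of the large side)."""
    for key, value in big_partition:
        if key in table:
            yield key, value, table[key]


# Small dataset, spread over two "executor" partitions.
small_partitions = [[(1, "a")], [(2, "b")]]
# Large dataset, also spread over two partitions.
big_partitions = [[(1, "x"), (3, "y")], [(2, "z"), (1, "w")]]

# Step 1: cluster -> driver.
table = collect_to_driver(small_partitions)
# Step 2: driver -> every partition of the large dataset, then probe locally.
joined = [list(map_side_join(part, table)) for part in big_partitions]
```

In this model the small dataset makes a round trip: out of its partitions into a single table, then back out to every large partition. The request in SPARK-3621 is to avoid that first hop through the driver when the data already lives in the cluster.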