Date: Mon, 26 Jan 2015 21:38:34 +0000 (UTC)
From: "Xuefu Zhang (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292448#comment-14292448 ]

Xuefu Zhang commented on SPARK-2688:
------------------------------------

Yeah. We don't need syntactic sugar, but a transformation that does just one pass over the input RDD. This has performance implications for Hive's multi-insert use cases.
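The recomputation cost described in the issue below comes from lazy evaluation: each action re-runs the upstream lineage unless the intermediate result is materialized. The following is a minimal plain-Scala sketch (no Spark dependency; `RecomputeDemo`, `rdd2Computations`, and the toy `rdd2()` function are illustrative names, not Spark APIs) of that behavior, and of how memoizing the intermediate result, analogous to Spark's real `rdd2.cache()`/`persist()`, avoids the second computation. Note that caching addresses recomputation but still triggers one job per branch, which is why the ticket asks for a genuine single-pass execution of the whole graph.

```scala
// Plain-Scala analogue of the lineage in the issue: rdd1 -> rdd2 -> {rdd3, rdd4}.
// Each "action" re-invokes the upstream transformation unless its result is kept.
object RecomputeDemo {
  var rdd2Computations = 0 // counts how often the rdd2 stage actually runs

  // Stand-in for the expensive rdd1 -> rdd2 transformation.
  def rdd2(): Seq[Int] = {
    rdd2Computations += 1
    (1 to 5).map(_ * 2)
  }

  def main(args: Array[String]): Unit = {
    // Without caching: each downstream branch recomputes rdd2.
    val rdd3 = rdd2().map(_ + 1) // first pass over the rdd2 stage
    val rdd4 = rdd2().map(_ - 1) // rdd2 stage runs again
    println(s"uncached: rdd2 computed ${rdd2Computations} times") // 2

    // With caching (analogue of rdd2.cache() in Spark): compute once, reuse.
    rdd2Computations = 0
    val cached = rdd2()
    val rdd3c = cached.map(_ + 1)
    val rdd4c = cached.map(_ - 1)
    println(s"cached: rdd2 computed ${rdd2Computations} time") // 1
  }
}
```

In real Spark code the equivalent fix today is `rdd2.cache()` before triggering the branches, but as the comment above notes, that still materializes rdd2 and runs a separate job per branch rather than a single pass.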
> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing:
> {code}
> rdd1 -> rdd2 -> rdd3
>           | -> rdd4
>           | -> rdd5
>           \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger execution of the whole graph and reuse rdd2, but there doesn't seem to be a way to do so. Tez has already recognized the importance of this (TEZ-391), so I think Spark should provide this too.
> This is required for Hive to support multi-insert queries (HIVE-7292).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org