Date: Tue, 18 Oct 2016 18:04:32 -0700 (MST)
From: Zhan Zhang
To: dev@spark.apache.org
Subject: SparkPlan/Shuffle stage reuse with Dataset/DataFrame

Hi Folks,

We have some Dataset/DataFrame use cases that would benefit from reusing the SparkPlan and shuffle stage; for example, the following. Because the optimized query plan and SparkPlan are generated by Catalyst at execution time, the underlying RDD lineage is regenerated for dataset1 on each action, and as a result the shuffle stage is executed multiple times:

    val dataset1 = dataset.groupBy(...).agg(...)
    dataset1.registerTempTable("tmpTable")
    spark.sql("select * from tmpTable where condition").collect
    spark.sql("select * from tmpTable where condition1").collect

On the one hand we get an optimized query plan, but on the other hand we cannot reuse the data generated by the shuffle stage.

Currently, to reuse dataset1 we have to persist it to cache the data. That is helpful, but sometimes it is not what we want, because it has side effects. For example, with dynamic allocation enabled, an executor holding an active cache cannot be released even when it is idle. In other words, in a long pipeline with multiple shuffle stages, we only want to reuse the shuffle data as much as possible, without caching.

I am wondering whether it makes sense to add a new feature to Dataset/DataFrame that works as a barrier and prevents query optimization from happening across the barrier. In the above case, we would want Catalyst to treat tmpTable as a barrier and stop optimizing across it, so that we can reuse the underlying RDD lineage of dataset1.
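For reference, the persist-based workaround described above looks roughly like the sketch below (the column names "key" and "value" and the aggregation are made up for illustration; they are not from the original example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch of the persist workaround, assuming a local SparkSession and
// hypothetical columns "key"/"value".
val spark = SparkSession.builder().master("local[*]").appName("reuse-sketch").getOrCreate()
import spark.implicits._

val dataset = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
val dataset1 = dataset.groupBy("key").agg(Map("value" -> "sum"))

// persist() keeps the post-shuffle data so the two queries below do not
// re-run the aggregation, but the cached blocks pin their executors,
// which conflicts with dynamic allocation releasing idle executors.
dataset1.persist(StorageLevel.MEMORY_AND_DISK)
dataset1.registerTempTable("tmpTable")

spark.sql("select * from tmpTable where key = 'a'").collect()
spark.sql("select * from tmpTable where key = 'b'").collect()

dataset1.unpersist()
```

Without the persist call, each spark.sql(...).collect() re-plans from tmpTable and re-executes the groupBy shuffle, which is the duplication described above.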
The prototype code to make this work is quite small, and we have tried it in house with a new API, Dataset.cacheShuffle, to make this happen. But I would like some feedback from the community before opening a JIRA, since in some sense it does stop the optimization earlier. Any comments?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SparkPlan-Shuffle-stage-reuse-with-Dataset-DataFrame-tp19502.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org