Date: Thu, 28 May 2015 03:32:18 +0000 (UTC)
From: "Chengxiang Li (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]

    [ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562241#comment-14562241 ]

Chengxiang Li commented on HIVE-10550:
--------------------------------------

Note: these configurations have been removed in the latest patch.

> Dynamic RDD caching optimization for HoS.[Spark Branch]
> -------------------------------------------------------
>
>                 Key: HIVE-10550
>                 URL: https://issues.apache.org/jira/browse/HIVE-10550
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>         Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch, HIVE-10550.4-spark.patch, HIVE-10550.5-spark.patch, HIVE-10550.6-spark.patch
>
>
> A Hive query may scan the same table multiple times, for example in a self-join or a self-union, or it may reuse the same subquery in several places; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql] is an example. Spark supports caching RDD data: it keeps the computed RDD data in memory and reads it directly from memory the next time it is needed. This avoids the cost of recomputing the RDD (and all of its dependencies) at the cost of more memory usage. By analyzing the query context, we should be able to determine which parts of a query can be shared, so that we can reuse the cached RDD in the generated Spark job.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)