Date: Mon, 14 Jul 2014 03:32:05 +0000 (UTC)
From: "Lefty Leverenz (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060290#comment-14060290 ]

Lefty Leverenz commented on HIVE-2206:
--------------------------------------

This added *hive.optimize.correlation* in HiveConf.java with a description in hive-default.xml.template, so the parameter needs to be documented in the wiki (Configuration Properties).

Note that HIVE-7362 proposes to change the default for *hive.optimize.correlation* to true.
General documentation for the correlation optimizer is covered by HIVE-5130.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>              Labels: TODOC12
>             Fix For: 0.12.0
>
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.20-r1434012.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.D11097.1.patch, HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.20.patch, HIVE-2206.D11097.21.patch, HIVE-2206.D11097.22.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, HIVE-2206.patch, YSmartPatchForHive.patch, testQueries.2.q
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job.
> The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle the data (e.g. join and aggregation operations). However, such operations may involve the correlations explained below and can thus be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations.
> I think that it is better than adding individual optimizers.
> Several things can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs include intermediate tables; and
> # Optimize queries involving self joins.
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc



--
This message was sent by Atlassian JIRA
(v6.2#6252)
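
For readers of the archive, the three correlation definitions quoted in the issue description (IC, TC, JFC) can be sketched as simple predicates. This is only an illustration under stated assumptions and is not taken from the HIVE-2206 patches: the MRJob class, its fields, and the function names are all hypothetical, and the children field assumes that a job's "child nodes" are the jobs whose output it consumes in the plan tree.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass
class MRJob:
    """Hypothetical stand-in for one MapReduce job in a query plan."""
    name: str
    inputs: FrozenSet[str]           # names of the input relations
    partition_key: Tuple[str, ...]   # columns the shuffle partitions on
    # Assumption: children are the jobs whose output this job consumes.
    children: List["MRJob"] = field(default_factory=list)


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    """IC: the input relation sets of the two jobs are not disjoint."""
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    """TC: input correlation plus the same partition key."""
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    """JFC: `child` is a child node of `job` with the same partition key."""
    is_child = any(c is child for c in job.children)
    return is_child and job.partition_key == child.partition_key


# Two aggregations that both read table t1 and shuffle on the same key,
# followed by a join on that key over their outputs.
agg_a = MRJob("agg_a", frozenset({"t1"}), ("key",))
agg_b = MRJob("agg_b", frozenset({"t1", "t2"}), ("key",))
join = MRJob("join", frozenset({"agg_a_out", "agg_b_out"}), ("key",),
             children=[agg_a, agg_b])

# True when `join` has JFC with every one of its child jobs.
join_is_fully_correlated = all(
    has_job_flow_correlation(join, c) for c in join.children
)
```

In this toy plan agg_a and agg_b share table t1 and the partition key, and the join has the same partition key as both of its children, so the three jobs are the kind of candidates the description says could be merged. The sketch does not model the other two conditions from the description (original input tables only, no self join).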