Return-Path:
X-Original-To: apmail-hive-dev-archive@www.apache.org
Delivered-To: apmail-hive-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 044F0CB26 for ; Thu, 18 Jul 2013 18:20:55 +0000 (UTC)
Received: (qmail 5963 invoked by uid 500); 18 Jul 2013 18:20:54 -0000
Delivered-To: apmail-hive-dev-archive@hive.apache.org
Received: (qmail 5914 invoked by uid 500); 18 Jul 2013 18:20:54 -0000
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@hive.apache.org
Delivered-To: mailing list dev@hive.apache.org
Received: (qmail 5882 invoked by uid 500); 18 Jul 2013 18:20:53 -0000
Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org
Received: (qmail 5878 invoked by uid 99); 18 Jul 2013 18:20:53 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 18 Jul 2013 18:20:53 +0000
Date: Thu, 18 Jul 2013 18:20:53 +0000 (UTC)
From: "Hudson (JIRA)"
To: hive-dev@hadoop.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712597#comment-13712597 ]

Hudson commented on HIVE-2206:
------------------------------

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #90 (See [https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/90/])
HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization
(Yin Huai via Ashutosh Chauhan)

Summary: update test results

This issue proposes a new logical optimizer called Correlation
Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.

Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle the data (e.g. join and aggregation operations). However, such operations may involve the correlations explained below, in which case they can be executed in a single MR job.

    Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
    Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
    Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.

The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.

    There exists an MR job for a reduce-side join operator or reduce-side aggregation operator which has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
    All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
    No self join is involved in those correlated MR jobs.

The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.

The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.
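
For readers who want the three correlation definitions above in concrete form, here is a minimal Java sketch. It is only an illustration under stated assumptions: CorrelationSketch, MrJob and the three has*Correlation methods are hypothetical names made up for this note and are not classes or APIs from the patch or from Hive. Each job is reduced to its set of input tables and its partition key, which is much simpler than what the optimizer sees in a real query plan tree.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: every name in this file is hypothetical and is not
// taken from the Hive code base or from the HIVE-2206 patch.
final class CorrelationSketch {

    /** A simplified MR job: its input relation set and its partition key. */
    static final class MrJob {
        final String name;
        final Set<String> inputTables;
        final List<String> partitionKey;

        MrJob(String name, List<String> partitionKey, String... inputTables) {
            this.name = name;
            this.partitionKey = partitionKey;
            this.inputTables = new HashSet<>(Arrays.asList(inputTables));
        }
    }

    /** IC: the input relation sets of the two jobs are not disjoint. */
    static boolean hasInputCorrelation(MrJob a, MrJob b) {
        return !Collections.disjoint(a.inputTables, b.inputTables);
    }

    /** TC: input correlation plus the same partition key. */
    static boolean hasTransitCorrelation(MrJob a, MrJob b) {
        return hasInputCorrelation(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    /** JFC: a job uses the same partition key as one of its child nodes. */
    static boolean hasJobFlowCorrelation(MrJob job, MrJob child) {
        return job.partitionKey.equals(child.partitionKey);
    }

    public static void main(String[] args) {
        // Two jobs that both read t1 and both shuffle on "key".
        MrJob job1 = new MrJob("job1", Arrays.asList("key"), "t1");
        MrJob job2 = new MrJob("job2", Arrays.asList("key"), "t1", "t2");
        // A job that reads t1 but shuffles on a different column.
        MrJob job3 = new MrJob("job3", Arrays.asList("value"), "t1");
        // A job that consumes the output of job1 and job2, shuffling on "key".
        MrJob job4 = new MrJob("job4", Arrays.asList("key"), "job1_out", "job2_out");

        System.out.println("IC(job1, job2)  = " + hasInputCorrelation(job1, job2));
        System.out.println("TC(job1, job2)  = " + hasTransitCorrelation(job1, job2));
        System.out.println("TC(job1, job3)  = " + hasTransitCorrelation(job1, job3));
        System.out.println("JFC(job4, job1) = " + hasJobFlowCorrelation(job4, job1));
    }
}
```

In this toy setup job1 and job2 share t1 and the partition key, so they have both IC and TC; job3 shares t1 but not the key, so it has IC only; job4 has JFC with job1 because both shuffle on the same key. The partition key is modelled as an ordered list of column names, which is an assumption of the sketch rather than something the description specifies.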
There are several pieces of work that can be done in the future to improve this optimizer. Here are three examples.

    Support queries that only involve TC;
    Support queries in which the input tables of correlated MR jobs involve intermediate tables; and
    Optimize queries involving self join.

References:
Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

Test Plan: EMPTY

Reviewers: JIRA, ashutoshc

Reviewed By: ashutoshc

CC: brock

Differential Revision: https://reviews.facebook.net/D11097 (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1504395)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/if/queryplan.thrift
* /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer10.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer11.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer12.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer13.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer14.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer6.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer7.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer8.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer9.q
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer11.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer12.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer13.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer14.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer6.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer7.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer9.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml


> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>             Fix For: 0.12.0
>
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle the data (e.g. join and aggregation operations). However, such operations may involve the correlations explained below, in which case they can be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator which has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.
> There are several pieces of work that can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs involve intermediate tables; and
> # Optimize queries involving self join.
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira