Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Thu, 18 Jul 2013 18:20:53 +0000 (UTC)
From: "Hudson (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: <JIRA.12509546.1307515110478.74738.1374171653141@arcas>
In-Reply-To: <JIRA.12509546.1307515110478@arcas>
References: <JIRA.12509546.1307515110478@arcas>
Subject: [jira] [Commented] (HIVE-2206) add a new optimizer for query
 correlation discovery and optimization
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712597#comment-13712597 ]

Hudson commented on HIVE-2206:
------------------------------

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #90 (See [https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/90/])
HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization
(Yin Huai via Ashutosh Chauhan)

Summary:
update test results

This issue proposes a new logical optimizer called the Correlation
Optimizer, which merges correlated MapReduce jobs (MR jobs) into a single
MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).
The paper and slides of YSmart are linked at the bottom.

Because Hive translates queries in a sentence-by-sentence fashion, it
generates a MapReduce job for every operation that may need to shuffle
data (e.g. join and aggregation operations). However, these operations may
involve the correlations explained below, in which case they can be
executed in a single MR job.

	Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
	Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
	Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
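
The three definitions above can be illustrated with a small Python sketch.
Everything in it is hypothetical: `MRJob`, its fields, the helper functions
and the table names are made up for illustration and are not Hive classes.
It also assumes that a "child node" is a job whose output the current job
consumes; this message does not spell that out.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass
class MRJob:
    """Hypothetical model of one MapReduce job (not a Hive class)."""
    name: str
    inputs: FrozenSet[str]           # input relations (tables) read by the job
    partition_key: Tuple[str, ...]   # columns used to partition map output
    # Assumption: child nodes are the jobs whose output this job consumes.
    children: List["MRJob"] = field(default_factory=list)


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    """IC: the input relation sets of the two jobs are not disjoint."""
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    """TC: input correlation plus the same partition key."""
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    """JFC: `child` is a child node of `job` and both share a partition key."""
    is_child = any(c is child for c in job.children)
    return is_child and job.partition_key == child.partition_key


# Hypothetical jobs: a join and an aggregation feeding a final join.
join_job = MRJob("join", frozenset({"lineitem", "orders"}), ("orderkey",))
agg_job = MRJob("agg", frozenset({"lineitem"}), ("orderkey",))
other_job = MRJob("other", frozenset({"lineitem"}), ("partkey",))
top_job = MRJob("top", frozenset(), ("orderkey",), children=[join_job, agg_job])
```

In this example `join_job` and `agg_job` share the `lineitem` input and the
same partition key, so they have both IC and TC, while `other_job` has only
IC with them because its partition key differs.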

The current implementation of the correlation optimizer detects
correlations only among MR jobs for reduce-side join operators and
reduce-side aggregation operators (not map-only aggregation). A query will
be optimized if it satisfies the following conditions.

	There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
	All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
	No self join is involved in those correlated MR jobs.
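
The conditions above can be read as a simple eligibility check. The Python
sketch below is only an illustration under stated assumptions: `Job`,
`is_candidate`, the `intermediate_inputs` flag and the "repeated table name
means self join" convention are all hypothetical, and this message does not
show how the patch itself evaluates each condition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Job:
    """Hypothetical model of an MR job; not a Hive class."""
    kind: str                          # "join", "groupby", or anything else
    partition_key: Tuple[str, ...]
    inputs: List[str]                  # tables read directly by this job
    intermediate_inputs: bool = False  # assumption: True if an input is a sub-query result
    parents: List["Job"] = field(default_factory=list)  # upstream MR jobs


def is_candidate(job: Job) -> bool:
    """Return True if `job` and its parents satisfy the three conditions."""
    # Only reduce-side join / aggregation jobs with parents are considered.
    if job.kind not in ("join", "groupby") or not job.parents:
        return False
    correlated = [job] + job.parents
    # Condition 1: JFC (same partition key) with every parent job.
    if any(p.partition_key != job.partition_key for p in job.parents):
        return False
    # Condition 2: all inputs are original tables, not intermediate ones.
    if any(j.intermediate_inputs for j in correlated):
        return False
    # Condition 3: no self join (assumption: a join reading one table twice).
    if any(j.kind == "join" and len(set(j.inputs)) != len(j.inputs)
           for j in correlated):
        return False
    return True


# Hypothetical plan: a join and an aggregation feeding a final join.
p_join = Job("join", ("k",), ["a", "b"])
p_agg = Job("groupby", ("k",), ["a"])
top = Job("join", ("k",), [], parents=[p_join, p_agg])
```

Here `is_candidate(top)` holds because both parents share the partition key
`("k",)`, read only original tables, and contain no self join.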

The correlation optimizer is implemented as a logical optimizer. The main
reasons are that it only needs to manipulate the query plan tree and that
it can leverage the existing components for generating MR jobs.

The current implementation can serve as a framework for correlation-related
optimizations. I think that it is better than adding individual optimizers.

Several things can be done in the future to improve this optimizer. Here
are three examples.

	Support queries that involve only TC;
	Support queries in which the input tables of correlated MR jobs include intermediate tables; and
	Optimize queries involving self join.

References:
Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

Test Plan: EMPTY

Reviewers: JIRA, ashutoshc

Reviewed By: ashutoshc

CC: brock

Differential Revision: https://reviews.facebook.net/D11097 (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1504395)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/if/queryplan.thrift
* /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer10.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer11.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer12.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer13.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer14.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer6.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer7.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer8.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer9.q
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer11.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer12.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer13.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer14.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer6.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer7.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer9.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>             Fix For: 0.12.0
>
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt,
>                      HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt,
>                      HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt,
>                      HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt,
>                      HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt,
>                      HIVE-2206.1.patch.txt, HIVE-2206.20-r1434012.patch.txt,
>                      HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt,
>                      HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt,
>                      HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt,
>                      HIVE-2206.8-r1237253.patch.txt, HIVE-2206.D11097.10.patch,
>                      HIVE-2206.D11097.11.patch, HIVE-2206.D11097.12.patch,
>                      HIVE-2206.D11097.13.patch, HIVE-2206.D11097.14.patch,
>                      HIVE-2206.D11097.15.patch, HIVE-2206.D11097.16.patch,
>                      HIVE-2206.D11097.17.patch, HIVE-2206.D11097.18.patch,
>                      HIVE-2206.D11097.19.patch, HIVE-2206.D11097.1.patch,
>                      HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch,
>                      HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch,
>                      HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch,
>                      HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch,
>                      HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira