hive-dev mailing list archives

From "Phabricator (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization
Date Wed, 17 Jul 2013 22:02:52 GMT

     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-2206:
------------------------------

    Attachment: HIVE-2206.D11097.19.patch

yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery
and optimization".

    - address Ashutosh's comments

Reviewers: ashutoshc, JIRA

REVISION DETAIL
  https://reviews.facebook.net/D11097

CHANGE SINCE LAST DIFF
  https://reviews.facebook.net/D11097?vs=35721&id=35841#toc

AFFECTED FILES
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
  conf/hive-default.xml.template
  ql/if/queryplan.thrift
  ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
  ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
  ql/src/test/queries/clientpositive/correlationoptimizer1.q
  ql/src/test/queries/clientpositive/correlationoptimizer10.q
  ql/src/test/queries/clientpositive/correlationoptimizer11.q
  ql/src/test/queries/clientpositive/correlationoptimizer12.q
  ql/src/test/queries/clientpositive/correlationoptimizer13.q
  ql/src/test/queries/clientpositive/correlationoptimizer14.q
  ql/src/test/queries/clientpositive/correlationoptimizer2.q
  ql/src/test/queries/clientpositive/correlationoptimizer3.q
  ql/src/test/queries/clientpositive/correlationoptimizer4.q
  ql/src/test/queries/clientpositive/correlationoptimizer5.q
  ql/src/test/queries/clientpositive/correlationoptimizer6.q
  ql/src/test/queries/clientpositive/correlationoptimizer7.q
  ql/src/test/queries/clientpositive/correlationoptimizer8.q
  ql/src/test/queries/clientpositive/correlationoptimizer9.q
  ql/src/test/results/clientpositive/correlationoptimizer1.q.out
  ql/src/test/results/clientpositive/correlationoptimizer10.q.out
  ql/src/test/results/clientpositive/correlationoptimizer11.q.out
  ql/src/test/results/clientpositive/correlationoptimizer12.q.out
  ql/src/test/results/clientpositive/correlationoptimizer13.q.out
  ql/src/test/results/clientpositive/correlationoptimizer14.q.out
  ql/src/test/results/clientpositive/correlationoptimizer2.q.out
  ql/src/test/results/clientpositive/correlationoptimizer3.q.out
  ql/src/test/results/clientpositive/correlationoptimizer4.q.out
  ql/src/test/results/clientpositive/correlationoptimizer5.q.out
  ql/src/test/results/clientpositive/correlationoptimizer6.q.out
  ql/src/test/results/clientpositive/correlationoptimizer7.q.out
  ql/src/test/results/clientpositive/correlationoptimizer8.q.out
  ql/src/test/results/clientpositive/correlationoptimizer9.q.out
  ql/src/test/results/compiler/plan/groupby2.q.xml
  ql/src/test/results/compiler/plan/groupby3.q.xml

To: JIRA, ashutoshc, yhuai
Cc: brock

                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt,
HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt,
HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt,
HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.20-r1434012.patch.txt,
HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt,
HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt,
HIVE-2206.8-r1237253.patch.txt, HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, HIVE-2206.D11097.12.patch,
HIVE-2206.D11097.13.patch, HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, HIVE-2206.D11097.16.patch,
HIVE-2206.D11097.17.patch, HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, HIVE-2206.D11097.1.patch,
HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch,
HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch,
testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer, which merges
correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart
(http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a separate
MapReduce job for every operation that may need to shuffle the data (e.g. join and
aggregation operations). However, such operations may be correlated in the ways explained
below, and correlated operations can be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation
sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not
only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes
if it has the same partition key as that child node.
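
The three definitions above can be read as simple predicates over a job's input tables and partition key. The following is only a minimal sketch under that reading, not code from the patch: the `MRJob` class, its fields, and the function names are hypothetical and do not correspond to Hive classes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MRJob:
    """Hypothetical, simplified stand-in for one MapReduce job."""
    name: str
    inputs: FrozenSet[str]          # names of the input relations (tables)
    partition_key: Tuple[str, ...]  # columns used to shuffle the data


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    # IC: the input relation sets are not disjoint.
    return bool(a.inputs & b.inputs)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    # JFC: the job uses the same partition key as its child node.
    return job.partition_key == child.partition_key


# Illustrative jobs; table and column names are made up.
join_ab = MRJob("join_ab", frozenset({"a", "b"}), ("key",))
agg_a = MRJob("agg_a", frozenset({"a"}), ("key",))
agg_c = MRJob("agg_c", frozenset({"c"}), ("val",))
```

Here `join_ab` and `agg_a` share table `a` and the partition key `key`, so they have both IC and TC, while `agg_c` shares neither an input nor a key with them.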
> The current implementation of the correlation optimizer only detects correlations among MR
jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation).
A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator
that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate
tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
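
As a rough illustration of the three conditions above, the check below walks one job and its parent jobs. It is a sketch under simplifying assumptions, not the logic in CorrelationOptimizer.java: the `Job` class, the `kind` labels, and `is_candidate` are hypothetical; `inputs` lists only the tables a job scans directly; a repeated table name stands for a self join; and a job with no parents is treated as having nothing to merge.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Job:
    """Hypothetical, simplified stand-in for one MapReduce job."""
    name: str
    kind: str                       # "join", "groupby", or anything else
    partition_key: Tuple[str, ...]  # columns used to shuffle the data
    inputs: List[str]               # tables scanned directly by this job
    parents: List["Job"] = field(default_factory=list)


def is_candidate(job: Job, base_tables: Set[str]) -> bool:
    # Condition 1: a reduce-side join or aggregation job that has JFC
    # (same partition key) with all of its parent jobs.
    if job.kind not in ("join", "groupby"):
        return False
    if not job.parents:  # assumption: no parents means nothing to merge
        return False
    if any(p.partition_key != job.partition_key for p in job.parents):
        return False
    for j in [job] + job.parents:
        # Condition 2: every input is an original (base) table.
        if any(t not in base_tables for t in j.inputs):
            return False
        # Condition 3: no self join (no table listed twice in one job).
        if len(set(j.inputs)) != len(j.inputs):
            return False
    return True
```

For example, a `groupby` job keyed on `("key",)` whose only parent is a `join` job over base tables `a` and `b` with the same key passes the check; changing the parent's key, listing `a` twice, or removing `b` from `base_tables` makes it fail.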
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that
it only needs to manipulate the query plan tree and that it can leverage the existing component
for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations.
I think that it is better than adding individual optimizers.
> Several things can be done in the future to improve this optimizer. Here are
three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs involve intermediate tables;
and
> # Optimize queries involving self joins.
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
