hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Phabricator (JIRA)" <>
Subject [jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization
Date Wed, 17 Jul 2013 02:00:50 GMT


Phabricator commented on HIVE-2206:

ashutoshc has commented on the revision "HIVE-2206 [jira] add a new optimizer for query correlation
discovery and optimization".

  Few more comments. See which of these apply. If they doesn't apply, feel free to ignore.

What does this function do?
Will be good to add comment stating when table == null
It seems like lot of logic here is shared with CommonJoinTaskDispatcher. It will be good to
have that refactored so that its reusable here.
Seems like this method always return true. So, this is not required.
Add a comment saying that tree walking is done and now you will apply transformations which
you have detected.
Do we really need hasBeenRemoved() check?
getKeyCols().size() is not a good check. I will recommend to test explictly for operators
which we are supporting right now.
Do we still need this?
We should do jobFlowCorrelation as another pass in transform().
It will be good to add some ascii art which shows what tree structure we are returning from
this function.
It will good to add javadoc for this method.
Didn't understand this comment. Probably we can word it better.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I dont think we need to
revert to oldTag here. We can keep using newTag.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Doesnt look like you
are using these OIs. Probably we can get rid of these.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ It will be good to add
comments for whats the intent of this for loop.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Why is it called NumOriginalParents?
can it be just numOfParents
  ql/src/java/org/apache/hadoop/hive/ql/exec/ There is already a forward
in Demux, this should not be needed.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ You dont need this constructor
Looks like this map is not used anymore, lets get rid of this.
It will be good to add comments about what this method is intending to do
This method straight away calls another method. We can eliminate this one.




To: JIRA, ashutoshc, yhuai
Cc: brock

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>                 Key: HIVE-2206
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt,
HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt,
HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt,
HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.20-r1434012.patch.txt,
HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt,
HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt,
HIVE-2206.8-r1237253.patch.txt, HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, HIVE-2206.D11097.12.patch,
HIVE-2206.D11097.13.patch, HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, HIVE-2206.D11097.16.patch,
HIVE-2206.D11097.17.patch, HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch,
HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch,
HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q,
> This issue proposes a new logical optimizer called Correlation Optimizer, which is used
to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart
( The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence by sentence fashion, for every operation
which may need to shuffle the data (e.g. join and aggregation operations), Hive will generate
a MapReduce job for that operation. However, for those operations which may need to shuffle
the data, they may involve correlations explained below and thus can be executed in a single
MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation
sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not
only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its child nodes
if it has the same partition key as that child node.
> The current implementation of correlation optimizer only detect correlations among MR
jobs for reduce-side join operators and reduce-side aggregation operators (not map only aggregation).
A query will be optimized if it satisfies following conditions.
> # There exists a MR job for reduce-side join operator or reduce side aggregation operator
which have JFC with all of its parents MR jobs (TCs will be also exploited if JFC exists);
> # All input tables of those correlated MR job are original input tables (not intermediate
tables generated by sub-queries); and 
> # No self join is involved in those correlated MR jobs.
> Correlation optimizer is implemented as a logical optimizer. The main reasons are that
it only needs to manipulate the query plan tree and it can leverage the existing component
on generating MR jobs.
> Current implementation can serve as a framework for correlation related optimizations.
I think that it is better than adding individual optimizers. 
> There are several work that can be done in future to improve this optimizer. Here are
three examples.
> # Support queries only involve TC;
> # Support queries in which input tables of correlated MR jobs involves intermediate tables;
> # Optimize queries involving self join. 
> References:
> Paper and presentation of YSmart.
> Paper:
> Slides:

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message