hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1018) pushing down group-by before joins
Date Tue, 29 Dec 2009 18:30:29 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795098#action_12795098

Ning Zhang commented on HIVE-1018:

Thanks for the link Yongqiang. The paper mostly discusses the optimization of extending join
tree with group by and deciding how to order the join-groupby tree. Since we don't have stats
and costing framework available, and we handle multiway joins rather than binary joins, join
ordering is not the major issue. The only optimization part is whether do groupby first or
join first. As the first step, we can implement a hint to pushdown grouping operator first.

> pushing down group-by before joins
> ----------------------------------
>                 Key: HIVE-1018
>                 URL: https://issues.apache.org/jira/browse/HIVE-1018
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
> Queries with both Group-by and Joins are very common and they are expensive operations.
By default Hive evalutes Joins first then group-by. Sometimes it is possible to rewrite queries
to apply group-by (or map-side partial group by) first before join. This will remove a lot
of duplicated keys in joins and alleviate skewness in join keys for this case. This rewrite
should be cost-based. Before we have the stats and the CB framework, we can give users hints
to do the rewrite. 
> A particular case is where the join keys are the same as the grouping keys. Or the group
keys is a superset of the join keys (so that grouping won't affect the result of joins). 
> Examples:
> -- Q1
> select A.key, B.key
> from A join B on (A.key=B.key)
> group by A.key, B.key;
> --Q2
> select distinct A.key, B.key
> from A join B on (A.key=B.key);
> --Q3, aggregation function is sum, count, min, max, (avg and median cannot be handled).
> selec A.key, sum(A.value), count(1), min(value), max(value)
> from A left semi join B on (A.key=B.key)
> group by A.key;
> -- Q4. grouping keys is a superset of join keys
> select distinct A.key, A.value
> from A join B on (A.key=B.key)
> In the case of join keys are not a subset of grouping keys, we can introduce a map-side
partial grouping operator with the keys of the UNION of the join and grouping keys, to remove
unnecessary duplications. This should be cost-based though. 
> Any thoughts and suggestions?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message