hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ning Zhang (JIRA)" <>
Subject [jira] Commented: (HIVE-1018) pushing down group-by before joins
Date Wed, 23 Jun 2010 17:10:51 GMT


Ning Zhang commented on HIVE-1018:

Good points Joy. It will be interesting to see what are the typical use cases you have combining
join and GroupBy. Previous what in my mind here is to optimize away the very bad case of skewness
in the join (many rows with the same join key).  Since GroupBy eliminates the skewness, these
rewrite rules push down GroupBy before JOIN for these special cases. What you have mentioned
are definitely what we should optimize for these cases. The are helpful for the general cases
(non-skewed join) as well. 

> pushing down group-by before joins
> ----------------------------------
>                 Key: HIVE-1018
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
> Queries with both Group-by and Joins are very common and they are expensive operations.
By default Hive evalutes Joins first then group-by. Sometimes it is possible to rewrite queries
to apply group-by (or map-side partial group by) first before join. This will remove a lot
of duplicated keys in joins and alleviate skewness in join keys for this case. This rewrite
should be cost-based. Before we have the stats and the CB framework, we can give users hints
to do the rewrite. 
> A particular case is where the join keys are the same as the grouping keys. Or the group
keys is a superset of the join keys (so that grouping won't affect the result of joins). 
> Examples:
> -- Q1
> select A.key, B.key
> from A join B on (A.key=B.key)
> group by A.key, B.key;
> --Q2
> select distinct A.key, B.key
> from A join B on (A.key=B.key);
> --Q3, aggregation function is sum, count, min, max, (avg and median cannot be handled).
> selec A.key, sum(A.value), count(1), min(value), max(value)
> from A left semi join B on (A.key=B.key)
> group by A.key;
> -- Q4. grouping keys is a superset of join keys
> select distinct A.key, A.value
> from A join B on (A.key=B.key)
> In the case of join keys are not a subset of grouping keys, we can introduce a map-side
partial grouping operator with the keys of the UNION of the join and grouping keys, to remove
unnecessary duplications. This should be cost-based though. 
> Any thoughts and suggestions?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message