hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Namit Jain (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1018) pushing down group-by before joins
Date Mon, 23 Aug 2010 22:36:16 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901637#action_12901637

Namit Jain commented on HIVE-1018:

I think a immediate use case is to convert the following job:

select /*+ MAPJOIN(T2) */ count(1) from T1 join T2 on T1.x=T2.y;

into a single MR job, if T2 is small.

> pushing down group-by before joins
> ----------------------------------
>                 Key: HIVE-1018
>                 URL: https://issues.apache.org/jira/browse/HIVE-1018
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
> Queries with both Group-by and Joins are very common and they are expensive operations.
By default Hive evalutes Joins first then group-by. Sometimes it is possible to rewrite queries
to apply group-by (or map-side partial group by) first before join. This will remove a lot
of duplicated keys in joins and alleviate skewness in join keys for this case. This rewrite
should be cost-based. Before we have the stats and the CB framework, we can give users hints
to do the rewrite. 
> A particular case is where the join keys are the same as the grouping keys. Or the group
keys is a superset of the join keys (so that grouping won't affect the result of joins). 
> Examples:
> -- Q1
> select A.key, B.key
> from A join B on (A.key=B.key)
> group by A.key, B.key;
> --Q2
> select distinct A.key, B.key
> from A join B on (A.key=B.key);
> --Q3, aggregation function is sum, count, min, max, (avg and median cannot be handled).
> selec A.key, sum(A.value), count(1), min(value), max(value)
> from A left semi join B on (A.key=B.key)
> group by A.key;
> -- Q4. grouping keys is a superset of join keys
> select distinct A.key, A.value
> from A join B on (A.key=B.key)
> In the case of join keys are not a subset of grouping keys, we can introduce a map-side
partial grouping operator with the keys of the UNION of the join and grouping keys, to remove
unnecessary duplications. This should be cost-based though. 
> Any thoughts and suggestions?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message