hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-609) optimize multi-group by
Date Fri, 17 Jul 2009 22:42:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732765#action_12732765 ]

Zheng Shao commented on HIVE-609:
---------------------------------

@hive.609.2.patch: Reviewed with namit offline. Here are the comments:
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:89
    Used by hash distinct aggregation when hashGrpKeyNotRedKey is true.
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:98
    Shall we rename it to "reduceGroupKeyIsDistinctExpr"? I think this is more accurate than
    "groupbyKeyIsNotReduceKey".
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:401
    I just found that "hashAggr" and "hashDistinctAggr" are always used together, so we only
    need to pass one parameter to this function. What about some javadoc for this function
    (mainly for these parameters, because I think they are not easy to understand)?
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3377
    Change name to getCommonDistinctExpr.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3397
    assert not valid
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3404
    Instead of computing colExprMap for every reduceSinkDesc(), it can be computed offline,
    since all the info is available.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3462
    This check is not needed.
ql/src/java/org/apache/hadoop/hive/ql/plan/exprNodeDesc.java:40
    More comments.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:1590
    More comments. Boolean var to track distPartAgg || (... DIST). In one case, the partial
    results have already been computed.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3314
    Merge with optimizeGroupby.

> optimize multi-group by 
> ------------------------
>
>                 Key: HIVE-609
>                 URL: https://issues.apache.org/jira/browse/HIVE-609
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.609.1.patch, hive.609.2.patch
>
>
> For a query like:
> from src
> insert overwrite table dest1 select col1, count(distinct colx) group by col1
> insert overwrite table dest2 select col2, count(distinct colx) group by col2;
> If map side aggregation is turned off, we currently do 4 map-reduce jobs.
> The plan can be optimized by running it in 3 map-reduce jobs, by spraying over the
> distinct column first and then aggregating individual results.
> This may not be possible if there are multiple distinct columns, but the above query
> is very common in data warehousing environments.
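
The "spray over the distinct column first" idea in the description can be illustrated outside Hive with a toy simulation. This is only a sketch of one reading of the proposal, not code from the patch: the function names, the (col1, col2, colx) row layout, and the reducer count are assumptions made for illustration, and NULL handling is ignored. The point it shows is that once all rows sharing a colx value land on the same reducer, each reducer can de-duplicate (group key, colx) pairs locally, and the later per-destination steps only need to sum partial counts.

```python
from collections import Counter, defaultdict


def spray_by_distinct_column(rows, num_reducers=3):
    """First job (assumed shape): partition rows on colx only, not on a group key."""
    partitions = defaultdict(list)
    for row in rows:
        colx = row[2]
        partitions[hash(colx) % num_reducers].append(row)
    return list(partitions.values())


def partial_distinct_counts(partition, key_index):
    """Within one reducer, count distinct colx values per group key.

    Every row with a given colx is in this partition, so a (key, colx)
    pair counted here cannot be counted again by another reducer.
    """
    pairs = {(row[key_index], row[2]) for row in partition}
    return Counter(key for key, _ in pairs)


def multi_group_by_distinct(rows, num_reducers=3):
    """Return (dest1, dest2): count(distinct colx) grouped by col1 and by col2."""
    partitions = spray_by_distinct_column(rows, num_reducers)
    dest1, dest2 = Counter(), Counter()
    # Follow-up jobs (one per destination in this sketch): sum the partial counts.
    for partition in partitions:
        dest1.update(partial_distinct_counts(partition, 0))
        dest2.update(partial_distinct_counts(partition, 1))
    return dict(dest1), dict(dest2)


rows = [
    ("a", "x", 1),
    ("a", "y", 1),
    ("a", "x", 2),
    ("b", "y", 2),
    ("b", "y", 2),  # duplicate row; must not be double counted
]
dest1, dest2 = multi_group_by_distinct(rows)
```

For the sample rows, dest1 maps "a" to 2 and "b" to 1, and dest2 maps both "x" and "y" to 2, regardless of how many reducers are used. With more than one distinct column there is no single column to spray on, which matches the caveat in the description.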

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

