hive-dev mailing list archives

From "Sun Rui (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6120) Add GroupBy optimization to eliminate un-needed partial distinct aggregations
Date Sun, 29 Dec 2013 13:48:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858345#comment-13858345 ]

Sun Rui commented on HIVE-6120:
-------------------------------

Review Board entry: https://reviews.apache.org/r/16504/

> Add GroupBy optimization to eliminate un-needed partial distinct aggregations
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-6120
>                 URL: https://issues.apache.org/jira/browse/HIVE-6120
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>         Attachments: HIVE-6120.1.patch
>
>
> In most cases, partial distinct aggregation is not needed in the map-side GroupBy. The
> exception is sorted, bucketized tables, where the mappers can perform partial distinct
> aggregation in some scenarios, as GroupByOptimizer does.
> Currently, partial distinct aggregation is performed in the map-side GroupBy and the partial
> result is then shuffled by the following ReduceSink operator, even in cases where neither
> is needed. This wastes CPU cycles, memory and network bandwidth.
> This optimization eliminates un-needed partial distinct aggregations, which improves
> performance and reduces memory usage.
> For example,
> EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;
> Before optimization:
> {noformat}
>               Group By Operator
>                 aggregations:
>                       expr: count(DISTINCT value)
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: int
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1, _col2
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: int
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: int
>                   tag: -1
>                   value expressions:
>                         expr: _col2
>                         type: bigint
> {noformat}
> After optimization:
> {noformat}
>               Group By Operator
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: int
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: int
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: int
>                   tag: -1
> {noformat}
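
The effect of the two plans above can be illustrated with a toy model. The sketch below is not Hive code: the function names are hypothetical, and it assumes that the reduce side derives count(DISTINCT value) from the shuffled (key, value) pairs alone, so the extra _col2 column in the "before" plan contributes nothing to the result.

```python
from collections import defaultdict


def map_with_partial(rows):
    """Toy map-side hash GroupBy that also emits a partial value (mimics _col2)."""
    table = {}
    for key, value in rows:
        # One hash entry per distinct (key, value) pair; the stored 1 is an
        # assumed stand-in for the partial aggregation buffer.
        table[(key, value)] = 1
    return sorted((k, v, partial) for (k, v), partial in table.items())


def map_without_partial(rows):
    """Toy map-side hash GroupBy that only de-duplicates (key, value) pairs."""
    return sorted({(key, value) for key, value in rows})


def reduce_count_distinct(records):
    """Toy reducer: counts distinct values per key using the first two columns only."""
    distinct = defaultdict(set)
    for record in records:
        key, value = record[0], record[1]  # any further column is ignored
        distinct[key].add(value)
    return {key: len(values) for key, values in distinct.items()}


rows = [(1, "a"), (1, "a"), (1, "b"), (2, "c")]
before = map_with_partial(rows)     # three columns per shuffled record
after = map_without_partial(rows)   # two columns per shuffled record
print(reduce_count_distinct(before) == reduce_count_distinct(after))
```

In this model both plans shuffle the same set of (key, value) pairs and the reducer returns the same counts; the only difference is the extra column carried per record in the first plan.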



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
