hive-dev mailing list archives

From "Brock Noland (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7156) Group-By operator stat-annotation only uses distinct approx to generate rollups
Date Sun, 05 Oct 2014 18:29:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159618#comment-14159618 ]

Brock Noland commented on HIVE-7156:
------------------------------------

bq. This works exactly as expected when tez is not being used.

This is not exactly true: this change creates a non-optional dependency on Tez. The story
behind alternative execution engines has always been that they are completely optional. This
is codified by the fact that the Tez deps are marked optional.

This code has to be modified to remove the required tez dependency.
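
For illustration only, one common way to keep an engine optional is to probe for it by class
name at runtime instead of linking against it directly. The sketch below is not the patch for
this issue: the OptionalEngineProbe class and its method are hypothetical, and the Tez class
name is used only as an example string.

```java
final class OptionalEngineProbe {
    private OptionalEngineProbe() {}

    /**
     * Returns true if the named class can be located on the classpath.
     * The class is looked up without being initialized, so no static
     * initializers of the optional library run during the probe.
     */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, OptionalEngineProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class or missing transitive dependency: treat as unavailable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Example class name only; any class from the optional library would do.
        String tezClass = "org.apache.tez.dag.api.TezConfiguration";
        System.out.println("tez available: " + isOnClasspath(tezClass));
    }
}
```

Code paths that need the optional engine would then be reached only when the probe succeeds,
so the rest of the code base compiles and runs without the engine's jars present.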

> Group-By operator stat-annotation only uses distinct approx to generate rollups
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-7156
>                 URL: https://issues.apache.org/jira/browse/HIVE-7156
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.14.0
>            Reporter: Gopal V
>            Assignee: Prasanth J
>            Priority: Blocker
>              Labels: TODOC14
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7156.1.patch, HIVE-7156.2.patch, HIVE-7156.3.patch, HIVE-7156.4.patch, HIVE-7156.5.patch, HIVE-7156.6.patch, HIVE-7156.7.patch, HIVE-7156.8.patch, HIVE-7156.8.patch, HIVE-7156.9.patch, hive-debug.log.bz2
>
>
> The stats annotation for a group-by only annotates the reduce-side row-count with the distinct values.
> The map-side gets the row-count as the rows output instead of distinct * parallelism, while the reducer side gets the correct parallelism.
> {code}
> hive> explain select distinct L_SHIPDATE from lineitem;
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: lineitem
>                   Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: l_shipdate (type: string)
>                     outputColumnNames: l_shipdate
>                     Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       keys: l_shipdate (type: string)
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: string)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: string)
>                         Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Reducer 2 
>             Reduce Operator Tree:
>               Group By Operator
>                 keys: KEY._col0 (type: string)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: _col0 (type: string)
>                   outputColumnNames: _col0
>                   Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
> {code}
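
To make the expected numbers concrete, here is a rough sketch of the estimate the description
asks for, using the row count and distinct count from the plan above. The cap at the input row
count and the parallelism value of 1000 are assumptions for illustration. The class and method
names are hypothetical and are not Hive's actual stats-annotation code.

```java
final class GroupByRowEstimate {
    private GroupByRowEstimate() {}

    /** Multiplies two non-negative longs, saturating at Long.MAX_VALUE on overflow. */
    static long saturatingMultiply(long a, long b) {
        if (a == 0 || b == 0) {
            return 0;
        }
        return (a > Long.MAX_VALUE / b) ? Long.MAX_VALUE : a * b;
    }

    /**
     * Map-side hash aggregation: each of the parallel tasks can emit up to
     * distinctKeys rows, so the estimate is distinctKeys * parallelism.
     * Assumption: the result is capped at inputRows, since aggregation
     * cannot emit more rows than it reads.
     */
    static long mapSideRows(long inputRows, long distinctKeys, long parallelism) {
        return Math.min(inputRows, saturatingMultiply(distinctKeys, parallelism));
    }

    /** Reduce-side merge: one row per distinct key, again capped at inputRows. */
    static long reduceSideRows(long inputRows, long distinctKeys) {
        return Math.min(inputRows, distinctKeys);
    }

    public static void main(String[] args) {
        long inputRows = 5999989709L; // Num rows from the TableScan above
        long distinctKeys = 1955L;    // Num rows from Reducer 2 above
        long parallelism = 1000L;     // hypothetical number of map tasks

        System.out.println(mapSideRows(inputRows, distinctKeys, parallelism)); // prints 1955000
        System.out.println(reduceSideRows(inputRows, distinctKeys));           // prints 1955
    }
}
```

Under these assumptions the map-side Group By Operator would be annotated with roughly
1955000 rows rather than the 5999989709 shown in the plan.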



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
