hive-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7156) Group-By operator stat-annotation only uses distinct approx to generate rollups
Date Wed, 01 Oct 2014 21:19:38 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155554#comment-14155554 ]

Gopal V commented on HIVE-7156:
-------------------------------

bq. My point is, it's probably better if we have clean code path if anything is related to
execution engine, but this method doesn't seem resembling anything like that.

Agreed. The CBO rules should be the only ones estimating data sizes moving between operators.

As more optimizations move into cost-based (CBO) rules, we'll deprecate these stat-annotation
rules entirely.

> Group-By operator stat-annotation only uses distinct approx to generate rollups
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-7156
>                 URL: https://issues.apache.org/jira/browse/HIVE-7156
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.14.0
>            Reporter: Gopal V
>            Assignee: Prasanth J
>            Priority: Blocker
>              Labels: TODOC14
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7156.1.patch, HIVE-7156.2.patch, HIVE-7156.3.patch, HIVE-7156.4.patch, HIVE-7156.5.patch, HIVE-7156.6.patch, HIVE-7156.7.patch, HIVE-7156.8.patch, HIVE-7156.8.patch, HIVE-7156.9.patch, hive-debug.log.bz2
>
>
> The stats annotation for a group-by only annotates the reduce-side row-count with the distinct values.
> The map-side gets the row-count as the rows output instead of distinct * parallelism, while the reducer side gets the correct parallelism.
> {code}
> hive> explain select distinct L_SHIPDATE from lineitem;
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: lineitem
>                   Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: l_shipdate (type: string)
>                     outputColumnNames: l_shipdate
>                     Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       keys: l_shipdate (type: string)
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: string)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: string)
>                         Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Reducer 2 
>             Reduce Operator Tree:
>               Group By Operator
>                 keys: KEY._col0 (type: string)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: _col0 (type: string)
>                   outputColumnNames: _col0
>                   Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
> {code}
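
The estimate the issue description asks for (map-side rows as distinct * parallelism, reduce-side rows as the distinct count) can be sketched roughly as below. This is only an illustration, not Hive's actual stats-annotation code: the class and method names are hypothetical, the cap at the input row count is an added assumption (a hash group-by cannot emit more rows than it reads), and the parallelism of 1000 map tasks is a made-up value, since the plan above does not show it.

```java
public class GroupByEstimate {

    // Hypothetical helper: estimated rows leaving the map-side (hash) group-by.
    // Each map task can emit up to one row per distinct key, so the estimate is
    // distinctValues * parallelism, capped at inputRows (cap is an assumption).
    static long mapSideRows(long inputRows, long distinctValues, long parallelism) {
        if (inputRows < 0 || distinctValues < 0 || parallelism < 1) {
            throw new IllegalArgumentException("invalid statistics");
        }
        long perPlan;
        try {
            perPlan = Math.multiplyExact(distinctValues, parallelism);
        } catch (ArithmeticException overflow) {
            // The product overflowed a long, so the input row count is the tighter bound.
            return inputRows;
        }
        return Math.min(inputRows, perPlan);
    }

    // Hypothetical helper: estimated rows leaving the reduce-side (mergepartial)
    // group-by, which is one row per distinct key.
    static long reduceSideRows(long inputRows, long distinctValues) {
        return Math.min(inputRows, distinctValues);
    }

    public static void main(String[] args) {
        long inputRows = 5999989709L; // Num rows from the TableScan in the plan above
        long distinct = 1955L;        // Num rows on the reducer side in the plan above
        long mapTasks = 1000L;        // assumed parallelism, not taken from the plan

        System.out.println("map-side rows:    " + mapSideRows(inputRows, distinct, mapTasks));
        System.out.println("reduce-side rows: " + reduceSideRows(inputRows, distinct));
    }
}
```

Under these assumed numbers the map-side estimate is a few million rows rather than the 5999989709 shown on the map-side Group By Operator in the plan.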



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
