hive-dev mailing list archives

From Hari Sankar Sivarama Subramaniyan <>
Subject Re: Review Request 43115: HIVE-12924 CBO: Calcite Operator To Hive Operator (Calcite Return Path): TestCliDriver groupby_ppr_multi_distinct.q failure
Date Thu, 04 Feb 2016 21:29:05 GMT

(Updated Feb. 4, 2016, 9:29 p.m.)

Review request for hive, Jesús Camacho Rodríguez and John Pullokkaran.


Thanks John for the review.

The naming convention for the Distinct UDAF field for the GBY on the reduce side is: <Last
Reduce Key>:<Current Distinct UDF#>._col_<Distinct Key # in the current Distinct
UDF>. It seems that we currently don't generate the colExprMap correctly for this convention
in HiveGBOpUtil.genMapSideRS(). The reduce-side GBY pipeline looks good to me in the current
return path code. Since we are not generating the entries for the correct columns in the map-side
ReduceSink Operator, we run into an exception when we look up the entry corresponding to a column
in the reduce-side aggregation.
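
To make the convention concrete, here is a minimal Python sketch (not Hive code) that builds and parses field names of the shape seen in the explain output further down, e.g. KEY._col1:1._col1. The "KEY." prefix and the "_col<n>" suffix are read off that output; the helper names (DistinctField, expected_distinct_fields, missing_entries) and the dict standing in for colExprMap are hypothetical and only for illustration.

```python
import re
from typing import Dict, List, NamedTuple


class DistinctField(NamedTuple):
    last_reduce_key: str  # e.g. "_col1", the last key column of the ReduceSink
    udf_index: int        # position of the distinct UDAF (0, 1, 2, ...)
    key_index: int        # position of the distinct key within that UDAF


# Shape assumed from the explain output: KEY.<last key>:<udf#>._col<key#>
_FIELD_RE = re.compile(r"^KEY\.(_col\d+):(\d+)\._col(\d+)$")


def format_distinct_field(field: DistinctField) -> str:
    """Render a field name following the assumed convention."""
    return f"KEY.{field.last_reduce_key}:{field.udf_index}._col{field.key_index}"


def parse_distinct_field(name: str) -> DistinctField:
    """Split a field name back into its three components."""
    match = _FIELD_RE.match(name)
    if match is None:
        raise ValueError(f"not a distinct field name: {name!r}")
    return DistinctField(match.group(1), int(match.group(2)), int(match.group(3)))


def expected_distinct_fields(last_reduce_key: str, keys_per_udf: List[int]) -> List[str]:
    """List every field name the reduce-side GBY would look up."""
    return [
        format_distinct_field(DistinctField(last_reduce_key, udf, key))
        for udf, key_count in enumerate(keys_per_udf)
        for key in range(key_count)
    ]


def missing_entries(col_expr_map: Dict[str, str], expected: List[str]) -> List[str]:
    """Return the expected names that have no entry in the (stand-in) map."""
    return [name for name in expected if name not in col_expr_map]
```

For the query below, the three distinct UDAFs take 1, 2 and 1 keys, so expected_distinct_fields("_col1", [1, 2, 1]) lists the four names the reduce-side GBY refers to. A map that lacks any of them reproduces, in miniature, the kind of failed lookup described above.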

There is another optimization that could possibly be done in the scenario below (after turning
off map-side aggregation):
explain FROM srcpart src SELECT count(DISTINCT src.value), count(DISTINCT src.key,src.key),
sum(DISTINCT src.value) WHERE src.ds = '2008-04-08' GROUP BY substr(src.key,1,1);

The Reduce Operator Tree:
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT KEY._col1:1._col0, KEY._col1:1._col1), sum(DISTINCT KEY._col1:2._col0)
          keys: KEY._col0 (type: string)
          mode: complete
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
As you can see:
1. KEY._col1:1._col0 and KEY._col1:1._col1 are mapped to the same column, so we could have
used a single column in the row schema of the ReduceSink Operator pipeline.
2. KEY._col1:2._col0 and KEY._col1:0._col0 are also mapped to the same column, so the same
applies as in 1.
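
The duplication can be illustrated with a small Python sketch (not Hive code). The mapping from field names to source expressions is read off the query and the explain output above; the function name group_shared_columns and the plain dict representation are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Distinct fields of the reduce-side GBY and the source expression each one
# carries, read off the query: count(DISTINCT src.value),
# count(DISTINCT src.key, src.key), sum(DISTINCT src.value).
FIELD_TO_EXPR: Dict[str, str] = {
    "KEY._col1:0._col0": "src.value",
    "KEY._col1:1._col0": "src.key",
    "KEY._col1:1._col1": "src.key",
    "KEY._col1:2._col0": "src.value",
}


def group_shared_columns(field_to_expr: Dict[str, str]) -> Dict[str, List[str]]:
    """Group fields by source expression, keeping only expressions
    that are carried by more than one field."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for field, expr in field_to_expr.items():
        groups[expr].append(field)
    return {expr: fields for expr, fields in groups.items() if len(fields) > 1}
```

Calling group_shared_columns(FIELD_TO_EXPR) groups the two src.key fields and the two src.value fields, which are the candidates for sharing one column in the row schema.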

I verified that this also happens in the non-return path code, so it should be handled as a
general further optimization in a separate JIRA.


Repository: hive-git


CBO: Calcite Operator To Hive Operator (Calcite Return Path): TestCliDriver groupby_ppr_multi_distinct.q

Diffs (updated)




Precommit runs


Hari Sankar Sivarama Subramaniyan
