hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gopal Vijayaraghavan (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-22448) CBO: Expand the multiple count distinct with a group-by key
Date Fri, 08 Nov 2019 06:51:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-22448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969880#comment-16969880
] 

Gopal Vijayaraghavan commented on HIVE-22448:
---------------------------------------------

bq.  I am trying to understand why this improvement is always a good idea

Unless we change other parts of the shuffle implementation, this is a good idea right now
(specifically, implement a rack level combiner & we can do better than a grouping set
+ hash aggregate).

bq. along with grouping set (which increases the data size)

We're moving around where the data-size increase is actually happening.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L374

bq. adding an extra reducer

That is true, but the distribution function is different between these two end points which
is why we end up getting performance improvements due to distribution function changes with
this implementation when the group by key has a low nDV (like a date) and the count distinct
key has a high nDV (like a user_id).

That is a distributed SQL engine specific improvement where shuffling on a high nDV column
always uses more cpu cores better than a shuffle on a low nDV column (in the query example,
we send the z & x keys across all reducers in the first pass which prevents a skew forming
there).

The no-key group-by is the extreme case of that, which we fixed before, which had only 1 reducer
effectively receiving every single key in the count distinct.

That was originally fixed up in the physical optimizer, but I prefer the Calcite approach
as it is a better place for this.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java#L64

bq. If so, isn't it better to make this optimization statistics based?

Everything is better that way - the statistics aren't absolute and are estimate driven, so
the fact that this optimization right now gives us an improvement on the per-row cost which
is not accounted for in CBO right now (as in the constant multiplier against the row doesn't
change).

As I mentioned before, this change indirectly corrects for a potential skew in the group-by
key which can only be estimated if we have histograms.

> CBO: Expand the multiple count distinct with a group-by key
> -----------------------------------------------------------
>
>                 Key: HIVE-22448
>                 URL: https://issues.apache.org/jira/browse/HIVE-22448
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Gopal Vijayaraghavan
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-22448.01.patch, HIVE-22448.02.patch, HIVE-22448.02.patch, HIVE-22448.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> create temporary table mytable1 (x integer, y integer, z integer, a integer);
> explain cbo
> select z, x, count(distinct y), count(distinct a)
> from mytable1
> group by z, x;
> explain cbo
> select count(distinct y), count(distinct a)
> from mytable1
> {code}
> The first is not vectorized, the second one is because of the grouping-set based rewrite
for count distinct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message