hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
Date Mon, 30 Jul 2018 23:54:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562680#comment-16562680
] 

Ashutosh Chauhan commented on HIVE-20260:
-----------------------------------------

I am not sure this mechanism of {{affectedColumns}} will help. IIUC, effect of this is that
column stats will get updated only for columns involved in filter expression. Columns which
are part of filter operator but are absent from expressions will have their column stats unchanged.
If so, that is not what we want. What we want is to update stats for all columns of filter
operator which is the case before this patch but change the logic of how they are updated.
Before the patch, every condition of filter results in scale down of col stats. That is we
assumed each filter condition independently filters out diff rows. However, If instead we
assume that diff filter conditions filter overlapping rows than we scale down col stats of
column involved in condition only with row count decreased by it. Columns not involved in
condition we can scale down col stats by max decrease of one of conditions.
Perhaps, an example will help. Lets say we have :
{code}
create table t1 (a int, b int, c int);
-- insert 500 rows. ndv(a) = 100 ndv(b) = 50 ndv(c) = 200
select a,b,c from t1 where a = 20 and b = 30;
{code}

Here, when we process a = 20; 
rowcount = 500 / 2 = 250. ndv(a) = 100 * (250/500) = 50. ndv(b) = 50 ndv(c) = 200. b and c's
ndv unchanged.
Then we process b = 30.
rowcount = 250/2 = 125. ndv(b) = 50 * (125/250) = 25. ndv(a) = 50 ndv(c) = 200. a and c's
ndv unchanged.

For b and c we are done since we updated their column stats. For c (columns not included in
filter condition) we updated with largest factor change brought. Here that means 200 * (1/2)
= 100

Logic before this patch would have resulted in change of (125/500) = 1/4 ndv for every column.


Apart from above, second issue is this scaling happen twice: once when filter expression is
processed [1] and then when operator stats are updated[2] . That looks incorrect we should
perhaps remove one of these calls.
[1] : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L355
[2] : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L299

> NDV of a column shouldn't be scaled when row count is changed by filter on another column
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-20260
>                 URL: https://issues.apache.org/jira/browse/HIVE-20260
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>            Reporter: Ashutosh Chauhan
>            Assignee: Zoltan Haindrich
>            Priority: Major
>         Attachments: HIVE-20260.01wip01.patch, HIVE-20260.01wip02.patch
>
>
> HIVE-17465 introduced progressive scaling of rowcounts in presence of multiple filters.
HIVE-19500 improved on that by also scaling col stats (NDV) in such scenario. However, it
should pay attention to column used in filter expression and not scale for all filters. eg.,
> consider filter a = 1 and b = 2 ndv of column b should not be scaled down by row count
changes caused by a = 1
> Other way to say this that ndv of a particular column should be updated at the end of
computation of row count for that operator.
> Here are the possible cases where our estimates can be accurate (or close to)
> {code}
> case 1 - (d_year = 2001 and d_moy=1)
> case 2 - (d_year = 2001 and d_year IN (2001, 2002))
> case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
> case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
> case 5 - (d_date = '1999-01-01')
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message