hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zoltan Haindrich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
Date Tue, 31 Jul 2018 08:21:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563292#comment-16563292
] 

Zoltan Haindrich commented on HIVE-20260:
-----------------------------------------

I feel that considering all column uncorreleated makes more sense than thinking about them
as fully correlated. I've written a test (stat_estimate_drill.q) which have outputted much
worse results without this patch especially in cases like the above example.

The current patch does a little bit different than the above; but to be on the same page -
I would like to show the old behaviour as well:
{code}
create table t1 (a int, b int, c int);
-- insert 500 rows. ndv(a) = 100 ndv(b) = 50 ndv(c) = 200
select a,b,c from t1 where a = 20 and b = 30;
{code}

old (uniform scaling)
|              | rowCount                | ratio          | ndv(a) | ndv(b) | ndv(c) |
|              | 500                     |                | 100    | 50     | 200 |
| a=20         | 1/ndv(a) * rowCount = 5 | 1/ndv(a) = .01 | 1      | .5=>1  | 2 |
| a=20 && b=30 | 1/ndv(b) * rowCount = 5 | 1/ndv(b) = 1   | 1      | 1      | 2 |

the problem with the above is that it have lost diversity of column b and c.

patch
|              | rowCount                | ratio          | ndv(a) | ndv(b) | ndv(c) |
|              | 500                     |                | 100    | 50     | 200 |
| a=20         | 1/ndv(a) * rowCount = 5 | 1/ndv(a) = .01 | 1      | 50  *   | 200 * |
| a=20 && b=30 | 1/ndv(b) * rowCount = .1 => 1 | 1/ndv(b) = .02   | 1      | 1
     | 200 * |

I think it would make sense to limit ndv to rowcount at * places...since it's not possible
to have that many anymore...

About the second note: the patch already takes care of ands more-or-less correctly by clearing
the affected columns when it start evaulating an And:
https://github.com/apache/hive/compare/master...kgyrtkirk:HIVE-20260-stat-ndv#diff-11eb46db88b11b0c0fe63fb1a919f174R350

I think it would be possible to introduce a slider to enable to make this configurable; but
I'm not sure if there are any people who would be wanting to change it...right now I think
it would be better to use the uncorrelated model.


> NDV of a column shouldn't be scaled when row count is changed by filter on another column
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-20260
>                 URL: https://issues.apache.org/jira/browse/HIVE-20260
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>            Reporter: Ashutosh Chauhan
>            Assignee: Zoltan Haindrich
>            Priority: Major
>         Attachments: HIVE-20260.01wip01.patch, HIVE-20260.01wip02.patch
>
>
> HIVE-17465 introduced progressive scaling of rowcounts in presence of multiple filters.
HIVE-19500 improved on that by also scaling col stats (NDV) in such scenario. However, it
should pay attention to column used in filter expression and not scale for all filters. eg.,
> consider filter a = 1 and b = 2 ndv of column b should not be scaled down by row count
changes caused by a = 1
> Other way to say this that ndv of a particular column should be updated at the end of
computation of row count for that operator.
> Here are the possible cases where our estimates can be accurate (or close to)
> {code}
> case 1 - (d_year = 2001 and d_moy=1)
> case 2 - (d_year = 2001 and d_year IN (2001, 2002))
> case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
> case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
> case 5 - (d_date = '1999-01-01')
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message