hive-issues mailing list archives

From "David Mollitor (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-22993) Include Bloom Filter in Column Statistics to Better Estimate nDV
Date Fri, 06 Mar 2020 17:25:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053607#comment-17053607 ]

David Mollitor edited comment on HIVE-22993 at 3/6/20, 5:24 PM:
----------------------------------------------------------------

[~gopalv] Thanks.  Do you know what JIRA introduced this change?  I have been testing on HDP 3.1.

Edit: Can this BIT_VECTOR field be applied to this request for better stats on INSERT?


was (Author: belugabehr):
[~gopalv] Thanks.  Do you know what JIRA introduced this change?  I have been testing on HDP
3.1

> Include Bloom Filter in Column Statistics to Better Estimate nDV
> ----------------------------------------------------------------
>
>                 Key: HIVE-22993
>                 URL: https://issues.apache.org/jira/browse/HIVE-22993
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Statistics
>            Reporter: David Mollitor
>            Priority: Major
>
> When performing an INSERT statement, Hive has no way to determine the number of distinct
> values since the distinct values themselves are not recorded.
> {code:sql}
> create table test_mm(`id` int, `my_dt` date);
> insert into test_mm values (1, "2018-10-01"), (2, "2018-10-01"), (3, "2018-10-01"),
> (4, "2017-10-01"), (5, "2017-10-01"), (6, "2017-10-01"),
> (7, "2010-10-01"), (8, "2010-10-01"), (9, "2010-10-01"),
> (10, "1998-10-01"), (11, "1998-10-01"), (12, "1998-10-01");
> DESCRIBE FORMATTED test_mm my_dt;
> -- distinct_count: 4
> insert into test_mm values (13, "2030-10-01"), (14, "2030-10-01"), (15, "2030-10-01");
> DESCRIBE FORMATTED test_mm my_dt;
> -- distinct_count: 4
> {code}
> The first INSERT statement sees that there are 0 records, so it makes sense that all of its
> distinct values are counted in the statistics.  However, for the second INSERT, Hive has no way
> to know whether "2030-10-01" is already present in the column, so the distinct_count is unchanged.
> By introducing a bloom filter for column statistics, the second INSERT may be able to determine
> that "2030-10-01" is indeed a new value and update the distinct_count accordingly.
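
The idea in the description can be sketched outside of Hive. The snippet below is a minimal illustration only, not Hive's implementation: `BloomFilter`, `insert_and_update_ndv`, and the sizing parameters (`num_bits`, `num_hashes`) are hypothetical names and values chosen for the example. It assumes a bloom filter is stored next to the distinct count, and that an inserted value the filter reports as "definitely not seen" increments the count.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizing values are illustrative assumptions, not tuned for real data.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def insert_and_update_ndv(bloom, ndv, new_values):
    """Return the distinct-count estimate after 'inserting' new_values."""
    for value in new_values:
        if not bloom.might_contain(value):
            ndv += 1  # the filter guarantees this value was never added
        bloom.add(value)
    return ndv


# Mirror the two INSERT statements from the issue description.
first_batch = ["2018-10-01"] * 3 + ["2017-10-01"] * 3 \
    + ["2010-10-01"] * 3 + ["1998-10-01"] * 3
second_batch = ["2030-10-01"] * 3

stats_filter = BloomFilter()
ndv_after_first = insert_and_update_ndv(stats_filter, 0, first_batch)
ndv_after_second = insert_and_update_ndv(stats_filter, ndv_after_first, second_batch)
print(ndv_after_first, ndv_after_second)
```

With so few values in a 1024-bit filter, the estimate should move from 4 to 5 on the second batch. A false positive can only make the filter claim a new value was already seen, so under this scheme the estimate can undercount but never overcount.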



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
