hive-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-12094) nDV of aggregate columns tend to be log scale - not unique
Date Mon, 12 Oct 2015 12:26:05 GMT
Gopal V created HIVE-12094:
------------------------------

             Summary: nDV of aggregate columns tend to be log scale - not unique
                 Key: HIVE-12094
                 URL: https://issues.apache.org/jira/browse/HIVE-12094
             Project: Hive
          Issue Type: Improvement
          Components: Statistics
    Affects Versions: 1.3.0, 2.0.0
            Reporter: Gopal V


Stats for aggregate columns are not estimated properly if the column is treated as having a simple nDV (number of distinct values). For example, the query

{code}
select count(distinct l_suppkey)
from lineitem
group by l_orderkey
having count(distinct l_suppkey) = 1
{code}

will have the cardinality of its output mis-estimated by a significant margin.

In general, the nDV of an aggregate column is log-scale and skews towards a very low number,
which StatsRulesProcFactory does not account for.
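
To make the skew concrete, here is a small Python sketch that emulates the query above on synthetic data. Everything in it is an assumption for illustration: the helper names (make_lineitem, distinct_counts_per_group), the uniform 1 to 7 line items per order, and the supplier count are made up, and it does not reproduce what StatsRulesProcFactory computes. It only shows that the aggregate column takes a handful of distinct values even when there are many groups.

```python
import random
from collections import defaultdict


def make_lineitem(num_orders, max_items=7, num_suppliers=10_000, seed=0):
    """Build synthetic (l_orderkey, l_suppkey) rows; the distribution is invented."""
    rng = random.Random(seed)
    rows = []
    for orderkey in range(num_orders):
        # Each order gets between 1 and max_items line items.
        for _ in range(rng.randint(1, max_items)):
            rows.append((orderkey, rng.randrange(num_suppliers)))
    return rows


def distinct_counts_per_group(rows):
    """Emulate: select count(distinct l_suppkey) from ... group by l_orderkey."""
    suppliers = defaultdict(set)
    for orderkey, suppkey in rows:
        suppliers[orderkey].add(suppkey)
    return {orderkey: len(keys) for orderkey, keys in suppliers.items()}


rows = make_lineitem(num_orders=50_000)
counts = distinct_counts_per_group(rows)

num_groups = len(counts)                       # rows produced by the group by
aggregate_ndv = len(set(counts.values()))      # nDV of the aggregate column
having_one = sum(1 for c in counts.values() if c == 1)  # rows passing "= 1"

print("groups:", num_groups)
print("nDV of count(distinct l_suppkey):", aggregate_ndv)
print("rows with count(distinct l_suppkey) = 1:", having_one)
```

In this synthetic setup the aggregate column can have at most 7 distinct values across 50,000 groups, so any estimate that assumes its nDV is close to the number of groups would be far off for the "= 1" filter.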



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
