hive-dev mailing list archives

From "Rajesh Balamohan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-16290) Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
Date Fri, 24 Mar 2017 09:16:41 GMT
Rajesh Balamohan created HIVE-16290:
---------------------------------------

             Summary: Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong
when minValue == filterValue
                 Key: HIVE-16290
                 URL: https://issues.apache.org/jira/browse/HIVE-16290
             Project: Hive
          Issue Type: Bug
          Components: Statistics
            Reporter: Rajesh Balamohan
            Priority: Minor


Issue: 
=====
In {{StatsRulesProcFactory::evaluateComparator}}, when {{minValue}} is >= the filter {{value}}
(for a filter such as {{col > value}}), the estimate should be all rows. Currently, it returns
{{numRows/3}}. This causes fewer reducers to be spun up in queries, e.g. Q79 in TPC-DS.
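
The behaviour requested above can be sketched as follows. This is only an illustration, not the actual Hive code: the class {{ComparatorEstimateSketch}} and the method {{estimateGreaterThan}} are hypothetical names, and the {{numRows/3}} fallback is simply the heuristic described in this issue.

```java
public class ComparatorEstimateSketch {

    /**
     * Hypothetical row estimate for a filter of the form "col > filterValue".
     * Assumption: minValue is the column minimum taken from column statistics.
     */
    static long estimateGreaterThan(long numRows, long minValue, long filterValue) {
        if (minValue >= filterValue) {
            // Behaviour proposed in this issue: the filter bound is at or below
            // the column minimum, so estimate all rows.
            return numRows;
        }
        // Otherwise fall back to the one-third heuristic described in the issue.
        return numRows / 3;
    }

    public static void main(String[] args) {
        // store table: 1002 rows, column min = 200, filter value = 200
        System.out.println(estimateGreaterThan(1002, 200, 200)); // prints 1002
        // filter value above the minimum: heuristic still applies
        System.out.println(estimateGreaterThan(1002, 200, 250)); // prints 334
    }
}
```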


E.g: TPC-DS store table stats:
=============================
{noformat}
hive --orcfiledump hdfs://nn:8020/apps/hive/warehouse/tpcds_bin_partitioned_orc_1000.db/store/000000_0
Stripe Statistics:
  Stripe 1:
    Column 0: count: 1002 hasNull: false
    Column 1: count: 1002 hasNull: false min: 1 max: 1002 sum: 502503
    Column 2: count: 1002 hasNull: false min: AAAAAAAAAABAAAAA max: AAAAAAAAPPBAAAAA sum: 16032
    Column 3: count: 1002 hasNull: false min:  max: 2001-03-13 sum: 9950
    Column 4: count: 1002 hasNull: false min:  max: 2001-03-12 sum: 5010
    Column 5: count: 273 hasNull: true min: 2450820 max: 2451313 sum: 669141525
    Column 6: count: 1002 hasNull: false min:  max: pri sum: 3916
    Column 7: count: 994 hasNull: true min: 200 max: 300 sum: 249970
    Column 8: count: 996 hasNull: true min: 5002549 max: 9997773 sum: 7382689071
    Column 9: count: 1002 hasNull: false min:  max: 8AM-8AM sum: 7088

select compute_stats(s_employee_count, 16) from store;

{"columntype":"Long","min":200,"max":300,"countnulls":8,"numdistinctvalues":63,"ndvbitvector":"{0,
1, 2, 3, 4, 5, 11, 12}{0, 1, 2, 3, 6}{0, 1, 2, 3, 4, 5, 7, 11}{0, 1, 2, 3, 4, 5, 7}{0, 1,
2, 3, 4, 5, 6}{0, 1, 2, 3, 4, 5, 8}{0, 1, 2, 3, 4}{0, 1, 2, 3, 4, 5, 7, 9}{0, 1, 2, 3, 4}{0}{0,
1, 2, 3, 4, 5, 7}{0, 1, 2, 3, 4, 5, 6, 7}{0, 1, 2, 3, 4, 8, 9, 14}{0, 1, 2, 3, 5}{0, 1, 2,
3, 4, 5, 6, 7}{0, 1, 2, 3, 4, 5, 6, 8}"}
{noformat}

{noformat}
explain select count(s_store_sk) from store where s_number_employees > 200 and s_number_employees < 295;
{noformat}

The above query first estimates 1002 / 3 = 334 rows for {{s_number_employees > 200}} and then
334 / 3 = 111 rows for {{s_number_employees < 295}}. Ideally, the estimate for the filter
{{s_number_employees > 200}} should be all 1002 rows.
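
The arithmetic above can be reproduced with a small standalone sketch. The class {{ChainedFilterEstimate}} and its methods are hypothetical names for illustration, and integer division is assumed, which matches the 334 and 111 figures quoted in this issue.

```java
public class ChainedFilterEstimate {

    /** One-third heuristic described in the issue, using integer division. */
    static long oneThird(long numRows) {
        return numRows / 3;
    }

    /** Applies the heuristic once per range predicate in a conjunction. */
    static long chained(long numRows, int predicates) {
        long estimate = numRows;
        for (int i = 0; i < predicates; i++) {
            estimate = oneThird(estimate);
        }
        return estimate;
    }

    public static void main(String[] args) {
        long numRows = 1002; // row count of the store table in the dump above
        System.out.println(chained(numRows, 1)); // after "> 200": prints 334
        System.out.println(chained(numRows, 2)); // after "< 295": prints 111
    }
}
```

With two range predicates, the estimate drops to roughly one ninth of the table, even though the first predicate's bound equals the column minimum.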


In TPC-DS Q79, this causes too few reduce tasks to be spun up, causing runtime delays.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
