hive-user mailing list archives

From John Pullokkaran <jpullokka...@hortonworks.com>
Subject Re: CBO - get cost of the plan
Date Tue, 25 Aug 2015 19:08:32 GMT
#1 The row count estimate for "tableA" inner join "tableB"
This depends on the selectivity of the join.
The formula is Cardinality(A) * Cardinality(B) * Selectivity.
We do have logic to infer a PK-FK relationship based on cardinality & NDV.
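The formula above can be sketched in a few lines. This is an illustration only, not Hive's actual implementation: the 1/max(NDV) selectivity per equi-join key is a common textbook heuristic, and the NDV figures below are made up.

```python
# Sketch of the estimate described above:
#   |A JOIN B| ~= Cardinality(A) * Cardinality(B) * Selectivity
# Selectivity here is the textbook 1/max(ndv) per equi-join key,
# used only for illustration -- not Hive's exact logic.

def join_selectivity(ndv_pairs):
    """ndv_pairs: one (ndv_left, ndv_right) tuple per equi-join key."""
    sel = 1.0
    for ndv_l, ndv_r in ndv_pairs:
        sel *= 1.0 / max(ndv_l, ndv_r)
    return sel

def estimated_join_rows(rows_left, rows_right, ndv_pairs):
    return rows_left * rows_right * join_selectivity(ndv_pairs)

# Hypothetical NDVs for a 7M x 7M join on a single key:
est = estimated_join_rows(7_000_000, 7_000_000, [(1_000_000, 1_000_000)])
print(est)  # ~4.9e7 estimated rows
```

A join on a high-NDV key (close to a PK-FK join) drives the selectivity down, which is why inferring the PK-FK relationship from cardinality and NDV matters for the estimate.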

#2 what is the definition of cumulative cost
This is the total cost, computed bottom-up (including the cost of the current operator).
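In other words, each operator's cumulative cost is its own cost plus the cumulative costs of its inputs. A minimal sketch with hypothetical classes (not Hive's), showing only the "rows" component of Calcite's {rows, cpu, io} cost vector and reusing the numbers from the plan quoted in this thread:

```python
# Cumulative cost = the operator's own cost plus the cumulative costs
# of all of its inputs, computed bottom-up.  Hypothetical Op class for
# illustration; only the "rows" cost component is modeled.

class Op:
    def __init__(self, name, own_rows_cost, inputs=()):
        self.name = name
        self.own_rows_cost = own_rows_cost
        self.inputs = list(inputs)

    def cumulative_cost(self):
        return self.own_rows_cost + sum(c.cumulative_cost() for c in self.inputs)

# Numbers taken from the plan in the quoted message:
scan_b = Op("HiveTableScan(tableb)", 0.0)
scan_a = Op("HiveTableScan(tablea)", 0.0)
scan_c = Op("HiveTableScan(tablec)", 0.0)
join_ab = Op("HiveJoin(b,a)", 1.4e7, [scan_b, scan_a])
join_abc = Op("HiveJoin(ab,c)", 2.4838122487221785e7, [join_ab, scan_c])

print(join_abc.cumulative_cost())  # ~3.8838e7, matching the top join's cumulative cost
```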

Thanks
John

From: Raajay <raajay.v@gmail.com<mailto:raajay.v@gmail.com>>
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
Date: Monday, August 24, 2015 at 8:09 PM
To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
Subject: Re: CBO - get cost of the plan

Ah okay. Thanks a lot! Now I can get non-default values after collecting table level stats.


"CalciteOptimizedPlan" looks like this:

HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], total_sales=[+(+($6, $3), $9)]): rowcount = 6.112516920555744E9, cumulative cost = {3.8838122487221785E7 rows, 0.0 cpu, 0.0 io}, id = 155
  HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], algorithm=[none], cost=[{2.4838122487221785E7 rows, 0.0 cpu, 0.0 io}]): rowcount = 6.112516920555744E9, cumulative cost = {3.8838122487221785E7 rows, 0.0 cpu, 0.0 io}, id = 153
    HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], algorithm=[none], cost=[{1.4E7 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.7628122487221785E7, cumulative cost = {1.4E7 rows, 0.0 cpu, 0.0 io}, id = 148
      HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], b_sales=[$3]): rowcount = 7000000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 143
        HiveTableScan(table=[[default.tableb]]): rowcount = 7000000.0, cumulative cost = {0}, id = 44
      HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = 7000000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
        HiveTableScan(table=[[default.tablea]]): rowcount = 7000000.0, cumulative cost = {0}, id = 42
    HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = 7210000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 151
      HiveTableScan(table=[[default.tablec]]): rowcount = 7210000.0, cumulative cost = {0}, id = 47


The row count estimate for "tableA" inner join "tableB" is higher than expected. Ideally,
the size of the join should be at most the cardinality of the smallest table, because the
keys over which we join are distinct in each table. But here the estimates are

|tableA| = |tableB| = 7E6 and |tableA join tableB| = 1.76E7 (see the rowcount values in
the log snippet above)
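The expectation stated above can be checked mechanically: when the join key is unique in both inputs, each row matches at most one row on the other side, so the result can never exceed the smaller input. A tiny sketch with made-up keys (not the mailing-list data):

```python
# If the join key is distinct in both tables, an inner join keeps exactly
# the keys common to both sides, so |A JOIN B| <= min(|A|, |B|).

def inner_join_size(keys_a, keys_b):
    # Valid only when keys are distinct within each table.
    return len(set(keys_a) & set(keys_b))

a = [1, 2, 3, 4, 5]   # distinct keys of a hypothetical tableA
b = [4, 5, 6]         # distinct keys of a hypothetical tableB
size = inner_join_size(a, b)
print(size)  # 2
assert size <= min(len(a), len(b))
```

An estimate above min(|A|, |B|) therefore suggests the optimizer does not know the keys are distinct, i.e. the column NDV stats are missing or stale.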

Should I explicitly specify this somewhere, so that stats gathering is accurate?

Also, what is the definition of cumulative cost ?

Thanks for the help,
Raajay

On Mon, Aug 24, 2015 at 8:51 PM, John Pullokkaran <jpullokkaran@hortonworks.com<mailto:jpullokkaran@hortonworks.com>>
wrote:
From the text below it seems like you are not collecting table level stats.
You can collect table level stats by following
analyze table <table name> compute statistics;


Thanks
John



On 8/24/15, 6:24 PM, "Raajay" <raajay.v@gmail.com<mailto:raajay.v@gmail.com>>
wrote:

>Hi John,
>
>I am on Hive-2.0.0. I forked off the Hive master branch 2 weeks ago
>(commit id: 763cb02b5eafb0ecd3fd0eb512636a1b092df671).
>
>I actually have "analyze" before I execute the query. I left it out for
>brevity. Please find the entire query (sent to hive in a file) below.
>Without the analyze commands, I find that CBO optimization is ignored as
>expected. Perhaps I am missing some configuration.
>
>I print out the calcite optimized plans, using the "RelOptUtil.toString()"
>helper on "calciteOptimizedPlan" at the end of "apply" function in
>CalcitePlannerAction.
>
>
>- Raajay
>
>
>
>Query
>=====
>
>-- Set the hive configuration
>
>-- clear out the existing tables
>DROP TABLE tableA;
>DROP TABLE tableB;
>DROP TABLE tableC;
>DROP TABLE output_tab;
>
>-- create the tables and load the data
>create external table tableA (a_day int, a_product string, a_alternate
>string, a_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableA_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableA;
>
>create external table tableB (b_day int, b_product string, b_alternate
>string, b_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableB_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableB;
>
>create external table tableC (c_day int, c_product string, c_alternate
>string, c_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableC_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableC;
>
>-- compute statistics to be used by calcite CBO
>analyze table tableA compute statistics for columns;
>analyze table tableB compute statistics for columns;
>analyze table tableC compute statistics for columns;
>
>-- create output tables
>create table output_tab (a_day int, a_product string, a_alternate string,
>total_sales int);
>
>-- the query
>insert overwrite table output_tab
>select
>a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as
>total_sales
>from
>tableA a join tableB b
>on a.a_day = b.b_day and a.a_product = b.b_product
>join tableC c
>on b.b_day = c.c_day and b.b_alternate = c.c_alternate;
>
>
>On Mon, Aug 24, 2015 at 7:25 PM, John Pullokkaran <
>jpullokkaran@hortonworks.com<mailto:jpullokkaran@hortonworks.com>> wrote:
>
>> In addition to col stats you also need table stats.
>>
>> From: John Pullokkaran <jpullokkaran@hortonworks.com<mailto:jpullokkaran@hortonworks.com>>
>> Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
>> Date: Monday, August 24, 2015 at 5:23 PM
>> To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
>> Cc: "dev@hive.apache.org<mailto:dev@hive.apache.org>" <dev@hive.apache.org<mailto:dev@hive.apache.org>>
>> Subject: Re: CBO - get cost of the plan
>>
>> Raajay,
>>
>>     You don't have col stats, hence it assumes 1 for the row count.
>> What version of Hive are you on?
>>
>> Thanks
>> John
>>
>> From: Raajay <raajay.v@gmail.com<mailto:raajay.v@gmail.com>>
>> Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
>> Date: Monday, August 24, 2015 at 5:19 PM
>> To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
>> Cc: "dev@hive.apache.org<mailto:dev@hive.apache.org>" <dev@hive.apache.org<mailto:dev@hive.apache.org>>
>> Subject: CBO - get cost of the plan
>>
>> Hello,
>>
>> I am interested in getting the cost of the query plans as calculated by
>> the CBO. How can I get that information? For example, consider a query
>> with a three-way join of the following form:
>>
>> Query
>> =====
>>
>> insert overwrite table output_tab
>> select
>> a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as
>>total_sales
>> from
>> tableA a join tableB b
>> on a.a_day = b.b_day and a.a_product = b.b_product
>> join tableC c
>> on b.b_day = c.c_day and b.b_alternate = c.c_alternate;
>>
>>
>> The number of rows for tableA, tableB, and tableC is of the order of
>> 10000. I believe that by "analyzing columns" of all the tables, Hive
>> will have statistics regarding the number of rows, distinct values, etc.
>> However, when I try to print out the operator tree as determined by the
>> CalcitePlanner, I get the following output.
>>
>> Print out of the Operator Tree
>> ======================
>>
>> HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], total_sales=[+(+($6, $3), $9)]): rowcount = 1.0, cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 150
>>   HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 148
>>     HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id = 143
>>       HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], b_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 138
>>         HiveTableScan(table=[[default.tableb]]): rowcount = 1.0, cumulative cost = {0}, id = 44
>>       HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 141
>>         HiveTableScan(table=[[default.tablea]]): rowcount = 1.0, cumulative cost = {0}, id = 42
>>     HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
>>       HiveTableScan(table=[[default.tablec]]): rowcount = 1.0, cumulative cost = {0}, id = 47
>>
>>
>> The number of rows as displayed here is 1.0, which is clearly not the
>> correct value.
>>
>> - Raajay.
>>
>>
>>

