hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Pullokkaran <jpullokka...@hortonworks.com>
Subject Re: CBO - get cost of the plan
Date Tue, 25 Aug 2015 01:51:27 GMT
>From the text below it seems like you are not collecting table level stats.
You can collect table level stats by following
analyze table <table name> compute statistics;


Thanks
John



On 8/24/15, 6:24 PM, "Raajay" <raajay.v@gmail.com> wrote:

>Hi John,
>
>I am on Hive-2.0.0. I forked of the Hive master branch  2 weeks back
>(commit id: 763cb02b5eafb0ecd3fd0eb512636a1b092df671).
>
>I actually have "analyze" before I execute the query. I left it out for
>brevity. Please find the entire query (sent to hive in a file) below.
>Without the analyze commands, I find that CBO optimization is ignored as
>expected. Perhaps I am missing some configuration.
>
>I print out the calcite optimized plans, using the "RelOptUtil.toString()"
>helper on "calciteOptimizedPlan" at the end of "apply" function in
>CalcitePlannerAction.
>
>
>- Raajay
>
>
>
>Query
>=====
>
>-- Set the hive configuration
>
>-- clear out the existings tables
>DROP TABLE tableA;
>DROP TABLE tableB;
>DROP TABLE tableC;
>DROP TABLE output_tab;
>
>-- create the tables and load the data
>create external table tableA (a_day int, a_product string, a_alternate
>string, a_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableA_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableA;
>
>create external table tableB (b_day int, b_product string, b_alternate
>string, b_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableB_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableB;
>
>create external table tableC (c_day int, c_product string, c_alternate
>string, c_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableC_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableC;
>
>-- compute statistics to be used by calcite CBO
>analyze table tableA compute statistics for columns;
>analyze table tableB compute statistics for columns;
>analyze table tableC compute statistics for columns;
>
>-- create output tables
>create table output_tab (a_day int, a_product string, a_alternate string,
>total_sales int);
>
>-- the query
>insert overwrite table output_tab
>select
>a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as
>total_sales
>from
>tableA a join tableB b
>on a.a_day = b.b_day and a.a_product = b.b_product
>join tableC c
>on b.b_day = c.c_day and b.b_alternate = c.c_alternate;
>
>
>On Mon, Aug 24, 2015 at 7:25 PM, John Pullokkaran <
>jpullokkaran@hortonworks.com> wrote:
>
>> In addition to col stats you also need table stats.
>>
>> From: John Pullokkaran <jpullokkaran@hortonworks.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Monday, August 24, 2015 at 5:23 PM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Cc: "dev@hive.apache.org" <dev@hive.apache.org>
>> Subject: Re: CBO - get cost of the plan
>>
>> Raajay,
>>
>>     You donĀ¹t have col stats hence it assumes 1 for row count.
>> What version of Hive are you on?
>>
>> Thanks
>> John
>>
>> From: Raajay <raajay.v@gmail.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Monday, August 24, 2015 at 5:19 PM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Cc: "dev@hive.apache.org" <dev@hive.apache.org>
>> Subject: CBO - get cost of the plan
>>
>> Hello,
>>
>> I am interested to get the cost of the query plans as calculated by the
>> CBO. How can I get that information ? For example, consider a query
>>with a
>> three way join of the following form:
>>
>> Query
>> =====
>>
>> insert overwrite table output_tab
>> select
>> a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as
>>total_sales
>> from
>> tableA a join tableB b
>> on a.a_day = b.b_day and a.a_product = b.b_product
>> join tableC c
>> on b.b_day = c.c_day and b.b_alternate = c.c_alternate;
>>
>>
>> The number of rows for tableA, tableB, and tableC are of the order of
>> 10000. I believe, that by "analyzing columns" of all the tables Hive
>>will
>> have statistics regarding the number of rows, distinct values, etc.
>> However, when I try to print out the operator tree as determined by the
>> CalcitePlanner, I get the following output.
>>
>> Print out of the Operator Tree
>> ======================
>>
>> HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2],
>> total_sales=[+(+($6, $3), $9)]): rowcount =* 1.0*, cumulative cost =
>>{4.0
>> rows, 0.0 cpu, 0.0 io}, id = 150
>>   HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner],
>> algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0,
>> cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 148
>>     HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner],
>> algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0,
>> cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id = 143
>>       HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2],
>> b_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
>> io}, id = 138
>>         HiveTableScan(table=[[default.tableb]]): rowcount = 1.0,
>> cumulative cost = {0}, id = 44
>>       HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount =
>> 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 141
>>         HiveTableScan(table=[[default.tablea]]): rowcount = 1.0,
>> cumulative cost = {0}, id = 42
>>     HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount =
>> 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
>>       HiveTableScan(table=[[default.tablec]]): rowcount = 1.0,
>>cumulative
>> cost = {0}, id = 47
>>
>>
>> The number of rows as displayed here is 1.0, which is clearly not the
>> correct value.
>>
>> - Raajay.
>>
>>
>>


Mime
View raw message