hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Pullokkaran <jpullokka...@hortonworks.com>
Subject Re: CBO - get cost of the plan
Date Tue, 25 Aug 2015 00:25:05 GMT
In addition to col stats you also need table stats.

From: John Pullokkaran <jpullokkaran@hortonworks.com<mailto:jpullokkaran@hortonworks.com>>
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
Date: Monday, August 24, 2015 at 5:23 PM
To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
Cc: "dev@hive.apache.org<mailto:dev@hive.apache.org>" <dev@hive.apache.org<mailto:dev@hive.apache.org>>
Subject: Re: CBO - get cost of the plan

Raajay,

    You don't have col stats hence it assumes 1 for row count.
What version of Hive are you on?

Thanks
John

From: Raajay <raajay.v@gmail.com<mailto:raajay.v@gmail.com>>
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
Date: Monday, August 24, 2015 at 5:19 PM
To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>>
Cc: "dev@hive.apache.org<mailto:dev@hive.apache.org>" <dev@hive.apache.org<mailto:dev@hive.apache.org>>
Subject: CBO - get cost of the plan

Hello,

I am interested to get the cost of the query plans as calculated by the CBO. How can I get
that information ? For example, consider a query with a three way join of the following form:

Query
=====

insert overwrite table output_tab
select
a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as total_sales
from
tableA a join tableB b
on a.a_day = b.b_day and a.a_product = b.b_product
join tableC c
on b.b_day = c.c_day and b.b_alternate = c.c_alternate;


The number of rows for tableA, tableB, and tableC are of the order of 10000. I believe, that
by "analyzing columns" of all the tables Hive will have statistics regarding the number of
rows, distinct values, etc. However, when I try to print out the operator tree as determined
by the CalcitePlanner, I get the following output.

Print out of the Operator Tree
======================

HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], total_sales=[+(+($6, $3), $9)]):
rowcount = 1.0, cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 150
  HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], algorithm=[none], cost=[{2.0
rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id
= 148
    HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], algorithm=[none], cost=[{2.0
rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id
= 143
      HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], b_sales=[$3]): rowcount =
1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 138
        HiveTableScan(table=[[default.tableb]]): rowcount = 1.0, cumulative cost = {0}, id
= 44
      HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = 1.0, cumulative cost
= {0.0 rows, 0.0 cpu, 0.0 io}, id = 141
        HiveTableScan(table=[[default.tablea]]): rowcount = 1.0, cumulative cost = {0}, id
= 42
    HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = 1.0, cumulative cost
= {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
      HiveTableScan(table=[[default.tablec]]): rowcount = 1.0, cumulative cost = {0}, id =
47


The number of rows as displayed here is 1.0, which is clearly not the correct value.

- Raajay.



Mime
View raw message