hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mostafa Mokhtar (JIRA)" <>
Subject [jira] [Updated] (HIVE-7913) Simplify predicates for CBO
Date Fri, 29 Aug 2014 18:05:53 GMT


Mostafa Mokhtar updated HIVE-7913:

    Assignee: Laljo John Pullokkaran

> Simplify predicates for CBO
> ---------------------------
>                 Key: HIVE-7913
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 0.13.1
>            Reporter: Mostafa Mokhtar
>            Assignee: Laljo John Pullokkaran
>             Fix For: 0.14.0
> I noticed that the estimate number of rows in Map joins is higher after the join than
before the join that is with column stats fetch ON or OFF.
> TPC-DS Q55 was a good example for that, the issue is that the current statistics provide
us enough information that we can estimate with strong confidence that the joins are one to
many and not many to many.
> Joining store_sales x item on ss_item_sk = i_item_sk, we know that the NDV, min and max
values for both join columns match while the row counts are different this pattern indicates
a PK/FK relationship between store_sales and item.
> Yet when a filter is applied on item and reduces the number of rows from 462K to 7K we
estimate a many to many join between the filtered item and store_sales and as a result the
estimate number of rows coming out of the join is off by several orders of magnitude.
> Available information from the stats
> {code}
> Table		Join column	NDV from describe		NDV actual	min		max
> item		i_item_sk	439,501				462,000		1		462,000
> date_dim	d_date_sk	65,332				73,049		2,415,022	2,488,070
> store_sales	ss_item_sk	439,501				462,000		1		462,000
> store_sales	ss_sold_date_sk	2,226				1,823		2,450,816	2,452,642
> {code}
> Same thing applies to store_sales and date_dim but with a caveat that the NDV , min and
max values don't match where date_dim has a bigger domain and accordingly a higher NDV count.
> For joining store_sales and item on on ss_item_sk = i_item_sk since both columns have
the same NDV, min and max values we can safely conclude that selectivity on item will translate
to similar selectivity on store_sales.
> This is not the case for joining store_sales and date_dim on ss_sold_date_sk = d_date_sk
since the domain of d_date_sk is much bigger than that of ss_sold_date_sk, differences in
domain need to be taken into account when inferring selectivity onto store_sales.

This message was sent by Atlassian JIRA

View raw message