hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Prasanth Jayachandran (JIRA)" <>
Subject [jira] [Updated] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved
Date Mon, 21 Mar 2016 21:44:25 GMT


Prasanth Jayachandran updated HIVE-13254:
    Attachment: q3.svg

> GBY cardinality estimation is wrong partition columns is involved
> -----------------------------------------------------------------
>                 Key: HIVE-13254
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.3.0, 2.0.0, 2.1.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: q3.svg, q3_ef_transpose_aggr.svg
> When running the following query on TPCDS-1000 scale, setting hive.transpose.aggr.join=true
is expected to generate optimal plan but it was not generating. 
> {code:title=Query}
> SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
>        `item`.`i_category`     AS `i_category` 
> FROM   `store_sales` `store_sales` 
>        INNER JOIN `item` `item` 
>                ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
>        INNER JOIN `date_dim` `date_dim` 
>                ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` ) 
> GROUP  BY `d_day_name`, 
>           `i_category`;
> {code}
> The reason for that is stats annotation rule for GROUP BY is not considering partition
column into account. For the above query, the generated plan is attached. As we can see from
the plan, GBY is pushed to fact table (store_sales) but that output of GBY shuffled to perform
join instead of MapJoin conversion. This is because of wrong estimation of cardinality/data
size of GBY on store_sales (Map 1). 
> What's happening internally is, GBY computes estimated cardinality which in this case
is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. This estimate is wrong
as ss_sold_date_sk is partition column and estimator assumes its non-partition column. In
this case, not every tasks reads data from all partitions. We need to take estimated task
parallelism into account. For example: If task parallelism is determined to be 100 the estimate
from GBY should be ~6M which should convert this vertex into map join vertex. 

This message was sent by Atlassian JIRA

View raw message