hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-8196) Joining on partition columns with fetch column stats enabled results it very small CE which negatively affects query performance
Date Fri, 19 Sep 2014 22:31:35 GMT

     [ https://issues.apache.org/jira/browse/HIVE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Mostafa Mokhtar updated HIVE-8196:
----------------------------------
    Description: 
To make the best out of dynamic partition pruning joins should be on the partitioning columns
which results in dynamically pruning the partitions from the fact table based on the qualifying
column keys from the dimension table, this type of joins negatively effects on cardinality
estimates with fetch column stats enabled.
Currently we don't have statistics for partition columns and as a result NDV is set to row
count, doing that negatively affects the estimated join selectivity from the join.

Workaround is to capture statistics for partition columns or use number of partitions incase
dynamic partitioning is used.

In StatsUtils.getColStatisticsFromExpression is where count distincts gets set to row count

{code}
  if (encd.getIsPartitionColOrVirtualCol()) {

        // vitual columns
        colType = encd.getTypeInfo().getTypeName();
        countDistincts = numRows;
        oi = encd.getWritableObjectInspector();
{code}


Query used to repro the issue :
{code}
set hive.stats.fetch.column.stats=ture;
set hive.tez.dynamic.partition.pruning=true;
explain select d_date 
from store_sales, date_dim 
where 
store_sales.ss_sold_date_sk = date_dim.d_date_sk and 
date_dim.d_year = 1998;
{code}

Plan 
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Map 1 <- Map 2 (BROADCAST_EDGE)
      DagName: mmokhtar_20140919180404_945d29f5-d041-4420-9666-1c5d64fa6540:8
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_sold_date_sk is not null (type: boolean)
                  Statistics: Num rows: 550076554 Data size: 47370018816 Basic stats: COMPLETE
Column stats: COMPLETE
                  Map Join Operator
                    condition map:
                         Inner Join 0 to 1
                    condition expressions:
                      0 {ss_sold_date_sk}
                      1 {d_date_sk} {d_date}
                    keys:
                      0 ss_sold_date_sk (type: int)
                      1 d_date_sk (type: int)
                    outputColumnNames: _col22, _col26, _col28
                    input vertices:
                      1 Map 2
                    Statistics: Num rows: 652 Data size: 66504 Basic stats: COMPLETE Column
stats: COMPLETE
                    Filter Operator
                      predicate: (_col22 = _col26) (type: boolean)
                      Statistics: Num rows: 326 Data size: 33252 Basic stats: COMPLETE Column
stats: COMPLETE
                      Select Operator
                        expressions: _col28 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 326 Data size: 30644 Basic stats: COMPLETE Column
stats: COMPLETE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 326 Data size: 30644 Basic stats: COMPLETE
Column stats: COMPLETE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Execution mode: vectorized
        Map 2
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (d_date_sk is not null and (d_year = 1998)) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column
stats: COMPLETE
                  Filter Operator
                    predicate: (d_date_sk is not null and (d_year = 1998)) (type: boolean)
                    Statistics: Num rows: 652 Data size: 66504 Basic stats: COMPLETE Column
stats: COMPLETE
                    Reduce Output Operator
                      key expressions: d_date_sk (type: int)
                      sort order: +
                      Map-reduce partition columns: d_date_sk (type: int)
                      Statistics: Num rows: 652 Data size: 66504 Basic stats: COMPLETE Column
stats: COMPLETE
                      value expressions: d_date (type: string)
                    Select Operator
                      expressions: d_date_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 652 Data size: 0 Basic stats: PARTIAL Column stats:
COMPLETE
                      Group By Operator
                        keys: _col0 (type: int)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 652 Data size: 0 Basic stats: PARTIAL Column
stats: COMPLETE
                        Dynamic Partitioning Event Operator
                          Target Input: store_sales
                          Partition key expr: ss_sold_date_sk
                          Statistics: Num rows: 652 Data size: 0 Basic stats: PARTIAL Column
stats: COMPLETE
                          Target column: ss_sold_date_sk
                          Target Vertex: Map 1

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

  was:
To make the best out of dynamic partition pruning joins should be on the partitioning columns
which results in dynamically pruning the partitions from the fact table based on the qualifying
column keys from the dimension table, this type of joins negatively effects on cardinality
estimates with fetch column stats enabled.
Currently we don't have statistics for partition columns and as a result NDV is set to row
count, doing that negatively affects the estimated join selectivity from the join.

Workaround is to capture statistics for partition columns or use number of partitions incase
dynamic partitioning is used.

In StatsUtils.getColStatisticsFromExpression is where count distincts gets set to row count

{code}
  if (encd.getIsPartitionColOrVirtualCol()) {

        // vitual columns
        colType = encd.getTypeInfo().getTypeName();
        countDistincts = numRows;
        oi = encd.getWritableObjectInspector();
{code}


Query used to repro the issue :
{code}
set hive.stats.fetch.column.stats=ture;
set hive.tez.dynamic.partition.pruning=true;
explain select d_date 
from store_sales, date_dim 
where 
store_sales.ss_sold_date_sk = date_dim.d_date_sk and 
date_dim.d_year = 1998;
{code}

Plan 


> Joining on partition columns with fetch column stats enabled results it very small CE
which negatively affects query performance 
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8196
>                 URL: https://issues.apache.org/jira/browse/HIVE-8196
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 0.14.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Prasanth J
>            Priority: Critical
>              Labels: performance
>             Fix For: 0.14.0
>
>
> To make the best out of dynamic partition pruning joins should be on the partitioning
columns which results in dynamically pruning the partitions from the fact table based on the
qualifying column keys from the dimension table, this type of joins negatively effects on
cardinality estimates with fetch column stats enabled.
> Currently we don't have statistics for partition columns and as a result NDV is set to
row count, doing that negatively affects the estimated join selectivity from the join.
> Workaround is to capture statistics for partition columns or use number of partitions
incase dynamic partitioning is used.
> In StatsUtils.getColStatisticsFromExpression is where count distincts gets set to row
count 
> {code}
>   if (encd.getIsPartitionColOrVirtualCol()) {
>         // vitual columns
>         colType = encd.getTypeInfo().getTypeName();
>         countDistincts = numRows;
>         oi = encd.getWritableObjectInspector();
> {code}
> Query used to repro the issue :
> {code}
> set hive.stats.fetch.column.stats=ture;
> set hive.tez.dynamic.partition.pruning=true;
> explain select d_date 
> from store_sales, date_dim 
> where 
> store_sales.ss_sold_date_sk = date_dim.d_date_sk and 
> date_dim.d_year = 1998;
> {code}
> Plan 
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Map 1 <- Map 2 (BROADCAST_EDGE)
>       DagName: mmokhtar_20140919180404_945d29f5-d041-4420-9666-1c5d64fa6540:8
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   filterExpr: ss_sold_date_sk is not null (type: boolean)
>                   Statistics: Num rows: 550076554 Data size: 47370018816 Basic stats:
COMPLETE Column stats: COMPLETE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     condition expressions:
>                       0 {ss_sold_date_sk}
>                       1 {d_date_sk} {d_date}
>                     keys:
>                       0 ss_sold_date_sk (type: int)
>                       1 d_date_sk (type: int)
>                     outputColumnNames: _col22, _col26, _col28
>                     input vertices:
>                       1 Map 2
>                     Statistics: Num rows: 652 Data size: 66504 Basic stats: COMPLETE
Column stats: COMPLETE
>                     Filter Operator
>                       predicate: (_col22 = _col26) (type: boolean)
>                       Statistics: Num rows: 326 Data size: 33252 Basic stats: COMPLETE
Column stats: COMPLETE
>                       Select Operator
>                         expressions: _col28 (type: string)
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 326 Data size: 30644 Basic stats: COMPLETE
Column stats: COMPLETE
>                         File Output Operator
>                           compressed: false
>                           Statistics: Num rows: 326 Data size: 30644 Basic stats: COMPLETE
Column stats: COMPLETE
>                           table:
>                               input format: org.apache.hadoop.mapred.TextInputFormat
>                               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>             Execution mode: vectorized
>         Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: date_dim
>                   filterExpr: (d_date_sk is not null and (d_year = 1998)) (type: boolean)
>                   Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE
Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (d_date_sk is not null and (d_year = 1998)) (type: boolean)
>                     Statistics: Num rows: 652 Data size: 66504 Basic stats: COMPLETE
Column stats: COMPLETE
>                     Reduce Output Operator
>                       key expressions: d_date_sk (type: int)
>                       sort order: +
>                       Map-reduce partition columns: d_date_sk (type: int)
>                       Statistics: Num rows: 652 Data size: 66504 Basic stats: COMPLETE
Column stats: COMPLETE
>                       value expressions: d_date (type: string)
>                     Select Operator
>                       expressions: d_date_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 652 Data size: 0 Basic stats: PARTIAL Column
stats: COMPLETE
>                       Group By Operator
>                         keys: _col0 (type: int)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 652 Data size: 0 Basic stats: PARTIAL Column
stats: COMPLETE
>                         Dynamic Partitioning Event Operator
>                           Target Input: store_sales
>                           Partition key expr: ss_sold_date_sk
>                           Statistics: Num rows: 652 Data size: 0 Basic stats: PARTIAL
Column stats: COMPLETE
>                           Target column: ss_sold_date_sk
>                           Target Vertex: Map 1
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message