hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-8517) When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression
Date Sun, 19 Oct 2014 18:26:33 GMT
Mostafa Mokhtar created HIVE-8517:
-------------------------------------

             Summary: When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression
                 Key: HIVE-8517
                 URL: https://issues.apache.org/jira/browse/HIVE-8517
             Project: Hive
          Issue Type: Bug
          Components: Physical Optimizer
    Affects Versions: 0.14.0
            Reporter: Mostafa Mokhtar
             Fix For: 0.14.0


When joining on partition column number of partitions is used as NDV which gets overridden
by StatsUtils.getColStatisticsFromExpression and the number of partitions used as NDV is replaced
by number of rows which results in the same behavior as explained in https://issues.apache.org/jira/browse/HIVE-8196.
"Joining on partition columns with fetch column stats enabled results it very small CE which
negatively affects query performance "

This is the call stack.
{code}
StatsUtils.getColStatisticsFromExpression(HiveConf, Statistics, ExprNodeDesc) line: 1001	
StatsRulesProcFactory$ReduceSinkStatsRule.process(Node, Stack<Node>, NodeProcessorCtx,
Object...) line: 1479	
DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90	
PreOrderWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94	
PreOrderWalker(DefaultGraphWalker).dispatch(Node, Stack<Node>) line: 78	
PreOrderWalker.walk(Node) line: 54	
PreOrderWalker.walk(Node) line: 59	
PreOrderWalker.walk(Node) line: 59	
PreOrderWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>)
line: 109	
AnnotateWithStatistics.transform(ParseContext) line: 78	
TezCompiler.runStatsAnnotation(OptimizeTezProcContext) line: 248	
TezCompiler.optimizeOperatorPlan(ParseContext, Set<ReadEntity>, Set<WriteEntity>)
line: 120	
TezCompiler(TaskCompiler).compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>,
HashSet<WriteEntity>) line: 99	
SemanticAnalyzer.analyzeInternal(ASTNode) line: 10037	
SemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221	
ExplainSemanticAnalyzer.analyzeInternal(ASTNode) line: 74	
ExplainSemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221	
Driver.compile(String, boolean) line: 415	
{code}

Query
{code}

select
  ss_item_sk item_sk, d_date, sum(ss_sales_price),
  sum(sum(ss_sales_price))
      over (partition by ss_item_sk order by d_date rows between unbounded preceding and current
row) cume_sales
from store_sales
    ,date_dim
where ss_sold_date_sk=d_date_sk
  and d_month_seq between 1193 and 1193+11
  and ss_item_sk is not NULL
group by ss_item_sk, d_date
{code}

Plan 
Notice in the Map join operator the number of rows drop from 82,510,879,939 to 36524 after
the join.
{code}
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Map 1 <- Map 4 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      DagName: mmokhtar_20141019131818_086d663a-5621-456c-bf25-8ccb7112ee3b:6
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_item_sk is not null (type: boolean)
                  Statistics: Num rows: 82510879939 Data size: 6873789738208 Basic stats:
COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_item_sk is not null (type: boolean)
                    Statistics: Num rows: 82510879939 Data size: 652315818272 Basic stats:
COMPLETE Column stats: COMPLETE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {ss_item_sk} {ss_sales_price} {ss_sold_date_sk}
                        1 {d_date_sk} {d_date} {d_month_seq}
                      keys:
                        0 ss_sold_date_sk (type: int)
                        1 d_date_sk (type: int)
                      outputColumnNames: _col1, _col12, _col22, _col26, _col28, _col29
                      input vertices:
                        1 Map 4
                      Statistics: Num rows: 36524 Data size: 4163736 Basic stats: COMPLETE
Column stats: COMPLETE
                      Filter Operator
                        predicate: (((_col22 = _col26) and _col29 BETWEEN 1193 AND 1204) and
_col1 is not null) (type: boolean)
                        Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE
Column stats: COMPLETE
                        Select Operator
                          expressions: _col1 (type: int), _col28 (type: string), _col12 (type:
float)
                          outputColumnNames: _col1, _col28, _col12
                          Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE
Column stats: COMPLETE
                          Group By Operator
                            aggregations: sum(_col12)
                            keys: _col1 (type: int), _col28 (type: string)
                            mode: hash
                            outputColumnNames: _col0, _col1, _col2
                            Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE
Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int), _col1 (type: string)
                              sort order: ++
                              Map-reduce partition columns: _col0 (type: int), _col1 (type:
string)
                              Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE
Column stats: COMPLETE
                              value expressions: _col2 (type: double)
            Execution mode: vectorized
        Map 4
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204)
(type: boolean)
                  Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column
stats: COMPLETE
                  Filter Operator
                    predicate: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204)
(type: boolean)
                    Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column
stats: COMPLETE
                    Reduce Output Operator
                      key expressions: d_date_sk (type: int)
                      sort order: +
                      Map-reduce partition columns: d_date_sk (type: int)
                      Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE
Column stats: COMPLETE
                      value expressions: d_date (type: string), d_month_seq (type: int)
                    Select Operator
                      expressions: d_date_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column
stats: COMPLETE
                      Group By Operator
                        keys: _col0 (type: int)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column
stats: COMPLETE
                        Dynamic Partitioning Event Operator
                          Target Input: store_sales
                          Partition key expr: ss_sold_date_sk
                          Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column
stats: COMPLETE
                          Target column: ss_sold_date_sk
                          Target Vertex: Map 1
            Execution mode: vectorized
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: sum(VALUE._col0)
                keys: KEY._col0 (type: int), KEY._col1 (type: string)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column
stats: COMPLETE
                Reduce Output Operator
                  key expressions: _col0 (type: int), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column
stats: COMPLETE
                  value expressions: _col0 (type: int), _col1 (type: string), _col2 (type:
double)
            Execution mode: vectorized
        Reducer 3
            Reduce Operator Tree:
              Extract
                Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column
stats: COMPLETE
                PTF Operator
                  Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column
stats: COMPLETE
                  Select Operator
                    expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double),
_wcol0 (type: double)
                    outputColumnNames: _col0, _col1, _col2, _col3
                    Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column
stats: COMPLETE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column
stats: COMPLETE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message