hive-dev mailing list archives

From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-8517) When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression
Date Tue, 21 Oct 2014 17:52:34 GMT

     [ https://issues.apache.org/jira/browse/HIVE-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mostafa Mokhtar updated HIVE-8517:
----------------------------------
    Status: Patch Available  (was: In Progress)

New patch that calculates the average column length, which was missing from the partitioned
column statistics.

> When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8517
>                 URL: https://issues.apache.org/jira/browse/HIVE-8517
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 0.14.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Mostafa Mokhtar
>            Priority: Critical
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8517.1.patch, HIVE-8517.2.patch, HIVE-8517.3.patch
>
>
> When joining on a partition column, the number of partitions is used as the NDV, but StatsUtils.getColStatisticsFromExpression overrides it and replaces it with the number of rows. This results in the same behavior as described in https://issues.apache.org/jira/browse/HIVE-8196:
> "Joining on partition columns with fetch column stats enabled results in very small CE, which negatively affects query performance."
> This is the call stack.
> {code}
> StatsUtils.getColStatisticsFromExpression(HiveConf, Statistics, ExprNodeDesc) line: 1001
> StatsRulesProcFactory$ReduceSinkStatsRule.process(Node, Stack<Node>, NodeProcessorCtx, Object...) line: 1479
> DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90
> PreOrderWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94
> PreOrderWalker(DefaultGraphWalker).dispatch(Node, Stack<Node>) line: 78
> PreOrderWalker.walk(Node) line: 54
> PreOrderWalker.walk(Node) line: 59
> PreOrderWalker.walk(Node) line: 59
> PreOrderWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>) line: 109
> AnnotateWithStatistics.transform(ParseContext) line: 78
> TezCompiler.runStatsAnnotation(OptimizeTezProcContext) line: 248
> TezCompiler.optimizeOperatorPlan(ParseContext, Set<ReadEntity>, Set<WriteEntity>) line: 120
> TezCompiler(TaskCompiler).compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>, HashSet<WriteEntity>) line: 99
> SemanticAnalyzer.analyzeInternal(ASTNode) line: 10037
> SemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221
> ExplainSemanticAnalyzer.analyzeInternal(ASTNode) line: 74
> ExplainSemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221
> Driver.compile(String, boolean) line: 415
> {code}
> Query
> {code}
> select
>   ss_item_sk item_sk, d_date, sum(ss_sales_price),
>   sum(sum(ss_sales_price))
>       over (partition by ss_item_sk order by d_date rows between unbounded preceding and current row) cume_sales
> from store_sales
>     ,date_dim
> where ss_sold_date_sk=d_date_sk
>   and d_month_seq between 1193 and 1193+11
>   and ss_item_sk is not NULL
> group by ss_item_sk, d_date
> {code}
> Plan
> Notice in the Map Join operator that the number of rows drops from 82,510,879,939 to 36,524 after the join.
> {code}
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Map 1 <- Map 4 (BROADCAST_EDGE)
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>       DagName: mmokhtar_20141019131818_086d663a-5621-456c-bf25-8ccb7112ee3b:6
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   filterExpr: ss_item_sk is not null (type: boolean)
>                   Statistics: Num rows: 82510879939 Data size: 6873789738208 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ss_item_sk is not null (type: boolean)
>                     Statistics: Num rows: 82510879939 Data size: 652315818272 Basic stats: COMPLETE Column stats: COMPLETE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {ss_item_sk} {ss_sales_price} {ss_sold_date_sk}
>                         1 {d_date_sk} {d_date} {d_month_seq}
>                       keys:
>                         0 ss_sold_date_sk (type: int)
>                         1 d_date_sk (type: int)
>                       outputColumnNames: _col1, _col12, _col22, _col26, _col28, _col29
>                       input vertices:
>                         1 Map 4
>                       Statistics: Num rows: 36524 Data size: 4163736 Basic stats: COMPLETE Column stats: COMPLETE
>                       Filter Operator
>                         predicate: (((_col22 = _col26) and _col29 BETWEEN 1193 AND 1204) and _col1 is not null) (type: boolean)
>                         Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE
>                         Select Operator
>                           expressions: _col1 (type: int), _col28 (type: string), _col12 (type: float)
>                           outputColumnNames: _col1, _col28, _col12
>                           Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE
>                           Group By Operator
>                             aggregations: sum(_col12)
>                             keys: _col1 (type: int), _col28 (type: string)
>                             mode: hash
>                             outputColumnNames: _col0, _col1, _col2
>                             Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE
>                             Reduce Output Operator
>                               key expressions: _col0 (type: int), _col1 (type: string)
>                               sort order: ++
>                               Map-reduce partition columns: _col0 (type: int), _col1 (type: string)
>                               Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE
>                               value expressions: _col2 (type: double)
>             Execution mode: vectorized
>         Map 4
>             Map Operator Tree:
>                 TableScan
>                   alias: date_dim
>                   filterExpr: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean)
>                   Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean)
>                     Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       key expressions: d_date_sk (type: int)
>                       sort order: +
>                       Map-reduce partition columns: d_date_sk (type: int)
>                       Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE
>                       value expressions: d_date (type: string), d_month_seq (type: int)
>                     Select Operator
>                       expressions: d_date_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
>                       Group By Operator
>                         keys: _col0 (type: int)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
>                         Dynamic Partitioning Event Operator
>                           Target Input: store_sales
>                           Partition key expr: ss_sold_date_sk
>                           Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
>                           Target column: ss_sold_date_sk
>                           Target Vertex: Map 1
>             Execution mode: vectorized
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: sum(VALUE._col0)
>                 keys: KEY._col0 (type: int), KEY._col1 (type: string)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: int), _col1 (type: string)
>                   sort order: ++
>                   Map-reduce partition columns: _col0 (type: int)
>                   Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
>                   value expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double)
>             Execution mode: vectorized
>         Reducer 3
>             Reduce Operator Tree:
>               Extract
>                 Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
>                 PTF Operator
>                   Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double), _wcol0 (type: double)
>                     outputColumnNames: _col0, _col1, _col2, _col3
>                     Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE
>                       table:
>                           input format: org.apache.hadoop.mapred.TextInputFormat
>                           output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {code}
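The drop from 82,510,879,939 rows to 36,524 in the Map Join operator above is what the standard join-cardinality estimate produces once the partition column's NDV is overridden by the row count. A minimal sketch, assuming the textbook formula |R ⋈ S| ≈ |R|·|S| / max(NDV(R.key), NDV(S.key)) as an approximation of what the stats annotation computes; the partition count of 1823 is a hypothetical value for illustration, not taken from this issue:

```python
# Textbook join-cardinality estimate used by cost-based optimizers:
#   |R join S| ~= |R| * |S| / max(NDV(R.key), NDV(S.key))
def estimate_join_rows(rows_left, rows_right, ndv_left, ndv_right):
    return (rows_left * rows_right) // max(ndv_left, ndv_right)

STORE_SALES_ROWS = 82_510_879_939  # Num rows of store_sales from the plan
DATE_DIM_ROWS = 36_524             # date_dim rows surviving the d_month_seq filter

# Bug: the partition column's NDV is overridden with the row count,
# so the estimate collapses to the size of the small side of the join.
buggy = estimate_join_rows(STORE_SALES_ROWS, DATE_DIM_ROWS,
                           ndv_left=STORE_SALES_ROWS,  # NDV replaced by row count
                           ndv_right=DATE_DIM_ROWS)
print(buggy)   # 36524, the underestimate seen in the Map Join operator

# Expected: the NDV of a partition column is the number of partitions
# (1823 is a hypothetical partition count, for illustration only).
fixed = estimate_join_rows(STORE_SALES_ROWS, DATE_DIM_ROWS,
                           ndv_left=1_823,
                           ndv_right=DATE_DIM_ROWS)
print(fixed)   # 82510879939, i.e. the join barely reduces the row count
```

With the partition-count NDV the join preserves roughly the full store_sales cardinality, which is the behavior HIVE-8196 argued for; with the row-count NDV the estimate collapses by six orders of magnitude.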



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
