hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-10194) CBO (Calcite Return Path): Q94 generates cross product
Date Thu, 02 Apr 2015 17:56:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393043#comment-14393043
] 

Mostafa Mokhtar commented on HIVE-10194:
----------------------------------------

[~jcamachorodriguez] [~jpullokkaran]
The query can be simplified to an equi join followed by theta join :

{code}
SELECT 
    ws2.ws_order_number as ws_order_number
FROM
    web_sales ws2
        JOIN
    web_sales ws3 ON (ws2.ws_order_number = ws3.ws_order_number)
        JOIN
    date_dim d ON (ws2.ws_ship_date_sk = d.d_date_sk)
WHERE
    ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk
{code}


> CBO (Calcite Return Path): Q94 generates cross product 
> -------------------------------------------------------
>
>                 Key: HIVE-10194
>                 URL: https://issues.apache.org/jira/browse/HIVE-10194
>             Project: Hive
>          Issue Type: Sub-task
>          Components: CBO
>            Reporter: Mostafa Mokhtar
>            Assignee: Jesus Camacho Rodriguez
>             Fix For: 1.2.0
>
>
> Query 
> {code}
> SELECT count(distinct ws_order_number) as order_count,
>                sum(ws_ext_ship_cost) as total_shipping_cost,
>                sum(ws_net_profit) as total_net_profit
> FROM web_sales ws1
> JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
> JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
> JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
> LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number
>                                FROM web_sales ws2 JOIN web_sales ws3
>                                ON (ws2.ws_order_number = ws3.ws_order_number)
>                                WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk
> 			) ws_wh1
> ON (ws1.ws_order_number = ws_wh1.ws_order_number)
> LEFT OUTER JOIN web_returns wr1 ON (ws1.ws_order_number = wr1.wr_order_number)
> WHERE d.d_date between '1999-05-01' and '1999-07-01' and
>                ca.ca_state = 'TX' and
>                s.web_company_name = 'pri' and
>                wr1.wr_order_number is null
> limit 100
> {code}
> Plan
> {code}
> OK
> Time taken: 0.23 seconds
> Warning: Map Join MAPJOIN[83][bigTable=ws1] in task 'Map 2' is a cross product
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Map 2 <- Map 1 (BROADCAST_EDGE)
>         Map 8 <- Reducer 4 (BROADCAST_EDGE)
>         Reducer 3 <- Map 2 (SIMPLE_EDGE), Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE),
Map 7 (SIMPLE_EDGE)
>         Reducer 4 <- Map 10 (SIMPLE_EDGE), Reducer 3 (SIMPLE_EDGE)
>         Reducer 9 <- Map 8 (SIMPLE_EDGE)
>       DagName: mmokhtar_20150402132417_1bc8688b-59a0-4909-82a4-b9d386065bbd:3
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: ws1
>                   filterExpr: (((ws_ship_addr_sk = ws_order_number) and (ws_ship_date_sk
<> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: boolean)
>                   Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats:
COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (((ws_ship_addr_sk = ws_order_number) and (ws_ship_date_sk
<> ws_web_site_sk)) and ws_ship_addr_sk is not null) (type: boolean)
>                     Statistics: Num rows: 71974471 Data size: 1151483592 Basic stats:
COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: ws_ship_addr_sk (type: int)
>                       outputColumnNames: _col1
>                       Statistics: Num rows: 71974471 Data size: 287862044 Basic stats:
COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         sort order:
>                         Statistics: Num rows: 71974471 Data size: 287862044 Basic stats:
COMPLETE Column stats: COMPLETE
>                         value expressions: _col1 (type: int)
>             Execution mode: vectorized
>         Map 10
>             Map Operator Tree:
>                 TableScan
>                   alias: wr1
>                   Statistics: Num rows: 13749816 Data size: 2585240312 Basic stats: COMPLETE
Column stats: COMPLETE
>                   Reduce Output Operator
>                     key expressions: wr_order_number (type: int)
>                     sort order: +
>                     Map-reduce partition columns: wr_order_number (type: int)
>                     Statistics: Num rows: 13749816 Data size: 2585240312 Basic stats:
COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: ws1
>                   Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats:
COMPLETE Column stats: COMPLETE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     keys:
>                       0
>                       1
>                     outputColumnNames: _col1
>                     input vertices:
>                       0 Map 1
>                     Statistics: Num rows: 5180969438964472 Data size: 20723877755857888
Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col1 (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5180969438964472 Data size: 20723877755857888
Basic stats: COMPLETE Column stats: COMPLETE
>                       Group By Operator
>                         keys: _col0 (type: int)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 73333928460636 Data size: 293335713842544
Basic stats: COMPLETE Column stats: COMPLETE
>                         Reduce Output Operator
>                           key expressions: _col0 (type: int)
>                           sort order: +
>                           Map-reduce partition columns: _col0 (type: int)
>                           Statistics: Num rows: 73333928460636 Data size: 293335713842544
Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 5
>             Map Operator Tree:
>                 TableScan
>                   alias: ca
>                   filterExpr: ((ca_state = 'TX') and ca_address_sk is not null) (type:
boolean)
>                   Statistics: Num rows: 800000 Data size: 811903688 Basic stats: COMPLETE
Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ((ca_state = 'TX') and ca_address_sk is not null) (type:
boolean)
>                     Statistics: Num rows: 15686 Data size: 1411740 Basic stats: COMPLETE
Column stats: COMPLETE
>                     Select Operator
>                       expressions: ca_address_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 15686 Data size: 62744 Basic stats: COMPLETE
Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 15686 Data size: 62744 Basic stats: COMPLETE
Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 6
>             Map Operator Tree:
>                 TableScan
>                   alias: d
>                   filterExpr: (d_date BETWEEN '1999-05-01' AND '1999-07-01' and d_date_sk
is not null) (type: boolean)
>                   Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE
Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (d_date BETWEEN '1999-05-01' AND '1999-07-01' and d_date_sk
is not null) (type: boolean)
>                     Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE
Column stats: COMPLETE
>                     Select Operator
>                       expressions: d_date_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE
Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE
Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 7
>             Map Operator Tree:
>                 TableScan
>                   alias: ws1
>                   filterExpr: (((ws_order_number is not null and ws_ship_addr_sk is not
null) and ws_ship_date_sk is not null) and ws_web_site_sk is not null) (type: boolean)
>                   Statistics: Num rows: 143966864 Data size: 33110363004 Basic stats:
COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (((ws_order_number is not null and ws_ship_addr_sk is
not null) and ws_ship_date_sk is not null) and ws_web_site_sk is not null) (type: boolean)
>                     Statistics: Num rows: 143912892 Data size: 3453621540 Basic stats:
COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       key expressions: ws_order_number (type: int)
>                       sort order: +
>                       Map-reduce partition columns: ws_order_number (type: int)
>                       Statistics: Num rows: 143912892 Data size: 3453621540 Basic stats:
COMPLETE Column stats: COMPLETE
>                       value expressions: ws_ship_date_sk (type: int), ws_ship_addr_sk
(type: int), ws_web_site_sk (type: int), ws_ext_ship_cost (type: float), ws_net_profit (type:
float)
>             Execution mode: vectorized
>         Map 8
>             Map Operator Tree:
>                 TableScan
>                   alias: s
>                   filterExpr: ((web_company_name = 'pri') and web_site_sk is not null)
(type: boolean)
>                   Statistics: Num rows: 38 Data size: 70614 Basic stats: COMPLETE Column
stats: COMPLETE
>                   Filter Operator
>                     predicate: ((web_company_name = 'pri') and web_site_sk is not null)
(type: boolean)
>                     Statistics: Num rows: 5 Data size: 460 Basic stats: COMPLETE Column
stats: COMPLETE
>                     Select Operator
>                       expressions: web_site_sk (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5 Data size: 20 Basic stats: COMPLETE Column
stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col30 (type: int)
>                           1 _col0 (type: int)
>                         outputColumnNames: _col31, _col32, _col33
>                         input vertices:
>                           0 Reducer 4
>                         Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE Column
stats: NONE
>                         Select Operator
>                           expressions: _col31 (type: int), _col32 (type: float), _col33
(type: float)
>                           outputColumnNames: _col0, _col1, _col2
>                           Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE
Column stats: NONE
>                           Group By Operator
>                             aggregations: count(DISTINCT _col0), sum(_col1), sum(_col2)
>                             keys: _col0 (type: int)
>                             mode: hash
>                             outputColumnNames: _col0, _col1, _col2, _col3
>                             Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE
Column stats: NONE
>                             Reduce Output Operator
>                               key expressions: _col0 (type: int)
>                               sort order: +
>                               Statistics: Num rows: 5 Data size: 22 Basic stats: COMPLETE
Column stats: NONE
>                               TopN Hash Memory Usage: 0.04
>                               value expressions: _col2 (type: double), _col3 (type: double)
>             Execution mode: vectorized
>         Reducer 3
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Left Semi Join 0 to 1
>                 keys:
>                   0 ws_order_number (type: int)
>                   1 _col0 (type: int)
>                 outputColumnNames: _col2, _col11, _col13, _col17, _col28, _col33
>                 Statistics: Num rows: 718857633877870 Data size: 17252583213068880 Basic
stats: COMPLETE Column stats: COMPLETE
>                 Map Join Operator
>                   condition map:
>                        Inner Join 0 to 1
>                   keys:
>                     0 _col11 (type: int)
>                     1 _col0 (type: int)
>                   outputColumnNames: _col2, _col13, _col17, _col28, _col33
>                   input vertices:
>                     1 Map 5
>                   Statistics: Num rows: 14094999486464 Data size: 281899989729280 Basic
stats: COMPLETE Column stats: COMPLETE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     keys:
>                       0 _col2 (type: int)
>                       1 _col0 (type: int)
>                     outputColumnNames: _col13, _col17, _col28, _col33
>                     input vertices:
>                       1 Map 6
>                     Statistics: Num rows: 7047403274240 Data size: 112758452387840 Basic
stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col13 (type: int), _col17 (type: int), _col28 (type:
float), _col33 (type: float)
>                       outputColumnNames: _col2, _col3, _col4, _col5
>                       Statistics: Num rows: 7047403274240 Data size: 112758452387840
Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col3 (type: int)
>                         sort order: +
>                         Map-reduce partition columns: _col3 (type: int)
>                         Statistics: Num rows: 7047403274240 Data size: 112758452387840
Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col2 (type: int), _col4 (type: float), _col5
(type: float)
>         Reducer 4
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Right Outer Join0 to 1
>                 keys:
>                   0 wr_order_number (type: int)
>                   1 _col3 (type: int)
>                 outputColumnNames: _col13, _col30, _col31, _col32, _col33
>                 Statistics: Num rows: 12100482980189 Data size: 242009659603780 Basic
stats: COMPLETE Column stats: COMPLETE
>                 Filter Operator
>                   predicate: _col13 is null (type: boolean)
>                   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats:
COMPLETE
>                   Reduce Output Operator
>                     key expressions: _col30 (type: int)
>                     sort order: +
>                     Map-reduce partition columns: _col30 (type: int)
>                     Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats:
COMPLETE
>                     value expressions: _col31 (type: int), _col32 (type: float), _col33
(type: float)
>         Reducer 9
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(DISTINCT KEY._col0:0._col0), sum(VALUE._col1), sum(VALUE._col2)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats:
NONE
>                 Limit
>                   Number of rows: 100
>                   Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column
stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column
stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: 100
>       Processor Tree:
>         ListSink
> {code}
> Logical plan 
> {code}
> HiveSort(fetch=[100]): rowcount = 1.0, cumulative cost = {4.594004456323317E8 rows, 5.882012320482085E9
cpu, 9.353802456E12 io}, id = 1080
>   HiveAggregate(group=[{}], agg#0=[count(DISTINCT $0)], agg#1=[sum($1)], agg#2=[sum($2)]):
rowcount = 1.0, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12
io}, id = 1078
>     HiveProject($f0=[$3], $f1=[$4], $f2=[$5]): rowcount = 464.4485504122314, cumulative
cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1076
>       HiveFilter(condition=[isnull($12)]): rowcount = 464.4485504122314, cumulative cost
= {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io}, id = 1074
>         HiveProject(ws_ship_date_sk=[$1], ws_ship_addr_sk=[$2], ws_web_site_sk=[$3],
ws_order_number=[$4], ws_ext_ship_cost=[$5], ws_net_profit=[$6], ca_address_sk=[$7], ca_state=[$8],
web_site_sk=[$11], web_company_name=[$12], d_date_sk=[$9], d_date=[$10], wr_order_number=[$0]):
rowcount = 1229395.3129411766, cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9
cpu, 9.353802456E12 io}, id = 1185
>           HiveJoin(condition=[=($3, $11)], joinType=[inner], joinAlgorithm=[map_join],
cost=[{1.3749818235294119E7 rows, 1.3749820470588237E7 cpu, 0.0 io}]): rowcount = 1229395.3129411766,
cumulative cost = {4.594004336323317E8 rows, 5.882012319482085E9 cpu, 9.353802456E12 io},
id = 1183
>             HiveJoin(condition=[=($4, $0)], joinType=[right], joinAlgorithm=[map_join],
cost=[{1.3749817E7 rows, 2.7499633E7 cpu, 2.47496688E10 io}]): rowcount = 1.3749816E7, cumulative
cost = {4.456506153970376E8 rows, 5.868262499011497E9 cpu, 9.353802456E12 io}, id = 1181
>               HiveTableScan(table=[[tpcds_bin_orc_200.web_returns]]): rowcount = 1.3749816E7,
cumulative cost = {0}, id = 974
>               HiveJoin(condition=[=($0, $8)], joinType=[inner], joinAlgorithm=[map_join],
cost=[{204.39703763146147 rows, 205.39703763146147 cpu, 0.0 io}]): rowcount = 1.0, cumulative
cost = {4.319007983970376E8 rows, 5.840762866011497E9 cpu, 9.3290527872E12 io}, id = 1179
>                 HiveJoin(condition=[=($1, $6)], joinType=[inner], joinAlgorithm=[map_join],
cost=[{1.43966865E8 rows, 1.43966866E8 cpu, 0.0 io}]): rowcount = 203.39703763146147, cumulative
cost = {4.31900594E8 rows, 5.840762660614459E9 cpu, 9.3290527872E12 io}, id = 1110
>                   SemiJoin(condition=[=($3, $6)], joinType=[inner]): rowcount = 1.43966864E8,
cumulative cost = {2.87933729E8 rows, 5.696795794614459E9 cpu, 9.3290527872E12 io}, id = 1058
>                     HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount =
1.43966864E8, cumulative cost = {0}, id = 958
>                     HiveProject(ws_order_number=[$1]): rowcount = 1.655760644524185E11,
cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1056
>                       HiveFilter(condition=[<>($0, $2)]): rowcount = 1.655760644524185E11,
cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1054
>                         HiveJoin(condition=[=($1, $3)], joinType=[inner], joinAlgorithm=[common_join],
cost=[{2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}]): rowcount = 1.6564734127740878E11,
cumulative cost = {2.87933728E8 rows, 5.696795793614459E9 cpu, 9.3290527872E12 io}, id = 1103
>                           HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount
= 1.43966864E8, cumulative cost = {0}, id = 958
>                           HiveTableScan(table=[[tpcds_bin_orc_200.web_sales]]): rowcount
= 1.43966864E8, cumulative cost = {0}, id = 958
>                   HiveFilter(condition=[=($1, 'TX')]): rowcount = 1.0, cumulative cost
= {0.0 rows, 0.0 cpu, 0.0 io}, id = 1060
>                     HiveTableScan(table=[[tpcds_bin_orc_200.customer_address]]): rowcount
= 800000.0, cumulative cost = {0}, id = 965
>                 HiveFilter(condition=[between(false, $1, '1999-05-01', '1999-07-01')]):
rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1068
>                   HiveTableScan(table=[[tpcds_bin_orc_200.date_dim]]): rowcount = 73049.0,
cumulative cost = {0}, id = 971
>             HiveFilter(condition=[=($1, 'pri')]): rowcount = 2.235294117647059, cumulative
cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1064
>               HiveTableScan(table=[[tpcds_bin_orc_200.web_site]]): rowcount = 38.0, cumulative
cost = {0}, id = 968
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message