hive-dev mailing list archives

From "Szehon Ho (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8701) Combine nested map joins into the parent map join if possible [Spark Branch]
Date Thu, 20 Nov 2014 02:11:33 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218898#comment-14218898 ]

Szehon Ho commented on HIVE-8701:
---------------------------------

Hi Suhas, sure. The plan I see looks like this, for a modified auto_join2 that forces the map joins into the same operator tree:

{noformat}
STAGE DEPENDENCIES:
  Stage-3 is a root stage
  Stage-1 depends on stages: Stage-3
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-3
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: src2
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 250 Data size: 2656 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      condition expressions:
                        0 {key}
                        1 
                      keys:
                        0 key (type: string)
                        1 key (type: string)
            Local Work:
              Map Reduce Local Work
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: smalltable
                  Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE
                  Filter Operator
                    predicate: UDFToDouble(key) is not null (type: boolean)
                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
                    Spark HashTable Sink Operator
                      condition expressions:
                        0 {_col0} {_col5}
                        1 {key}
                      keys:
                        0 (_col0 + _col5) (type: double)
                        1 UDFToDouble(key) (type: double)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: src1
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 250 Data size: 2656 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {key}
                        1 {key}
                      keys:
                        0 key (type: string)
                        1 key (type: string)
                      outputColumnNames: _col0, _col5
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 275 Data size: 2921 Basic stats: COMPLETE Column stats: NONE
                      Filter Operator
                        predicate: (_col0 + _col5) is not null (type: boolean)
                        Statistics: Num rows: 138 Data size: 1465 Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          condition expressions:
                            0 {_col0} {_col5}
                            1 {key}
                          keys:
                            0 (_col0 + _col5) (type: double)
                            1 UDFToDouble(key) (type: double)
                          outputColumnNames: _col0, _col5, _col10
                          input vertices:
                            1 Map 3
                          Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col0 (type: string), _col5 (type: string), _col10 (type: string)
                            outputColumnNames: _col0, _col1, _col2
                            Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
                            File Output Operator
                              compressed: false
                              Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
                              table:
                                  input format: org.apache.hadoop.mapred.TextInputFormat
                                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{noformat}

The issue is that there are two map joins in the same work, which is actually good most of the time, but we should make sure we don't overwhelm the executor memory in that case. The check should be straightforward in theory: when calculating the table size, also include the size of any parent map join that is directly connected (i.e., no RS or HTS in between).
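
To illustrate, here is a minimal sketch of that check in Java (the Op class, field names, and sizes are hypothetical stand-ins, not the actual Hive operator classes): starting from a map join, walk up the operator tree and sum the small-table sizes of every map join reached without crossing a ReduceSink (RS) or Spark HashTable Sink (HTS).

{code:java}
// Hypothetical sketch (not actual Hive code): estimate the combined small-table
// footprint of a map join, counting any parent map join that is directly
// connected, i.e. reachable without crossing an RS or HTS boundary.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class Op {
    enum Kind { MAPJOIN, REDUCESINK, HASHTABLESINK, OTHER }
    final Kind kind;
    final long smallTableSize;   // estimated size of the hash-table side, in bytes
    final List<Op> parents;

    Op(Kind kind, long smallTableSize, List<Op> parents) {
        this.kind = kind;
        this.smallTableSize = smallTableSize;
        this.parents = parents;
    }
}

public class MapJoinMemoryCheck {

    /** Sum the small-table sizes of this map join and of every parent map join
     *  reachable without passing through an RS or HTS (i.e. in the same work). */
    static long combinedSmallTableSize(Op mapJoin) {
        long total = 0;
        Deque<Op> stack = new ArrayDeque<>();
        stack.push(mapJoin);
        while (!stack.isEmpty()) {
            Op op = stack.pop();
            if (op.kind == Op.Kind.REDUCESINK || op.kind == Op.Kind.HASHTABLESINK) {
                continue;   // boundary of the work: that branch is not resident here
            }
            if (op.kind == Op.Kind.MAPJOIN) {
                total += op.smallTableSize;
            }
            for (Op parent : op.parents) {
                stack.push(parent);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Mirrors Map 2 in the plan above: two chained map joins whose hash
        // tables would both be loaded by the same executor task.
        Op scanSrc1   = new Op(Op.Kind.OTHER,   0,    List.of());
        Op firstJoin  = new Op(Op.Kind.MAPJOIN, 2656, List.of(scanSrc1));   // src2 hash table
        Op filter     = new Op(Op.Kind.OTHER,   0,    List.of(firstJoin));
        Op secondJoin = new Op(Op.Kind.MAPJOIN, 30,   List.of(filter));     // smalltable hash table

        System.out.println("combined small-table size = " + combinedSmallTableSize(secondJoin));
    }
}
{code}

The conversion would then only go ahead if that combined size still fits under the map-join threshold (hive.auto.convert.join.noconditionaltask.size).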

> Combine nested map joins into the parent map join if possible [Spark Branch]
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-8701
>                 URL: https://issues.apache.org/jira/browse/HIVE-8701
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Szehon Ho
>
> With the work in HIVE-8616 enabled, the generated plan shows that the nested map join operator isn't merged into its parent when possible. This is demonstrated in auto_join2.q. The MR plan shows that this optimization is in place. We should do the same for Spark.
> {code}
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Map 2 <- Map 3 (NONE, 0)
>         Map 3 <- Map 1 (NONE, 0)
>       DagName: xzhang_20141102074141_ac089634-bf01-4386-b1cf-3e7f2e99f6eb:3
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: src2
>                   Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: key is not null (type: boolean)
>                     Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: key (type: string)
>                       sort order: +
>                       Map-reduce partition columns: key (type: string)
>                       Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
>         Map 2 
>             Map Operator Tree:
>                 TableScan
>                   alias: src3
>                   Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: UDFToDouble(key) is not null (type: boolean)
>                     Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {_col0}
>                         1 {value}
>                       keys:
>                         0 (_col0 + _col5) (type: double)
>                         1 UDFToDouble(key) (type: double)
>                       outputColumnNames: _col0, _col11
>                       input vertices:
>                         0 Map 3
>                       Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
>                       Select Operator
>                         expressions: _col0 (type: string), _col11 (type: string)
>                         outputColumnNames: _col0, _col1
>                         Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
>                         File Output Operator
>                           compressed: false
>                           Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
>                           table:
>                               input format: org.apache.hadoop.mapred.TextInputFormat
>                               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Map 3 
>             Map Operator Tree:
>                 TableScan
>                   alias: src1
>                   Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: key is not null (type: boolean)
>                     Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {key}
>                         1 {key}
>                       keys:
>                         0 key (type: string)
>                         1 key (type: string)
>                       outputColumnNames: _col0, _col5
>                       input vertices:
>                         1 Map 1
>                       Statistics: Num rows: 31 Data size: 3196 Basic stats: COMPLETE Column stats: NONE
>                       Filter Operator
>                         predicate: (_col0 + _col5) is not null (type: boolean)
>                         Statistics: Num rows: 16 Data size: 1649 Basic stats: COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           key expressions: (_col0 + _col5) (type: double)
>                           sort order: +
>                           Map-reduce partition columns: (_col0 + _col5) (type: double)
>                           Statistics: Num rows: 16 Data size: 1649 Basic stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: string)
> {code}



