hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rich Haase (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-11132) Queries using join and group by produce incorrect output when hive.auto.convert.join=false and hive.optimize.reducededuplication=true
Date Fri, 26 Jun 2015 22:29:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-11132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603705#comment-14603705
] 

Rich Haase commented on HIVE-11132:
-----------------------------------

Explain plan when hive.auto.convert.join=false and hive.optimize.reducededuplication=true:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: mooo
            Statistics: Num rows: 1511511 Data size: 3402058087 Basic stats: COMPLETE Column
stats: NONE
            Filter Operator
              predicate: (((oppty_id is not null and oppty_line_id is not null) and (order_order_system
<> 'sfdc_performance')) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
              Statistics: Num rows: 188939 Data size: 425257542 Basic stats: COMPLETE Column
stats: NONE
              Reduce Output Operator
                key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type:
string)
                sort order: ++
                Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id
(type: string)
                Statistics: Num rows: 188939 Data size: 425257542 Basic stats: COMPLETE Column
stats: NONE
          TableScan
            alias: mooo_s
            Statistics: Num rows: 1511511 Data size: 940228122 Basic stats: COMPLETE Column
stats: NONE
            Filter Operator
              predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id
= '0064000000CZbnWAAT')) (type: boolean)
              Statistics: Num rows: 188939 Data size: 117528593 Basic stats: COMPLETE Column
stats: NONE
              Reduce Output Operator
                key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type:
string)
                sort order: ++
                Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id
(type: string)
                Statistics: Num rows: 188939 Data size: 117528593 Basic stats: COMPLETE Column
stats: NONE
          TableScan
            alias: forecast
            Statistics: Num rows: 29923099 Data size: 7723657280 Basic stats: COMPLETE Column
stats: NONE
            Filter Operator
              predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id
= '0064000000CZbnWAAT')) (type: boolean)
              Statistics: Num rows: 3740387 Data size: 965457063 Basic stats: COMPLETE Column
stats: NONE
              Reduce Output Operator
                key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type:
string)
                sort order: ++
                Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id
(type: string)
                Statistics: Num rows: 3740387 Data size: 965457063 Basic stats: COMPLETE Column
stats: NONE
          TableScan
            alias: split
            Statistics: Num rows: 2072636 Data size: 524862652 Basic stats: COMPLETE Column
stats: NONE
            Filter Operator
              predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id
= '0064000000CZbnWAAT')) (type: boolean)
              Statistics: Num rows: 259079 Data size: 65607704 Basic stats: COMPLETE Column
stats: NONE
              Reduce Output Operator
                key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type:
string)
                sort order: ++
                Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id
(type: string)
                Statistics: Num rows: 259079 Data size: 65607704 Basic stats: COMPLETE Column
stats: NONE
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
               Inner Join 0 to 2
               Inner Join 0 to 3
          condition expressions:
            0
            1
            2
            3
          Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column
stats: NONE
          Select Operator
            expressions: '0064000000CZbnWAAT' (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column
stats: NONE
            Group By Operator
              aggregations: count()
              keys: _col0 (type: string)
              mode: complete
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column
stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: bigint)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE
Column stats: NONE
                File Output Operator
                  compressed: true
                  Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE
Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Explain plan when hive.auto.convert.join=true and hive.optimize.reducededuplication=true

STAGE DEPENDENCIES:
  Stage-10 is a root stage , consists of Stage-1
  Stage-1
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-10
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: mooo
            Statistics: Num rows: 1511511 Data size: 3402058087 Basic stats: COMPLETE Column
stats: NONE
            Filter Operator
              predicate: (((oppty_id is not null and oppty_line_id is not null) and (order_order_system
<> 'sfdc_performance')) and (oppty_id = '0064000000CZbnWAAT')) (type: boolean)
              Statistics: Num rows: 188939 Data size: 425257542 Basic stats: COMPLETE Column
stats: NONE
              Reduce Output Operator
                key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type:
string)
                sort order: ++
                Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id
(type: string)
                Statistics: Num rows: 188939 Data size: 425257542 Basic stats: COMPLETE Column
stats: NONE
          TableScan
            alias: mooo_s
            Statistics: Num rows: 1511511 Data size: 940228122 Basic stats: COMPLETE Column
stats: NONE
            Filter Operator
              predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id
= '0064000000CZbnWAAT')) (type: boolean)
              Statistics: Num rows: 188939 Data size: 117528593 Basic stats: COMPLETE Column
stats: NONE
              Reduce Output Operator
                key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type:
string)
                sort order: ++
                Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id
(type: string)
                Statistics: Num rows: 188939 Data size: 117528593 Basic stats: COMPLETE Column
stats: NONE
          TableScan
            alias: forecast
            Statistics: Num rows: 29923099 Data size: 7723657280 Basic stats: COMPLETE Column
stats: NONE
            Filter Operator
              predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id
= '0064000000CZbnWAAT')) (type: boolean)
              Statistics: Num rows: 3740387 Data size: 965457063 Basic stats: COMPLETE Column
stats: NONE
              Reduce Output Operator
                key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type:
string)
                sort order: ++
                Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id
(type: string)
                Statistics: Num rows: 3740387 Data size: 965457063 Basic stats: COMPLETE Column
stats: NONE
          TableScan
            alias: split
            Statistics: Num rows: 2072636 Data size: 524862652 Basic stats: COMPLETE Column
stats: NONE
            Filter Operator
              predicate: ((oppty_id is not null and oppty_line_id is not null) and (oppty_id
= '0064000000CZbnWAAT')) (type: boolean)
              Statistics: Num rows: 259079 Data size: 65607704 Basic stats: COMPLETE Column
stats: NONE
              Reduce Output Operator
                key expressions: '0064000000CZbnWAAT' (type: string), oppty_line_id (type:
string)
                sort order: ++
                Map-reduce partition columns: '0064000000CZbnWAAT' (type: string), oppty_line_id
(type: string)
                Statistics: Num rows: 259079 Data size: 65607704 Basic stats: COMPLETE Column
stats: NONE
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
               Inner Join 0 to 2
               Inner Join 0 to 3
          condition expressions:
            0
            1
            2
            3
          Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column
stats: NONE
          Select Operator
            expressions: '0064000000CZbnWAAT' (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column
stats: NONE
            Group By Operator
              aggregations: count()
              keys: _col0 (type: string)
              mode: hash
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column
stats: NONE
              File Output Operator
                compressed: true
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string)
              sort order: +
              Map-reduce partition columns: _col0 (type: string)
              Statistics: Num rows: 12343277 Data size: 3186008376 Basic stats: COMPLETE Column
stats: NONE
              value expressions: _col1 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column
stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: bigint)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column
stats: NONE
            File Output Operator
              compressed: true
              Statistics: Num rows: 6171638 Data size: 1593004058 Basic stats: COMPLETE Column
stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

> Queries using join and group by produce incorrect output when hive.auto.convert.join=false
and hive.optimize.reducededuplication=true
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-11132
>                 URL: https://issues.apache.org/jira/browse/HIVE-11132
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: Rich Haase
>
> Queries using join and group by produce multiple output rows with the same key when hive.auto.convert.join=false
and hive.optimize.reducededuplication=true.  This interaction between configuration parameters
is unexpected and should be well documented at the very least and should likely be considered
a bug.
> e.g. 
> hive> set hive.auto.convert.join = false;
> hive> set hive.optimize.reducededuplication = true;
> hive> SELECT foo.id, count(*) as factor
>     > FROM foo
>     > JOIN bar ON (foo.id = bar.id and foo.line_id = bar.line_id)
>     > JOIN split ON (foo.id = split.id and foo.line_id = split.line_id)
>     > JOIN forecast ON (foo.id = forecast.id AND foo.line_id = forecast.line_id)
>     > WHERE foo.order != ‘blah’ AND foo.id = ‘XYZ'
>     > GROUP BY foo.id;
> XYZ         79
> XYZ		74
> XYZ		297
> XYZ		66
> hive> set hive.auto.convert.join = true;
> hive> set hive.optimize.reducededuplication = true;
> hive> SELECT foo.id, count(*) as factor
>     > FROM foo
>     > JOIN bar ON (foo.id = bar.id and foo.line_id = bar.line_id)
>     > JOIN split ON (foo.id = split.id and foo.line_id = split.line_id)
>     > JOIN forecast ON (foo.id = forecast.id AND foo.line_id = forecast.line_id)
>     > WHERE foo.order != ‘blah’ AND foo.id = ‘XYZ'
>     > GROUP BY foo.id;
> XYZ         516



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message