Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Wed, 27 Nov 2013 02:48:35 +0000 (UTC)
From: "Sun Rui (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: <JIRA.12681234.1385460645376.35761.1385520515327@arcas>
In-Reply-To: <JIRA.12681234.1385460645376@arcas>
References: <JIRA.12681234.1385460645376@arcas>
Subject: [jira] [Commented] (HIVE-5891) Alias conflict when merging multiple
 mapjoin tasks into their common child mapred task
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833386#comment-13833386 ] 

Sun Rui commented on HIVE-5891:
-------------------------------

Yin Huai  The following is the explain output for the merged task:
{panel}
  Stage: Stage-4
    Map Reduce
      Alias -> Map Operator Tree:
        null-subquery1:x-subquery1:$INTNAME 
            Map Join Operator
              condition map:
                   Inner Join 0 to 1
              condition expressions:
                0 
                1 {key}
              handleSkewJoin: false
              keys:
                0 [Column[_col0]]
                1 [Column[key]]
              outputColumnNames: _col1
              Position of Big Table: 0
              Select Operator
                expressions:
                      expr: _col1
                      type: int
                outputColumnNames: _col0
                Union
                  Select Operator
                    expressions:
                          expr: _col0
                          type: int
                    outputColumnNames: _col0
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        null-subquery2:x-subquery2:$INTNAME 
            Map Join Operator
              condition map:
                   Inner Join 0 to 1
              condition expressions:
                0 
                1 {key}
              handleSkewJoin: false
              keys:
                0 [Column[_col0]]
                1 [Column[key]]
              outputColumnNames: _col1
              Position of Big Table: 0
              Select Operator
                expressions:
                      expr: _col1
                      type: int
                outputColumnNames: _col0
                Union
                  Select Operator
                    expressions:
                          expr: _col0
                          type: int
                    outputColumnNames: _col0
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      Local Work:
        Map Reduce Local Work
{panel}

As you can see, MapJoins and the Union are to be executed in the same job. The aliases are changed from "$INTNAME" to "null-subquery1:x-subquery1:$INTNAME" and "null-subquery2:x-subquery2:$INTNAME" respectively.

I am thinking maybe we can change the intermediate alias name for a join stream from "$INTNAME" to "$JOIN_INTERMEDIATE". But which better name do you think for the case of DemuxOperator?

Ashutosh Chauhan : I will add the test case as a .q file when we finalize the patch:)

> Alias conflict when merging multiple mapjoin tasks into their common child mapred task
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-5891
>                 URL: https://issues.apache.org/jira/browse/HIVE-5891
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>         Attachments: HIVE-5891.1.patch
>
>
> Use the following test case with HIVE 0.12:
> {quote}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> select * from (
>   select c.key from
>     (select a.key from src a join src b on a.key=b.key group by a.key) tmp
>     join src c on tmp.key=c.key
>   union all
>   select c.key from
>     (select a.key from src a join src b on a.key=b.key group by a.key) tmp
>     join src c on tmp.key=c.key
> ) x;
> {quote}
> We will get a NullPointerException from Union Operator:
> {quote}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"_col0":0}
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"_col0":0}
> 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
> 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
> 	... 4 more
> Caused by: java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
> 	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
> 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
> 	at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
> 	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
> 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
> 	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
> 	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
> 	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
> 	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
> 	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
> 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
> 	at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
> 	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
> 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
> 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
> 	... 5 more
> {quote}
>   
> The root cause is in CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
>   +--------------+      +--------------+
>   | MapJoin task |      | MapJoin task |
>   +--------------+      +--------------+
>              \             /
>               \           /
>              +--------------+
>              |  Union task  |
>              +--------------+
>              
> CommonJoinTaskDispatcher merges the two MapJoin tasks into their common child: Union task. The two MapJoin tasks have the same alias name for their big tables: $INTNAME, which is the name of the temporary table of a join stream. The aliasToWork map uses alias as key, so eventually only the MapJoin operator tree of one MapJoin task is saved into the aliasToWork map of the Union task, while the MapJoin operator tree of another MapJoin task is lost. As a result, Union operator won't be initialized because not all of its parents gets intialized (The Union operator itself indicates it has two parents, but actually it has only 1 parent because another parent is lost).
> This issue does not exist in HIVE 0.11 and thus is a regression bug in HIVE 0.12.
> The propsed solution is to use the query ID as prefix for the join stream name to avoid conflict and add sanity check code in CommonJoinTaskDispatcher that merge of a MapJoin task into its child MapRed task is skipped if there is any alias conflict. Please review the patch. I am not sure if the patch properly handles the case of DemuxOperator.
> BTW, anyone knows the origin of "$INTNAME"? it is so confusing, maybe we can replace it with a meaningful name.


--
This message was sent by Atlassian JIRA
(v6.1#6144)