hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Amareshwari Sriramadasu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-741) NULL is not handled correctly in join
Date Tue, 17 Aug 2010 10:15:17 GMT

    [ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899365#action_12899365

Amareshwari Sriramadasu commented on HIVE-741:

Thanks Ning for your comments.

bq.  select * FROM myinput1 a left outer JOIN myinput1 b ON a.value = b.value;
bq.  select * FROM myinput1 a right outer JOIN myinput1 b ON a.value = b.value;
This is happening because I'm assuming nr.get(0) in JoinOperator is the join-key. It seems
it not always true that key is the first element in the ArrayList. When I modified a the code
to the following, above queries are giving correct results.
      StructObjectInspector soi = (StructObjectInspector) inputObjInspectors[tag];
      StructField sf = soi.getStructFieldRef(Utilities.ReduceField.KEY
      Object keyObject = soi.getStructFieldData(row, sf);
      if (SerDeUtils.isNullObject(keyObject, soi)) {
Added method SerDeUtils.isNullObject(keyObject, soi) to know if the object passed is representing
a NULL object.

bq. select * FROM myinput1 a left outer JOIN myinput1 b right outer join myinput1 c ON a.value
= b.value and b.value = c.value;
Looking at Stage-1 of "explain" for the above query:
Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
            alias: a
            Reduce Output Operator
              sort order:
              tag: 0
              value expressions:
                    expr: key
                    type: int
                    expr: value
                    type: int
            alias: b
            Reduce Output Operator
              sort order:
              tag: 1
              value expressions:
                    expr: key
                    type: int
                    expr: value
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col4, _col5
          Filter Operator
                expr: (_col1 = _col5)
                type: boolean
            File Output Operator
              compressed: false
              GlobalTableId: 0
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Join happens without join key?  Here, join output is the Cartesian product of a and b on which
FilterOperator is applied, Am I right? I see the semantics of inner/outer join on two tables
without join condition is to produce Cartesian product. As a side note: "MySql does not allow
outer joins without join condition". 
If Join is allowed without join condition to produce Cartesian product of the two tables,
then my patch should be changed to consider if join-key is defined for the join or not.  I
could reproduce it by simple query "select * FROM myinput1 a JOIN myinput1 b". I think the
same applies to MapJoin as well. 

bq. Verified that SMBMapJoinOperator already filters nulls properly. 
bq. Can you also add one or few tests for sort merge join? 
It seems my verification was wrong here, I thought if the table is sorted and hive.optimize.bucketmapjoin,
hive.optimize.bucketmapjoin.sortedmerge are set to true, MapJoin uses SMBMapJoinOperator.
But it was using MapJoinOperator it self. When I created a table with "sorted by" column,
I see it using SMBMapJoinOperator. Currently if there are any nulls in the input table, SMBJoin
fails with NullPointerException:
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.io.IntWritable.compareTo(IntWritable.java:60)
	at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:115)
	at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.compareKeys(SMBMapJoinOperator.java:389)
	at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processKey(SMBMapJoinOperator.java:438)
	at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processOp(SMBMapJoinOperator.java:205)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:698)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:45)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458)
	at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.fetchOneRow(SMBMapJoinOperator.java:479)
	... 17 more
Will look into this. 

bq. For inner, left and right outer joins, a simpler fix would be to add a filter on top.

Now, I agree it would be simpler :). Will consider this also and see if i can do some special
handling for full outer joins.

> NULL is not handled correctly in join
> -------------------------------------
>                 Key: HIVE-741
>                 URL: https://issues.apache.org/jira/browse/HIVE-741
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Amareshwari Sriramadasu
>         Attachments: patch-741.txt
> With the following data in table input4_cb:
> Key        Value
> ------       --------
> NULL     325
> 18          NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL    325    18   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message