Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Fri, 6 Dec 2013 04:23:37 +0000 (UTC)
From: "Vikram Dixit K (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: <JIRA.12683050.1386301372558.83049.1386303817634@arcas>
In-Reply-To: <JIRA.12683050.1386301372558@arcas>
References: <JIRA.12683050.1386301372558@arcas>
Subject: [jira] [Commented] (HIVE-5973) SMB joins produce incorrect results
 with multiple partitions and buckets
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HIVE-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840956#comment-13840956 ] 

Vikram Dixit K commented on HIVE-5973:
--------------------------------------

The naive fix is to have
{noformat}
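// Allocate a fresh output array for every row instead of reusing one.
output = new Object[eval.length];
int i = 0;
try {
  for (; i < eval.length; ++i) {
    output[i] = eval[i].evaluate(row);
  }
}
// (rest of processOp unchanged)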
{noformat}

in the select operator processOp.
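
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not Hive code: the class OutputReuseDemo and the method bufferRows are hypothetical names, and a plain java.util.PriorityQueue stands in for the queue that holds rows of the non-big tables. It only illustrates that buffering a reused array leaves every queue entry pointing at the last row written, while a fresh array per row keeps the rows distinct.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class OutputReuseDemo {

    /**
     * Buffers one single-column "row" per key in a priority queue and then
     * drains the queue. With reuseOutput=true, the same array is written and
     * enqueued for every key, mimicking a reused operator output array.
     */
    static List<Integer> bufferRows(int[] keys, boolean reuseOutput) {
        PriorityQueue<Object[]> queue =
            new PriorityQueue<>(Comparator.comparingInt((Object[] r) -> (Integer) r[0]));

        Object[] output = new Object[1];
        for (int key : keys) {
            if (!reuseOutput) {
                // The naive fix: a new array per row.
                output = new Object[1];
            }
            output[0] = key;
            queue.add(output); // the queue keeps a reference, not a copy
        }

        List<Integer> drained = new ArrayList<>();
        while (!queue.isEmpty()) {
            drained.add((Integer) queue.poll()[0]);
        }
        return drained;
    }

    public static void main(String[] args) {
        System.out.println(bufferRows(new int[] {3, 1, 2}, true));  // [2, 2, 2]
        System.out.println(bufferRows(new int[] {3, 1, 2}, false)); // [1, 2, 3]
    }
}
```

In the reused case all three queue entries are the same array, so only the last key written is ever seen. The cost of the fix is also visible here: one extra allocation per row.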

However, the naive fix allocates a new array per row for every query that goes through the select operator, not just SMB joins, which could lead to memory churn. All the other approaches I could think of seem cumbersome:

1. Copy the object using the copyToStandardObject method in ObjectInspectorUtils. This modifies the object itself and requires re-initializing the joinKeys (ExprNodeEvaluators) with the new object inspector. However, this does not work with just these changes, because we cannot re-initialize an ExprNodeEvaluator with a StandardObjectInspector. It expects a StructObjectInspector, which would have to be reworked if we go with this approach.

2. Try to create a new object of the same composition with a shallow copy. However, this is not straightforward either. It requires the struct object inspector to be reworked to return an object in the same composition as the original.

3. Special-case SMB with an if in the select operator to create a new output object. This would hurt vectorization, though, because it adds an if condition in the tight loop.

4. Create a new select operator for SMB join which extends the current select operator. The new operator could use the naive solution above without imposing the memory penalty on the other operations. However, this requires some plan-side changes.
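
As a rough illustration of option 4, the sketch below uses hypothetical stand-in classes (SelectSketch, ReusingSelect, SmbSelect, and the outputFor hook are all invented for this example and are not Hive's operator API). It assumes the base operator can expose a small hook for obtaining the output array, so that only the SMB-specific subclass pays for a fresh allocation per row.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class SelectSketch {

    /** Hypothetical stand-in for the current select operator: reuses one array. */
    static class ReusingSelect {
        private final List<Function<Object, Object>> eval;
        private final Object[] shared;

        ReusingSelect(List<Function<Object, Object>> eval) {
            this.eval = eval;
            this.shared = new Object[eval.size()];
        }

        /** Number of output columns. */
        protected int width() {
            return eval.size();
        }

        /** Hook: the array to fill for the next row. The default reuses one array. */
        protected Object[] outputFor() {
            return shared;
        }

        /** Evaluates every expression against the row and returns the output. */
        final Object[] process(Object row) {
            Object[] output = outputFor();
            for (int i = 0; i < eval.size(); ++i) {
                output[i] = eval.get(i).apply(row);
            }
            return output;
        }
    }

    /** Hypothetical SMB-only variant: a fresh array per row (the naive fix). */
    static class SmbSelect extends ReusingSelect {
        SmbSelect(List<Function<Object, Object>> eval) {
            super(eval);
        }

        @Override
        protected Object[] outputFor() {
            return new Object[width()];
        }
    }

    public static void main(String[] args) {
        Function<Object, Object> identity = row -> row;
        List<Function<Object, Object>> eval = Collections.singletonList(identity);

        ReusingSelect plain = new ReusingSelect(eval);
        ReusingSelect smb = new SmbSelect(eval);

        System.out.println(plain.process("a") == plain.process("b")); // true
        System.out.println(smb.process("a") == smb.process("b"));     // false
    }
}
```

In this sketch the choice between reuse and allocation is made when the operator class is picked, which is where the plan-side changes would come in, rather than through an explicit if in the per-row loop as in option 3.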

I am not sure if I have missed any other way of solving this. [~navis] Could you please provide your comments?

Thanks
Vikram.


> SMB joins produce incorrect results with multiple partitions and buckets
> ------------------------------------------------------------------------
>
>                 Key: HIVE-5973
>                 URL: https://issues.apache.org/jira/browse/HIVE-5973
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Vikram Dixit K
>            Assignee: Vikram Dixit K
>             Fix For: 0.13.0
>
>
> It looks like there is an issue with re-using the output object array in the select operator. When we read rows of the non-big tables, we hold on to the output object in the priority queue. This causes Hive to produce incorrect results, because all the elements in the priority queue refer to the same object and the join happens on only one of the buckets.
> {noformat}
> output[i] = eval[i].evaluate(row);
> {noformat}

