hive-dev mailing list archives

From "Vikram Dixit K (JIRA)" <>
Subject [jira] [Commented] (HIVE-5973) SMB joins produce incorrect results with multiple partitions and buckets
Date Fri, 06 Dec 2013 04:23:37 GMT


Vikram Dixit K commented on HIVE-5973:

The naive fix is to allocate a fresh output array for each row in the select operator's
processOp:

    output = new Object[eval.length];
    try {
      for (; i < eval.length; ++i) {
        output[i] = eval[i].evaluate(row);

However, this affects all the other operations as well, possibly leading to memory churn. All
the other approaches I could think of seem cumbersome.
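
To make the trade-off concrete, here is a minimal standalone sketch of the aliasing problem and of
the per-row allocation that avoids it. It is not Hive code: OutputReuseSketch, Eval and
retainedKeys are hypothetical names, and a plain java.util.PriorityQueue stands in for the queue
that holds rows of the non-big tables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class OutputReuseSketch {
    // Hypothetical stand-in for an expression evaluator: picks one value out of a row.
    interface Eval {
        Object evaluate(Object[] row);
    }

    // Evaluates every row and retains the result in a priority queue ordered by
    // the first column. With reuse == true, one output array is shared by all rows.
    static List<Object> retainedKeys(Object[][] rows, Eval[] eval, boolean reuse) {
        PriorityQueue<Object[]> queue =
            new PriorityQueue<>((a, b) -> ((Integer) a[0]).compareTo((Integer) b[0]));
        Object[] shared = new Object[eval.length];
        for (Object[] row : rows) {
            Object[] output = reuse ? shared : new Object[eval.length];
            for (int i = 0; i < eval.length; ++i) {
                output[i] = eval[i].evaluate(row);
            }
            queue.add(output); // the queue keeps a reference, not a copy
        }
        List<Object> keys = new ArrayList<>();
        while (!queue.isEmpty()) {
            keys.add(queue.poll()[0]);
        }
        return keys;
    }

    public static void main(String[] args) {
        Object[][] rows = { {3, "c"}, {1, "a"}, {2, "b"} };
        Eval[] eval = { row -> row[0], row -> row[1] };
        System.out.println(retainedKeys(rows, eval, true));  // [2, 2, 2]
        System.out.println(retainedKeys(rows, eval, false)); // [1, 2, 3]
    }
}
```

With the shared array, every queue entry points at the same object, so only the last row survives.
The fresh array fixes that at the cost of one allocation per row, which is the memory churn
mentioned above.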

1. Copy the object using the copyToStandardObject method in ObjectInspectorUtils. This modifies
the object itself and requires re-initializing the joinKeys (ExprNodeEvaluators) with the new
object inspector. However, this doesn't work with just these changes because we cannot re-initialize
an ExprNodeEvaluator with a StandardObjectInspector. It expects a StructObjectInspector, which
would have to be re-worked if we go with this approach.

2. Try to create a new object of the same composition with a shallow copy. However, this is
not straightforward either. It requires the struct object inspector to be re-worked to return
an object with the same composition as the original.

3. Special-case SMB with an if in the select operator to create a new output object. This
would hurt vectorization, though, because it adds an if condition in the tight loop.

4. Create a new select operator for SMB join that extends the current select operator. It
could use the naive solution above without imposing the memory penalty on the other operations.
However, this requires some plan-side changes.
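
As a rough illustration of option 4, the sketch below isolates the allocation in an overridable
hook, so that only an SMB-specific subclass pays for a new array per row. The classes are
hypothetical stand-ins (Select, SmbSelect, outputFor and forwarded are made-up names, not Hive's
operator API), and the plan-side work needed to pick the subclass is not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class SelectVariantsSketch {
    // Hypothetical stand-in for the existing select operator: reuses one
    // output array for every row, so there is no per-row allocation.
    static class Select {
        protected final int width;
        private final Object[] shared;
        // Stands in for a downstream consumer that holds on to forwarded rows.
        final List<Object[]> forwarded = new ArrayList<>();

        Select(int width) {
            this.width = width;
            this.shared = new Object[width];
        }

        // Allocation hook: the base operator hands back the shared array.
        protected Object[] outputFor() {
            return shared;
        }

        void processOp(Object[] row) {
            Object[] output = outputFor();
            for (int i = 0; i < width; ++i) {
                output[i] = row[i]; // identity projection keeps the sketch short
            }
            forwarded.add(output);
        }
    }

    // Hypothetical SMB-only variant: a fresh array per row.
    static class SmbSelect extends Select {
        SmbSelect(int width) {
            super(width);
        }

        @Override
        protected Object[] outputFor() {
            return new Object[width];
        }
    }

    public static void main(String[] args) {
        Select plain = new Select(1);
        Select smb = new SmbSelect(1);
        for (int key : new int[] {10, 20}) {
            plain.processOp(new Object[] {key});
            smb.processOp(new Object[] {key});
        }
        System.out.println(plain.forwarded.get(0)[0]); // 20: the first row was overwritten
        System.out.println(smb.forwarded.get(0)[0]);   // 10: each row kept its own array
    }
}
```

Unlike option 3, the base operator's loop carries no SMB-specific branch. The planner would have
to instantiate the subclass for SMB joins, which is the plan-side change noted in option 4.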

I am not sure whether I have missed any other way of solving this. [~navis] Could you please
provide your comments?


> SMB joins produce incorrect results with multiple partitions and buckets
> ------------------------------------------------------------------------
>                 Key: HIVE-5973
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Vikram Dixit K
>            Assignee: Vikram Dixit K
>             Fix For: 0.13.0
> It looks like there is an issue with re-using the output object array in the select operator.
> When we read rows of the non-big tables, we hold on to the output object in the priority queue.
> This causes Hive to produce incorrect results because all the elements in the priority queue
> refer to the same object and the join happens on only one of the buckets.
> {noformat}
> output[i] = eval[i].evaluate(row);
> {noformat}

This message was sent by Atlassian JIRA
