Return-Path:
X-Original-To: apmail-hive-dev-archive@www.apache.org
Delivered-To: apmail-hive-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C5FDD100CF for ; Fri, 6 Dec 2013 06:13:54 +0000 (UTC)
Received: (qmail 35770 invoked by uid 500); 6 Dec 2013 04:23:41 -0000
Delivered-To: apmail-hive-dev-archive@hive.apache.org
Received: (qmail 35372 invoked by uid 500); 6 Dec 2013 04:23:38 -0000
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@hive.apache.org
Delivered-To: mailing list dev@hive.apache.org
Received: (qmail 35310 invoked by uid 500); 6 Dec 2013 04:23:38 -0000
Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org
Received: (qmail 35297 invoked by uid 99); 6 Dec 2013 04:23:37 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 06 Dec 2013 04:23:37 +0000
Date: Fri, 6 Dec 2013 04:23:37 +0000 (UTC)
From: "Vikram Dixit K (JIRA)"
To: hive-dev@hadoop.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (HIVE-5973) SMB joins produce incorrect results with multiple partitions and buckets
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[ https://issues.apache.org/jira/browse/HIVE-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840956#comment-13840956 ]

Vikram Dixit K commented on HIVE-5973:
--------------------------------------

The naive fix is to have

{noformat}
output = new Object[eval.length];
try {
  for (; i < eval.length; ++i) {
    output[i] = eval[i].evaluate(row);
  }
}
{noformat}

in the select operator's processOp. However, this affects all the other operations as well, possibly leading to memory churn. All other approaches I could think of seem cumbersome.

1. 
Copying the object with the copyToStandardObject method in ObjectInspectorUtils modifies the object itself and requires re-initializing the joinKeys (ExprNodeEvaluators) with the new object inspector. However, this does not work with just these changes, because we cannot re-initialize an ExprNodeEvaluator with a StandardObjectInspector. It expects a StructObjectInspector, which would have to be reworked if we go with this approach.
2. Try to create a new object of the same composition with a shallow copy. However, this is not straightforward either. It requires the struct object inspector to be reworked to return an object in the same composition as the original.
3. Special-case SMB with an if in the select operator to create a new output object. This would hurt vectorization, though, because it adds an if condition in the tight loop.
4. Create a new select operator for SMB join that extends the current select operator. This operator could use the naive solution above without imposing the memory penalty on the other operations. However, this requires some plan-side changes.

I am not sure if I have missed any other way of solving this. [~navis] Could you please provide your comments?

Thanks
Vikram.

> SMB joins produce incorrect results with multiple partitions and buckets
> ------------------------------------------------------------------------
>
> Key: HIVE-5973
> URL: https://issues.apache.org/jira/browse/HIVE-5973
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.13.0
> Reporter: Vikram Dixit K
> Assignee: Vikram Dixit K
> Fix For: 0.13.0
>
>
> It looks like there is an issue with re-using the output object array in the select operator. When we read rows of the non-big tables, we hold on to the output object in the priority queue. This causes Hive to produce incorrect results because all the elements in the priority queue refer to the same object and the join happens on only one of the buckets.
> {noformat}
> output[i] = eval[i].evaluate(row);
> {noformat}

--
This message was sent by Atlassian JIRA (v6.1#6144)
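
The aliasing problem described in the issue, and the effect of the naive fix, can be reproduced outside Hive with a small standalone sketch. Everything below is hypothetical and only illustrates the mechanism: the Projector class stands in for the select operator, a plain array copy stands in for eval[i].evaluate(row), and the class and method names are not Hive APIs.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

final class SmbOutputReuseSketch {

    /** Hypothetical stand-in for the select operator; not a Hive class. */
    static final class Projector {
        private final boolean reuseOutput;
        private Object[] output;

        Projector(boolean reuseOutput, int width) {
            this.reuseOutput = reuseOutput;
            this.output = new Object[width];
        }

        Object[] project(Object[] row) {
            if (!reuseOutput) {
                // The naive fix from the comment: allocate a fresh array per row.
                output = new Object[row.length];
            }
            for (int i = 0; i < row.length; ++i) {
                output[i] = row[i]; // stands in for eval[i].evaluate(row)
            }
            return output;
        }
    }

    /** Buffers three projected rows in a priority queue and counts distinct join keys. */
    static int distinctBufferedKeys(boolean reuseOutput) {
        Object[][] rows = { {1, "a"}, {2, "b"}, {3, "c"} };
        Projector projector = new Projector(reuseOutput, 2);
        PriorityQueue<Object[]> queue =
            new PriorityQueue<>(Comparator.comparingInt((Object[] r) -> (Integer) r[0]));
        for (Object[] row : rows) {
            // The queue keeps the reference it is handed; it does not copy the array.
            queue.add(projector.project(row));
        }
        Set<Object> keys = new HashSet<>();
        for (Object[] buffered : queue) {
            keys.add(buffered[0]);
        }
        return keys.size();
    }

    public static void main(String[] args) {
        System.out.println("reused output array: " + distinctBufferedKeys(true) + " distinct key(s)");
        System.out.println("fresh output array:  " + distinctBufferedKeys(false) + " distinct key(s)");
    }
}
```

With the reused array, all three queue entries point at the same object, so only the last row's key survives (1 distinct key). With a fresh array per row, the three keys stay distinct, at the cost of one allocation per row, which is the memory churn the comment is concerned about.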