Date: Thu, 29 Dec 2016 19:17:58 +0000 (UTC)
From: "Jesus Camacho Rodriguez (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-15493) Wrong result for LEFT outer join in Tez using MapJoinOperator

    [ https://issues.apache.org/jira/browse/HIVE-15493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15785916#comment-15785916 ]

Jesus Camacho Rodriguez commented on HIVE-15493:
------------------------------------------------

[~pxiong], the basic idea behind that code
(HIVE-13191 was just an extension) was that if several values in one of the join inputs are equal, we do not need to put them in the RS operator multiple times (and thus shuffle them). Instead, we can put the value in the RS operator only once, and the join will then read that same value _x_ times to produce the correct output. In fact, the code did not work properly until HIVE-10582 went in.

The problem is that MapJoinOperator does not handle duplicate values properly for left outer joins: the row container assumes that the join output columns are the same as the input columns. I have not had the chance to check that code in detail. Until then, this fix avoids incorrect results by not reusing the value in the RS, so the value is produced multiple times.

I will add a TODO to the code and, as I said in the issue description, I will create a follow-up issue to tackle the root cause of the problem.

> Wrong result for LEFT outer join in Tez using MapJoinOperator
> -------------------------------------------------------------
>
>                 Key: HIVE-15493
>                 URL: https://issues.apache.org/jira/browse/HIVE-15493
>             Project: Hive
>          Issue Type: Bug
>   Affects Versions: 2.2.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Critical
>         Attachments: HIVE-15493.01.patch, HIVE-15493.patch
>
>
> To reproduce, we can run in Tez:
> {code:sql}
> set hive.auto.convert.join=true;
> DROP TABLE IF EXISTS test_1;
> CREATE TABLE test_1
> (
>   member BIGINT
>   , age VARCHAR (100)
> )
> STORED AS TEXTFILE
> ;
> DROP TABLE IF EXISTS test_2;
> CREATE TABLE test_2
> (
>   member BIGINT
> )
> STORED AS TEXTFILE
> ;
> INSERT INTO test_1 VALUES (1, '20'), (2, '30'), (3, '40');
> INSERT INTO test_2 VALUES (1), (2), (3);
> SELECT
>   t2.member
>   , t1.age_1
>   , t1.age_2
> FROM
>   test_2 t2
>   LEFT JOIN (
>     SELECT
>       member
>       , age as age_1
>       , age as age_2
>     FROM
>       test_1
>   ) t1
>   ON t2.member = t1.member
> ;
> {code}
> Result is:
> {noformat}
> 1 20 NULL
> 3 40 NULL
> 2 30 NULL
> {noformat}
> Correct result is:
> {noformat}
> 1 20 20
> 3 40 40
> 2 30 30
> {noformat}
> Bug was introduced by HIVE-10582. Though the fix in HIVE-10582 does not contain tests, it does look legit. In fact, the problem seems to be in the MapJoinOperator itself. It only happens for LEFT outer join (not with RIGHT outer or FULL outer). Although I am still trying to understand part of the MapJoinOperator code path, the bug could be in the initialization of the operator. It only happens when we have duplicate values in the right part of the output.
> Until we have more time to study the problem in detail and fix the MapJoinOperator, I will submit a fix that removes the code in SemanticAnalyzer that reuses duplicated value expressions from RS to create multiple columns in the join output (this is equivalent to reverting HIVE-10582).
> Once this is pushed, I will create a follow-up issue to bring this code back and tackle the problem in the MapJoinOperator.
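
The value-reuse idea described in the comment can be pictured with a small toy model. This is only a sketch and is not Hive's actual ReduceSinkOperator or MapJoinOperator code: the function names (dedup_value_columns, expand_row, expand_row_assuming_same_width) are hypothetical, and the faulty variant is just one assumed way in which treating the output columns as identical to the input columns could leave the second column NULL, as in the reproduction.

```python
# Toy model only: none of these names exist in Hive; they are illustrative assumptions.


def dedup_value_columns(exprs):
    """Keep each distinct value expression once.

    Returns the unique expressions plus, for every output column,
    the position of its value among the unique expressions.
    """
    unique = []
    positions = {}
    index_map = []
    for expr in exprs:
        if expr not in positions:
            positions[expr] = len(unique)
            unique.append(expr)
        index_map.append(positions[expr])
    return unique, index_map


def expand_row(shuffled_row, index_map):
    """Rebuild the full output row, reading a shuffled value as often as needed."""
    return [shuffled_row[i] for i in index_map]


def expand_row_assuming_same_width(shuffled_row, output_width):
    """Assumed faulty variant: treats output columns as a one-to-one copy of the input."""
    padded = list(shuffled_row) + [None] * (output_width - len(shuffled_row))
    return padded[:output_width]


# Right side of the join in the reproduction: age AS age_1, age AS age_2.
unique, index_map = dedup_value_columns(["age", "age"])
row = ["20"]  # only one copy of the value travels through the shuffle

correct = expand_row(row, index_map)
faulty = expand_row_assuming_same_width(row, len(index_map))
print(correct, faulty)
```

In this model the index map is what lets a single shuffled value feed both age_1 and age_2. The fix described in the comment corresponds to skipping the deduplication step, so every output column has its own shuffled value and no mapping is required.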