From: "ASF GitHub Bot (Jira)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Date: Fri, 28 Aug 2020 12:19:00 +0000 (UTC)
Subject: [jira] [Work logged] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

    [ https://issues.apache.org/jira/browse/HIVE-18284?focusedWorklogId=475813&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475813 ]

ASF GitHub Bot logged work on HIVE-18284:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Aug/20 12:18
            Start Date: 28/Aug/20 12:18
    Worklog Time Spent: 10m
      Work Description: kgyrtkirk commented on a change in pull request #1400:
URL: https://github.com/apache/hive/pull/1400#discussion_r479216917


##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplicationUtils.java
##########
@@ -181,6 +183,23 @@ public static boolean merge(HiveConf hiveConf, ReduceSinkOperator cRS, ReduceSin
       TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(new ArrayList<FieldSchema>(), pRS
           .getConf().getOrder(), pRS.getConf().getNullOrder());
       pRS.getConf().setKeySerializeInfo(keyTable);
+    } else if (cRS.getConf().getKeyCols() != null && cRS.getConf().getKeyCols().size() > 0) {

Review comment:
       don't we need any conditional on `pRS` here?
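As a quick illustration of the question above, here is a minimal, hypothetical sketch of what a symmetric guard on `pRS` could look like. The operator and accessor names are taken from the diff; the helper class itself is assumed for illustration and is not part of the actual patch.

{code:java}
import java.util.List;

import org.apache.hadoop.hive.ql.exec.ReduceSinkOperator;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;

// Hypothetical helper, for illustration only (not from the PR under review).
final class KeyColsGuardSketch {

  // True when the ReduceSink carries at least one key column.
  static boolean hasKeyCols(ReduceSinkOperator rs) {
    List<ExprNodeDesc> keyCols = rs.getConf().getKeyCols();
    return keyCols != null && !keyCols.isEmpty();
  }

  // The branch in question would then require key columns on both operators,
  // i.e. the child's keys are copied only when the parent also has keys.
  static boolean branchApplies(ReduceSinkOperator cRS, ReduceSinkOperator pRS) {
    return hasKeyCols(cRS) && hasKeyCols(pRS);
  }
}
{code}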
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplicationUtils.java
##########
@@ -181,6 +183,23 @@ public static boolean merge(HiveConf hiveConf, ReduceSinkOperator cRS, ReduceSin
       TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(new ArrayList<FieldSchema>(), pRS
           .getConf().getOrder(), pRS.getConf().getNullOrder());
       pRS.getConf().setKeySerializeInfo(keyTable);
+    } else if (cRS.getConf().getKeyCols() != null && cRS.getConf().getKeyCols().size() > 0) {
+      ArrayList<String> keyColNames = Lists.newArrayList();
+      for (ExprNodeDesc keyCol : pRS.getConf().getKeyCols()) {
+        String keyColName = keyCol.getExprString();
+        keyColNames.add(keyColName);
+      }
+      List<FieldSchema> fields = PlanUtils.getFieldSchemasFromColumnList(pRS.getConf().getKeyCols(),
+          keyColNames, 0, "");
+      TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(fields, pRS.getConf().getOrder(),
+          pRS.getConf().getNullOrder());
+      ArrayList<String> outputKeyCols = Lists.newArrayList();
+      for (int i = 0; i < fields.size(); i++) {
+        outputKeyCols.add(fields.get(i).getName());
+      }
+      pRS.getConf().setOutputKeyColumnNames(outputKeyCols);
+      pRS.getConf().setKeySerializeInfo(keyTable);
+      pRS.getConf().setNumDistributionKeys(cRS.getConf().getNumDistributionKeys());
     }

Review comment:
       I think we should be merging the child into the parent inside this "if" - and we have 2 specific conditionals which are handled - so I think an else false here would be needed - to close down unhandled future cases


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 475813)
            Time Spent: 40m  (was: 0.5h)

> NPE when inserting data with 'distribute by' clause with dynpart sort optimization
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-18284
>                 URL: https://issues.apache.org/jira/browse/HIVE-18284
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 2.3.1, 2.3.2
>            Reporter: Aki Tanaka
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> A NullPointerException occurs when inserting data with a 'distribute by' clause. The following snippet query reproduces this issue:
> *(non-vectorized, non-llap mode)*
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.sort.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey;
> {code}
> I can run the insert query without the error if I remove Distribute By or use a Cluster By clause instead.
> It seems that the issue happens because Distribute By does not guarantee clustering or sorting properties on the distributed keys: FileSinkOperator removes the previous fsp, which might still need to be re-used when we use Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1513111717879_0056_1_01_000000_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>         at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>         at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>         at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>         at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>         at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>         at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>         at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
>         at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>         ... 14 more
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>         at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
>         ... 17 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
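On the second review comment above (adding an explicit "else false" so unhandled cases cannot fall through), here is a minimal sketch of the suggested control-flow shape. It assumes the two handled conditionals are the ones visible in the diff; the helper class and its boolean parameters are hypothetical and only illustrate the structure, not the actual Hive change.

{code:java}
// Hypothetical illustration of the "else return false" shape the reviewer suggests.
// The two booleans stand in for the two handled conditionals in merge(); they are
// assumptions made for this sketch, not names from the real code.
final class MergeShapeSketch {

  static boolean mergeKeyLayout(boolean parentHasEmptyKeyCols, boolean childHasKeyCols) {
    if (parentHasEmptyKeyCols) {
      // existing case: serialize an empty key TableDesc onto the parent ReduceSink
      return true;
    } else if (childHasKeyCols) {
      // newly added case: copy the child's key layout onto the parent ReduceSink
      return true;
    } else {
      // the reviewer's suggestion: refuse to merge any future, unhandled combination
      // instead of silently falling through
      return false;
    }
  }
}
{code}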