From: "ASF GitHub Bot (Jira)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Date: Fri, 28 Aug 2020 12:19:00 +0000 (UTC)
Subject: [jira] [Work logged] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

    [ https://issues.apache.org/jira/browse/HIVE-18284?focusedWorklogId=475813&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475813 ]

ASF GitHub Bot logged work on HIVE-18284:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Aug/20 12:18
            Start Date: 28/Aug/20 12:18
    Worklog Time Spent: 10m
      Work Description: kgyrtkirk commented on a change in pull request #1400:
URL: https://github.com/apache/hive/pull/1400#discussion_r479216917


##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplicationUtils.java
##########
@@ -181,6 +183,23 @@ public static boolean merge(HiveConf hiveConf, ReduceSinkOperator cRS, ReduceSin
       TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(new ArrayList<FieldSchema>(), pRS
           .getConf().getOrder(), pRS.getConf().getNullOrder());
       pRS.getConf().setKeySerializeInfo(keyTable);
+    } else if (cRS.getConf().getKeyCols() != null && cRS.getConf().getKeyCols().size() > 0) {

Review comment:
       don't we need any conditional on `pRS` here?
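As a quick illustration of the question above, here is a minimal, hypothetical sketch of what a symmetric guard on `pRS` could look like. The operator and accessor names are taken from the diff; the helper class itself is assumed for illustration and is not part of the actual patch.

{code:java}
import java.util.List;

import org.apache.hadoop.hive.ql.exec.ReduceSinkOperator;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;

// Hypothetical helper, for illustration only (not from the PR under review).
final class KeyColsGuardSketch {

  // True when the ReduceSink carries at least one key column.
  static boolean hasKeyCols(ReduceSinkOperator rs) {
    List<ExprNodeDesc> keyCols = rs.getConf().getKeyCols();
    return keyCols != null && !keyCols.isEmpty();
  }

  // The branch in question would then require key columns on both operators,
  // i.e. the child's keys are copied only when the parent also has keys.
  static boolean branchApplies(ReduceSinkOperator cRS, ReduceSinkOperator pRS) {
    return hasKeyCols(cRS) && hasKeyCols(pRS);
  }
}
{code}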
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplicationUtils.java
##########
@@ -181,6 +183,23 @@ public static boolean merge(HiveConf hiveConf, ReduceSinkOperator cRS, ReduceSin
       TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(new ArrayList<FieldSchema>(), pRS
           .getConf().getOrder(), pRS.getConf().getNullOrder());
       pRS.getConf().setKeySerializeInfo(keyTable);
+    } else if (cRS.getConf().getKeyCols() != null && cRS.getConf().getKeyCols().size() > 0) {
+      ArrayList<String> keyColNames = Lists.newArrayList();
+      for (ExprNodeDesc keyCol : pRS.getConf().getKeyCols()) {
+        String keyColName = keyCol.getExprString();
+        keyColNames.add(keyColName);
+      }
+      List<FieldSchema> fields = PlanUtils.getFieldSchemasFromColumnList(pRS.getConf().getKeyCols(),
+          keyColNames, 0, "");
+      TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(fields, pRS.getConf().getOrder(),
+          pRS.getConf().getNullOrder());
+      ArrayList<String> outputKeyCols = Lists.newArrayList();
+      for (int i = 0; i < fields.size(); i++) {
+        outputKeyCols.add(fields.get(i).getName());
+      }
+      pRS.getConf().setOutputKeyColumnNames(outputKeyCols);
+      pRS.getConf().setKeySerializeInfo(keyTable);
+      pRS.getConf().setNumDistributionKeys(cRS.getConf().getNumDistributionKeys());
     }

Review comment:
       I think we should be merging the child into the parent inside this "if" - and we have 2 specific conditionals which are handled - so I think an else false here would be needed - to close down unhandled future cases


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 475813)
            Time Spent: 40m  (was: 0.5h)

> NPE when inserting data with 'distribute by' clause with dynpart sort optimization
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-18284
>                 URL: https://issues.apache.org/jira/browse/HIVE-18284
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 2.3.1, 2.3.2
>            Reporter: Aki Tanaka
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> A NullPointerException occurs when inserting data with a 'distribute by' clause. The following snippet query reproduces this issue:
> *(non-vectorized, non-llap mode)*
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.sort.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey;
> {code}
> I can run the insert query without the error if I remove Distribute By or use a Cluster By clause instead.
> It seems that the issue happens because Distribute By does not guarantee clustering or sorting properties on the distributed keys: FileSinkOperator removes the previous fsp, which might still need to be re-used when we use Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1513111717879_0056_1_01_000000_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>         at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>         at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>         at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>         at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>         at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>         at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>         at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
>         at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
>         ... 14 more
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>         at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>         at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
>         ... 17 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
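On the second review comment above (adding an explicit "else false" so unhandled cases cannot fall through), here is a minimal sketch of the suggested control-flow shape. It assumes the two handled conditionals are the ones visible in the diff; the helper class and its boolean parameters are hypothetical and only illustrate the structure, not the actual Hive change.

{code:java}
// Hypothetical illustration of the "else return false" shape the reviewer suggests.
// The two booleans stand in for the two handled conditionals in merge(); they are
// assumptions made for this sketch, not names from the real code.
final class MergeShapeSketch {

  static boolean mergeKeyLayout(boolean parentHasEmptyKeyCols, boolean childHasKeyCols) {
    if (parentHasEmptyKeyCols) {
      // existing case: serialize an empty key TableDesc onto the parent ReduceSink
      return true;
    } else if (childHasKeyCols) {
      // newly added case: copy the child's key layout onto the parent ReduceSink
      return true;
    } else {
      // the reviewer's suggestion: refuse to merge any future, unhandled combination
      // instead of silently falling through
      return false;
    }
  }
}
{code}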