tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rajesh Balamohan (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (TEZ-4087) Shuffle: Check for thread's liveliness regularly to avoid infinite wait in merger & referee threads
Date Thu, 10 Oct 2019 03:45:00 GMT

    [ https://issues.apache.org/jira/browse/TEZ-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948176#comment-16948176
] 

Rajesh Balamohan edited comment on TEZ-4087 at 10/10/19 3:44 AM:
-----------------------------------------------------------------

Got more details on the logs. Pasting a snippet for reference.

{noformat}

2019-10-08T18:59:56,420 ERROR [TezTR-474717_36_29692_1_60_0 (1570366474717_0036_29692_01_000060_0)]
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
        at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:147)
        at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:107)
        at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:141)
        at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:136)
        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:122)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:110)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

2019-10-08T18:59:56,421 WARN  [TezTR-474717_36_29692_1_60_0 (1570366474717_0036_29692_01_000060_0)]
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler: Interrupted
while shutting down referee. Ignoring and continuing shutdown
2019-10-08T18:59:56,421 WARN  [TezTR-474717_36_29692_1_60_0 (1570366474717_0036_29692_01_000060_0)]
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput: Reducer 3: Attempting to
close output Reducer 3 of type OrderedPartitionedKVOutput before it was started. Generating
empty events
{noformat}


https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleScheduler.java#L489

Thread interruption in the above code prevents further cleanup of merger threads in https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Shuffle.java#L390

It would be good to move `cleanupMerger` in the finally block, so that it gets a chance to
clear up the merge threads. I will rename the jira heading accordingly.



was (Author: rajesh.balamohan):
Got more details on the logs. Pasting a snippet for reference.

{noformat}

2019-10-08T18:59:56,420 ERROR [TezTR-474717_36_29692_1_60_0 (1570366474717_0036_29692_01_000060_0)]
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
        at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:147)
        at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:107)
        at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:141)
        at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:136)
        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:122)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:110)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

2019-10-08T18:59:56,421 WARN  [TezTR-474717_36_29692_1_60_0 (1570366474717_0036_29692_01_000060_0)]
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler: Interrupted
while shutting down referee. Ignoring and continuing shutdown
2019-10-08T18:59:56,421 WARN  [TezTR-474717_36_29692_1_60_0 (1570366474717_0036_29692_01_000060_0)]
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput: Reducer 3: Attempting to
close output Reducer 3 of type OrderedPartitionedKVOutput before it was started. Generating
empty events
{noformat}


https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleScheduler.java#L489

Thread interruption in the above code prevents further cleanup of merger threads in https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Shuffle.java#L390

It would be good to move `cleanupMerger` in the finally block, so that it gets a chance to
clear up the merge threads. 


> Shuffle: Check for thread's liveliness regularly to avoid infinite wait in merger &
referee threads
> ---------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4087
>                 URL: https://issues.apache.org/jira/browse/TEZ-4087
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>
> In certain cases, Shuffle's cleanupIgnoreErrors() is not called. This leaves 4 threads
(inmem, diskmerger, Referee, ShuffleAndMergeRunner) run forever.
> When these are run in long running processes (e.g LLAP in Hive), they reach the thread
limits over time.
> Note: Root cause why cleanupIgnoreErrors() is not invoked is not yet known. I will share
the details when i get more details on this. Creating this ticket to add additional safety
knobs to ensure that thread leaks do not happen.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message