tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From László Bodor (Jira) <j...@apache.org>
Subject [jira] [Comment Edited] (TEZ-4277) AsyncDispatcher can hang on serviceStop if the eventhandling thread is in BLOCKED state
Date Fri, 29 Jan 2021 16:29:00 GMT

    [ https://issues.apache.org/jira/browse/TEZ-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274553#comment-17274553
] 

László Bodor edited comment on TEZ-4277 at 1/29/21, 4:28 PM:
-------------------------------------------------------------

https://github.com/apache/tez/pull/97
cc: [~ashutoshc]


was (Author: abstractdog):
https://github.com/apache/tez/pull/97

> AsyncDispatcher can hang on serviceStop if the eventhandling thread is in BLOCKED state
> ---------------------------------------------------------------------------------------
>
>                 Key: TEZ-4277
>                 URL: https://issues.apache.org/jira/browse/TEZ-4277
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.1
>
>
> While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. The symptom
was a BLOCKED state event-handling thread. This block is most likely caused by the dispatcher
itself, so instead of making yet another sync hack in the dispatcher, or even DrainDispatcher
(however they have a common codebase for serviceStop at the moment), I'm about to put a safety
check, which solved my issue.
> jstack of a hang (it hangs [here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
> {code}
> "Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in Object.wait()
[0x000070000997b000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1252)
> - locked <0x00000007ae967340> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1326)
> at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae642cc8> (a java.lang.Object)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
> at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
> at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
> - locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae53d5f0> (a java.lang.Object)
> at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
> at org.apache.tez.client.TezClient.stop(TezClient.java:775)
> - locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
> at org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {code}
> during the investigation I added some log messages, that lead to this solution:
> 1. there is not a single event in the queue
> 2. event handling thread is in BLOCKED state
> 3. AM is already shut down, but test hangs as LocalClient.stop waits to AM services to
be stopped
> {code}
> 2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148))
- AsyncDispatcher is draining to stop, ignoring any new events.
> ...
> 2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> ...
> 2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1002))
- Calling stop for all the services
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1013))
- Exiting DAGAppMaster..GoodBye!
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(153))
- Cannot wait for drain events anymore, logging current events if any...
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:logCurrentEvents(186))
- Current events in the queue: 0
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(168))
- interrupting event handling thread, state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(170))
- interrupted event handling thread, state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(172))
- joining event handling thread, state: BLOCKED
> {code}
> under normal circumstances the event handling thread is in WAITING state (which I think
is because of queue.take)
> + I also included the timeout logic of YARN-3999 here



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message