tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From László Bodor (Jira) <j...@apache.org>
Subject [jira] [Updated] (TEZ-4277) AsyncDispatcher can hang on serviceStop if the eventhandling thread is in BLOCKED state
Date Fri, 29 Jan 2021 16:05:00 GMT

     [ https://issues.apache.org/jira/browse/TEZ-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

László Bodor updated TEZ-4277:
------------------------------
    Description: 
While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. The symptom
was a BLOCKED state event-handling thread. This block is most likely caused by the dispatcher
itself, so instead of making yet another sync hack in the dispatcher, or even DrainDispatcher
(however they have a common codebase for serviceStop at the moment), I'm about to put a safety
check, which solved my issue.


jstack of a hang (it hangs [here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
{code}
"Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in Object.wait()
[0x000070000997b000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
- locked <0x00000007ae967340> (a java.lang.Thread)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
- locked <0x00000007ae642cc8> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
- locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
- locked <0x00000007ae53d5f0> (a java.lang.Object)
at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
at org.apache.tez.client.TezClient.stop(TezClient.java:775)
- locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
at org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}

during the investigation I added some log messages, that lead to this solution:

1. there is not a single event in the queue
2. event handling thread is in BLOCKED state
3. AM is already shut down, but test hangs as LocalClient.stop waits to AM services to be
stopped

{code}
2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148))
- AsyncDispatcher is draining to stop, ignoring any new events.
...
2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
...
2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1002))
- Calling stop for all the services
2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1013))
- Exiting DAGAppMaster..GoodBye!
2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(153))
- Cannot wait for drain events anymore, logging current events if any...
2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:logCurrentEvents(186))
- Current events in the queue: 0
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(168))
- interrupting event handling thread, state: BLOCKED
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(170))
- interrupted event handling thread, state: BLOCKED
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(172))
- joining event handling thread, state: BLOCKED
{code}

under normal circumstances the event handling thread is in WAITING state (which I think is
because of queue.take)

+ I also included the timeout logic of YARN-3999 here

  was:
While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. The symptom
was a BLOCKED state event-handling thread. This block is most likely caused by the dispatcher
itself, so instead of making yet another sync hack in the dispatcher, or even DrainDispatcher
(however they have a common codebase for serviceStop at the moment), I'm about to put a safety
check, which solved my issue.


jstack of a hang (it hangs [here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
{code}
"Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in Object.wait()
[0x000070000997b000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
- locked <0x00000007ae967340> (a java.lang.Thread)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
- locked <0x00000007ae642cc8> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
- locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
- locked <0x00000007ae53d5f0> (a java.lang.Object)
at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
at org.apache.tez.client.TezClient.stop(TezClient.java:775)
- locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
at org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}

during the investigation I added some log messages, that lead to this solution:

1. there is not a single event in the queue
2. event handling thread is in BLOCKED state
3. AM is already shut down, but test hangs as LocalClient.stop waits to AM services to be
stopped

{code}
2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148))
- AsyncDispatcher is draining to stop, ignoring any new events.
...
2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
...
2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1002))
- Calling stop for all the services
2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1013))
- Exiting DAGAppMaster..GoodBye!
2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(153))
- Cannot wait for drain events anymore, logging current events if any...
2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:logCurrentEvents(186))
- Current events in the queue: 0
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(168))
- interrupting event handling thread, state: BLOCKED
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(170))
- interrupted event handling thread, state: BLOCKED
2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(172))
- joining event handling thread, state: BLOCKED
{code}

under normal circumstances the event handling thread is in WAITING state (which I think is
because of queue.take)


> AsyncDispatcher can hang on serviceStop if the eventhandling thread is in BLOCKED state
> ---------------------------------------------------------------------------------------
>
>                 Key: TEZ-4277
>                 URL: https://issues.apache.org/jira/browse/TEZ-4277
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.1
>
>
> While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. The symptom
was a BLOCKED state event-handling thread. This block is most likely caused by the dispatcher
itself, so instead of making yet another sync hack in the dispatcher, or even DrainDispatcher
(however they have a common codebase for serviceStop at the moment), I'm about to put a safety
check, which solved my issue.
> jstack of a hang (it hangs [here|https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157])
> {code}
> "Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in Object.wait()
[0x000070000997b000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1252)
> - locked <0x00000007ae967340> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1326)
> at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae642cc8> (a java.lang.Object)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
> at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
> at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
> - locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> - locked <0x00000007ae53d5f0> (a java.lang.Object)
> at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
> at org.apache.tez.client.TezClient.stop(TezClient.java:775)
> - locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
> at org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {code}
> during the investigation I added some log messages, that lead to this solution:
> 1. there is not a single event in the queue
> 2. event handling thread is in BLOCKED state
> 3. AM is already shut down, but test hangs as LocalClient.stop waits to AM services to
be stopped
> {code}
> 2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148))
- AsyncDispatcher is draining to stop, ignoring any new events.
> ...
> 2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> ...
> 2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1002))
- Calling stop for all the services
> 2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1013))
- Exiting DAGAppMaster..GoodBye!
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158))
- Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
> 2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(153))
- Cannot wait for drain events anymore, logging current events if any...
> 2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:logCurrentEvents(186))
- Current events in the queue: 0
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(168))
- interrupting event handling thread, state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(170))
- interrupted event handling thread, state: BLOCKED
> 2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(172))
- joining event handling thread, state: BLOCKED
> {code}
> under normal circumstances the event handling thread is in WAITING state (which I think
is because of queue.take)
> + I also included the timeout logic of YARN-3999 here



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message