hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
Date Wed, 24 Sep 2014 14:43:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146381#comment-14146381 ]

Wangda Tan commented on YARN-2594:
----------------------------------

Hi [~devaraj.k],
Have you already looked into this? I think I have found the root cause of the problem;
could you assign this ticket to me?

This is a deadlock between the following two threads:
{code}
"IPC Server handler 45 on 8032" daemon prio=10 tid=0x00007f032909b000 nid=0x7bd7 waiting for
monitor entry [0x00007f0307aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:541)
	- waiting to lock <0x00000000e0e7ea70> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:196)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:703)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:569)
	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:294)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}

And 

{code}
"ResourceManager Event Processor" prio=10 tid=0x00007f0328db9800 nid=0x7aeb waiting on condition
[0x00007f0311a48000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e0e72bc0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getCurrentAppAttempt(RMAppImpl.java:476)
	at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:509)
	at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:495)
	at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:484)
	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	- locked <0x00000000e0e85318> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
	at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:373)
	at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:89)
	- locked <0x00000000e0e7ea70> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1433)
	- locked <0x00000000e01a57b8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1124)
	- locked <0x00000000e011aea0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:884)
	- locked <0x00000000e011aea0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:989)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:93)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:612)
	at java.lang.Thread.run(Thread.java:745)
{code}

In short, during container completion, SchedulerApplicationAttempt (while holding its own
lock) eventually tries to acquire the lock of RMAppAttempt to update resource usage metrics.
If, at the same time, an application client tries to get an application report, RMAppAttempt
tries to acquire the lock of SchedulerApplicationAttempt. Each thread then waits for a lock
that the other holds, which is the deadlock.
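
The lock-order inversion described above can be reproduced in isolation. The following is a
minimal sketch, not the ResourceManager code: the class name, the two lock fields and the
helper methods are all hypothetical, and two plain ReentrantLocks stand in for the object
monitor and the read/write lock seen in the thread dumps. It uses tryLock with a timeout so
that the demonstration reports the inversion instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderInversionSketch {
    // Hypothetical stand-ins for the two locks involved in the thread dumps.
    static final ReentrantLock appSideLock = new ReentrantLock();
    static final ReentrantLock schedulerSideLock = new ReentrantLock();

    /** Takes 'first', waits until the peer holds its own first lock, then tries 'second'. */
    static boolean acquireInOrder(ReentrantLock first, ReentrantLock second,
                                  CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();              // both threads now hold their first lock
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();                  // keep 'first' until the peer has tried too
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Returns true if neither thread could obtain its second lock. */
    static boolean bothThreadsBlocked() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];
        // "Report" path: app-side lock first, then scheduler-side lock.
        Thread report = new Thread(() -> gotSecond[0] =
                acquireInOrder(appSideLock, schedulerSideLock, bothHoldFirst, bothTried));
        // "Container completion" path: the same two locks in the opposite order.
        Thread completion = new Thread(() -> gotSecond[1] =
                acquireInOrder(schedulerSideLock, appSideLock, bothHoldFirst, bothTried));
        report.start();
        completion.start();
        try {
            report.join();
            completion.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !gotSecond[0] && !gotSecond[1];
    }

    public static void main(String[] args) {
        System.out.println("both threads blocked: " + bothThreadsBlocked());
    }
}
```

With untimed lock() calls in place of tryLock, the same interleaving would leave both threads
waiting forever, which matches the BLOCKED and WAITING states in the dumps.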

A simple solution is to remove synchronized from SchedulerApplicationAttempt#getRunningAggregateAppResourceUsage
and use java.util.concurrent.atomic variables instead. That eliminates the RMAppAttemptImpl
-> SchedulerApplicationAttempt lock acquisition when getting an ApplicationReport.
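
As an illustration of that idea, here is a minimal sketch of a usage counter that can be read
without taking any monitor. It is not the actual patch: the class AtomicUsageSketch, the
nested Usage type and the method names are hypothetical, and an AtomicReference to an
immutable snapshot is just one possible choice among the java.util.concurrent.atomic classes.

```java
import java.util.concurrent.atomic.AtomicReference;

class AtomicUsageSketch {
    /** Immutable usage snapshot; a hypothetical stand-in for the real report type. */
    static final class Usage {
        final long memoryMb;
        final int vcores;

        Usage(long memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    private final AtomicReference<Usage> usage = new AtomicReference<>(new Usage(0, 0));

    /** Called on the scheduler side when a container is allocated. */
    void add(long memoryMb, int vcores) {
        // updateAndGet retries on contention, so no update is lost.
        usage.updateAndGet(u -> new Usage(u.memoryMb + memoryMb, u.vcores + vcores));
    }

    /** Called on the scheduler side when a container completes. */
    void subtract(long memoryMb, int vcores) {
        add(-memoryMb, -vcores);
    }

    /** Called on the report side; does not block on any lock. */
    Usage getUsage() {
        return usage.get();
    }

    public static void main(String[] args) {
        AtomicUsageSketch sketch = new AtomicUsageSketch();
        sketch.add(2048, 2);
        sketch.subtract(1024, 1);
        Usage u = sketch.getUsage();
        System.out.println(u.memoryMb + " MB, " + u.vcores + " vcores");
    }
}
```

Because the reader never waits for the writer's monitor, the report path cannot take part in
a lock cycle with the container-completion path. Keeping both fields in one immutable
snapshot also avoids reading a memory value and a vcore value from different updates.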


> ResourceManger sometimes become un-responsive
> ---------------------------------------------
>
>                 Key: YARN-2594
>                 URL: https://issues.apache.org/jira/browse/YARN-2594
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Karam Singh
>            Assignee: Devaraj K
>
> ResourceManager sometimes becomes unresponsive:
> There was no exception in the ResourceManager log; it contains only the following type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
