hadoop-yarn-issues mailing list archives

From "Rohith (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
Date Mon, 22 Sep 2014 13:36:33 GMT

    [ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143200#comment-14143200 ]

Rohith commented on YARN-2579:
------------------------------

This scenario can occur when two threads call ResourceManager#transitionToStandby() concurrently:
first AdminService#transitionToStandby() and then RMFatalEventDispatcher#handle(). I reproduced
it using a debugger breakpoint.
The root cause is resetDispatcher(), which stops the dispatcher. If AdminService is stopping the
dispatcher (and therefore joining the dispatcher thread) while that thread is blocked trying to
acquire the lock on ResourceManager, which AdminService's thread already holds, neither thread can
proceed. The ResourceManager never transitions to standby, and the admin thread waits indefinitely.
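The hang can be modelled in a few lines of Java. This is a hypothetical sketch: `FakeRm`, its `transitionToStandby(boolean)` and `resetDispatcher()` only mirror the shape of the real `ResourceManager` and `AsyncDispatcher` code, and stopping the dispatcher is modelled as interrupt-then-join. The admin path holds the RM monitor while joining the handler thread, and the handler is waiting for that same monitor; interrupting it does not help, because a thread blocked on monitor entry is not released by an interrupt. A bounded `join(1000)` stands in for the unbounded `join()` seen in the dump so that the program ends.

```java
// Hypothetical model of the hang described above: names mirror the real
// classes, but this is not the ResourceManager or AsyncDispatcher code.
public class StandbyDeadlockSketch {

    static class FakeRm {
        private boolean active = true;
        // Stands in for the AsyncDispatcher event-handler thread, which handles
        // a fatal event by calling the synchronized transitionToStandby().
        final Thread dispatcherThread =
                new Thread(() -> transitionToStandby(false), "AsyncDispatcher event handler");
        // Set by resetDispatcher(): was the handler still alive when the join timed out?
        boolean dispatcherStuck;

        // Synchronized on the RM instance, like ResourceManager.transitionToStandby().
        synchronized void transitionToStandby(boolean fromAdmin) {
            if (!active) {
                return; // already standby, nothing to do
            }
            if (fromAdmin) {
                // Start the handler while this (admin) thread holds the RM monitor,
                // so the fatal-event call is guaranteed to block on monitor entry.
                dispatcherThread.start();
                resetDispatcher();
            }
            active = false;
        }

        // Model of stopping the dispatcher: interrupt the handler thread, then join it.
        private void resetDispatcher() {
            dispatcherThread.interrupt(); // does not release a thread BLOCKED on a monitor
            try {
                // Bounded so the sketch terminates; an unbounded join() would wait forever.
                dispatcherThread.join(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            dispatcherStuck = dispatcherThread.isAlive();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeRm rm = new FakeRm();
        rm.transitionToStandby(true); // admin (IPC handler) path, run on the main thread
        System.out.println("handler still blocked after join timeout: " + rm.dispatcherStuck);
        // Once the admin path releases the monitor, the handler gets in, sees standby, and exits.
        rm.dispatcherThread.join();
    }
}
```

Running `main` prints `handler still blocked after join timeout: true`: the join gives up only because it is bounded, whereas the real code would keep waiting.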

{code}
"AsyncDispatcher event handler" daemon prio=10 tid=0x00000000007ea000 nid=0x39d1 waiting for monitor entry [0x00007fe0a77f6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:976)
	- waiting to lock <0x00000000c1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:701)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:678)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 0 on 45021" daemon prio=10 tid=0x00007fe0a9026800 nid=0x30ab in Object.wait() [0x00007fe0a7cfa000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000eb3310e8> (a java.lang.Thread)
	at java.lang.Thread.join(Thread.java:1281)
	- locked <0x00000000eb3310e8> (a java.lang.Thread)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
	at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
	- locked <0x00000000eb32fef8> (a java.lang.Object)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetDispatcher(ResourceManager.java:1166)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:987)
	- locked <0x00000000c1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:308)
	- locked <0x00000000c2038d10> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4462)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}
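The dump shows the cycle: the IPC handler holds the ResourceManager monitor `<0x00000000c1f7d438>` and waits in `Thread.join()` for the dispatcher thread, while that thread is BLOCKED on the same monitor. One generic way to break such a cycle is to keep the event-handler thread from taking the RM lock at all, for example by handing the transition to a separate thread. The sketch below illustrates that idea with a hypothetical `FakeRm` model of the RM and an assumed single-thread `ExecutorService`; it is not the change committed for YARN-2579.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of one way to break the cycle: the fatal-event handler
// hands the transition to another thread instead of taking the RM lock itself.
// Not claimed to be the fix committed for YARN-2579.
public class HandOffSketch {

    static class FakeRm {
        private boolean active = true;
        // Assumed single-thread executor that runs fatal-event transitions.
        final ExecutorService transitionExecutor = Executors.newSingleThreadExecutor();
        final Thread dispatcherThread =
                new Thread(this::handleFatalEvent, "AsyncDispatcher event handler");
        // Set by resetDispatcher(): was the handler still alive when the join timed out?
        boolean dispatcherStuck;

        // The handler queues the transition and returns at once, so it never
        // blocks on the RM monitor and can always be joined.
        void handleFatalEvent() {
            transitionExecutor.execute(() -> transitionToStandby(false));
        }

        // Synchronized on the RM instance, like ResourceManager.transitionToStandby().
        synchronized void transitionToStandby(boolean fromAdmin) {
            if (!active) {
                return; // already standby, nothing to do
            }
            if (fromAdmin) {
                dispatcherThread.start(); // fatal event arrives while the admin holds the lock
                resetDispatcher();
            }
            active = false;
        }

        // Model of stopping the dispatcher: interrupt the handler thread, then join it.
        private void resetDispatcher() {
            dispatcherThread.interrupt();
            try {
                dispatcherThread.join(1000); // bounded only so a regression cannot hang forever
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            dispatcherStuck = dispatcherThread.isAlive();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeRm rm = new FakeRm();
        rm.transitionToStandby(true); // admin (IPC handler) path, run on the main thread
        System.out.println("handler still blocked after join timeout: " + rm.dispatcherStuck);
        // The queued transition runs after the admin path returns and finds the RM already standby.
        rm.transitionExecutor.shutdown();
        rm.transitionExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Here the handler thread exits well before the join times out, so `dispatcherStuck` is `false`. The trade-off is that the fatal-event transition becomes asynchronous, so the caller no longer knows when it has completed.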


> Both RM's state is Active , but 1 RM is not really active.
> ----------------------------------------------------------
>
>                 Key: YARN-2579
>                 URL: https://issues.apache.org/jira/browse/YARN-2579
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.5.1
>            Reporter: Rohith
>
> I encountered a situation where both RMs' web pages were accessible and both displayed their
> state as Active. However, one of the RMs' ActiveServices had been stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
