hadoop-yarn-issues mailing list archives

From "Xuan Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
Date Tue, 14 Jul 2015 18:35:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626826#comment-14626826 ]

Xuan Gong commented on YARN-3893:
---------------------------------

Thanks, [~varun_saxena] and [~sunilg]. I am fine with adding a new internal state, although
it might be too complex. If we can handle it correctly, I have no objection.

For this specific issue, I think there are at least two things we should do here:
1) stop all active services
2) transition to standby (basically, set the RM state in RMContext to Standby)
But we also need to reinitialize all the active services to prepare for the next transitionToActive
call.
At the very least, we should do:
{code}
rm.transitionToStandby(false);
reinitiateActiveService();
{code}
Here reinitiateActiveService() can throw the same exception, so I can see why this alone
does not solve the whole problem.

How about we introduce a new AtomicBoolean flag to track whether we need to reinitialize the active
services? We could then add the following to the transitionToActive logic:
{code}
    if (reinitiateRequired.get()) {
       reinitiateActiveService();
    }
{code}
before we start all the active services.
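
The proposal above can be sketched roughly as follows. This is a minimal, self-contained illustration and not actual ResourceManager code: the class RmHaSketch, the Step interface, and the two injected hooks (standing in for reinitiateActiveService() and refreshAll()) are hypothetical names used only to show the control flow. It assumes that a failed refresh should drop the RM back to standby, and that a failed reinitiation is deferred to the next transitionToActive call by setting the flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the RM HA transitions; not actual YARN code. */
class RmHaSketch {
    enum HaState { STANDBY, ACTIVE }

    /** Hypothetical hook for a step that may fail, e.g. on a bad configuration. */
    interface Step { void run() throws Exception; }

    private volatile HaState state = HaState.STANDBY;
    // Set when reinitiating the active services failed and must be retried later.
    private final AtomicBoolean reinitiateRequired = new AtomicBoolean(false);
    private final Step reinitiateActiveService; // assumed equivalent of reinitiateActiveService()
    private final Step refreshAll;              // assumed equivalent of refreshAll()

    RmHaSketch(Step reinitiateActiveService, Step refreshAll) {
        this.reinitiateActiveService = reinitiateActiveService;
        this.refreshAll = refreshAll;
    }

    HaState getState() { return state; }
    boolean isReinitiateRequired() { return reinitiateRequired.get(); }

    synchronized void transitionToActive() throws Exception {
        if (state == HaState.ACTIVE) {
            return;
        }
        // Retry a reinitiation that an earlier failure deferred. If it throws
        // again, the flag stays set and this RM remains in standby.
        if (reinitiateRequired.get()) {
            reinitiateActiveService.run();
            reinitiateRequired.set(false);
        }
        state = HaState.ACTIVE; // stands for "start all the active services"
        try {
            refreshAll.run();
        } catch (Exception e) {
            transitionToStandby(); // never stay active after a failed refresh
            throw e;
        }
    }

    synchronized void transitionToStandby() {
        // 1) stop all active services, 2) record the standby state.
        state = HaState.STANDBY;
        // 3) Try to reinitiate now; on failure, defer it instead of propagating.
        try {
            reinitiateActiveService.run();
            reinitiateRequired.set(false);
        } catch (Exception e) {
            reinitiateRequired.set(true);
        }
    }

    public static void main(String[] args) {
        boolean[] badConfig = { true }; // simulates a wrongly configured scheduler XML
        Step failing = () -> {
            if (badConfig[0]) throw new IllegalStateException("bad configuration");
        };
        RmHaSketch rm = new RmHaSketch(failing, failing);
        try {
            rm.transitionToActive();
        } catch (Exception e) {
            System.out.println("transition failed, state=" + rm.getState());
        }
        badConfig[0] = false; // configuration fixed by the admin
        try {
            rm.transitionToActive();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        System.out.println("state=" + rm.getState());
    }
}
```

In this sketch the flag is cleared only after reinitiation succeeds, so a configuration that is still broken keeps the RM in standby rather than leaving two RMs active.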

> Both RM in active state when Admin#transitionToActive failure from refeshAll()
> ------------------------------------------------------------------------------
>
>                 Key: YARN-3893
>                 URL: https://issues.apache.org/jira/browse/YARN-3893
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: yarn-site.xml
>
>
> Cases that can cause this:
> # Capacity scheduler XML is wrongly configured during the switch
> # Refresh ACL failure due to configuration
> # Refresh user group failure due to configuration
> Both RMs will then continuously try to become active
> {code}
> dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin> ./yarn rmadmin  -getServiceState rm1
> 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> active
> dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin> ./yarn rmadmin  -getServiceState rm2
> 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> active
> {code}
> # Both web UIs are active
> # Status is shown as active for both RMs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
