hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
Date Fri, 21 Mar 2014 00:10:46 GMT

     [ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-1861:

    Component/s: resourcemanager
       Assignee: Vinod Kumar Vavilapalli

I debugged this for a while with Arpit's help. I think this can happen because we have two
ZooKeeper sessions inside the RM, and one of them can fail while the other is still alive.

In this case, RM1 lost the ZK session inside the ZKRMStateStore, but the session inside the
leader-election code was still active. RM1 thus got stuck in standby mode. RM2 was already
in standby mode, so the cluster was stuck with no active RM.

When I manually deleted the ZK locks, leader election kicked back in and RM1 itself became
active again.
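
To make the sequence above concrete, here is a toy Java model of it. This is only a sketch,
not Hadoop or ZooKeeper code: the class and method names (TwoSessionModel, LeaderLock,
ResourceManager, runElection, loseStoreSession) are all hypothetical, and it assumes that
the elector only acts when lock ownership changes and that an RM opens a fresh store
session when it becomes active.

```java
public class TwoSessionModel {

    /** Stand-in for the ZK leader lock; records which RM currently holds it. */
    public static final class LeaderLock {
        private String holder; // null means the lock is free

        /** Returns true only when the caller newly acquires a free lock. */
        public boolean tryAcquire(String rm) {
            if (holder != null) {
                return false;
            }
            holder = rm;
            return true;
        }

        /** Models an operator deleting the lock by hand. */
        public void delete() {
            holder = null;
        }
    }

    /** An RM with one session for leader election and another for the state store. */
    public static final class ResourceManager {
        public final String name;
        public boolean active = false;
        public boolean storeSessionAlive = true;

        public ResourceManager(String name) {
            this.name = name;
        }

        /** Assumption: the elector only acts when lock ownership changes. */
        public void runElection(LeaderLock lock) {
            if (lock.tryAcquire(name)) {
                storeSessionAlive = true; // assumed: fresh store session on activation
                active = true;
            }
        }

        /** The store session dies; the election session, and so the lock, survive. */
        public void loseStoreSession() {
            storeSessionAlive = false;
            active = false;
        }
    }

    /** Formats the state of each RM, e.g. "rm1=active, rm2=standby". */
    public static String describe(ResourceManager... rms) {
        StringBuilder sb = new StringBuilder();
        for (ResourceManager rm : rms) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(rm.name).append('=').append(rm.active ? "active" : "standby");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        LeaderLock lock = new LeaderLock();
        ResourceManager rm1 = new ResourceManager("rm1");
        ResourceManager rm2 = new ResourceManager("rm2");

        rm1.runElection(lock);
        rm2.runElection(lock);
        System.out.println("after startup:        " + describe(rm1, rm2));

        rm1.loseStoreSession(); // lock is still held through the election session
        rm1.runElection(lock);
        rm2.runElection(lock);
        System.out.println("after store loss:     " + describe(rm1, rm2));

        lock.delete();          // the manual workaround described above
        rm1.runElection(lock);
        rm2.runElection(lock);
        System.out.println("after lock deletion:  " + describe(rm1, rm2));
    }
}
```

In this model the middle step leaves both RMs in standby because the lock never changes
hands, which is the stuck state reported here. Deleting the lock is what lets an election
round succeed again.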

> Both RM stuck in standby mode when automatic failover is enabled
> ----------------------------------------------------------------
>                 Key: YARN-1861
>                 URL: https://issues.apache.org/jira/browse/YARN-1861
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Vinod Kumar Vavilapalli
> In our HA tests we noticed that the tests got stuck because both RMs got into standby
> state and neither became active.
