hadoop-hdfs-issues mailing list archives

From "shenxingfeng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-8221) HDFS have two Standby NNs because ActiveStandbyElectorLock ephemeralOwner in ZK is different with the sessionId stored in ZKFC
Date Wed, 22 Apr 2015 09:42:59 GMT

     [ https://issues.apache.org/jira/browse/HDFS-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shenxingfeng updated HDFS-8221:
-------------------------------
    Description: 
Initially, NN1 is active and NN2 is standby. When NN1 becomes standby for some reason, NN2
takes over the active state immediately. But after becoming active, NN2 changes back to standby,
and HDFS is left with two standby NNs indefinitely.
After checking the log, I found that NN2 became standby because its session ID differs from the
ActiveStandbyElectorLock ephemeralOwner stored in the znode.
The root cause is that when NN1 goes to standby, NN2 creates a session A with ZooKeeper and
becomes active. Ideally, NN2's session ID should match the ActiveStandbyElectorLock ephemeralOwner
stored in the znode, but a network problem can cause the session ID held by NN2's ZKFC to change.
So I think that when NN2 becomes standby because of a session ID mismatch, it should release the
lock in the znode so that failover can happen again.


ActiveStandbyElector.processResult
==================
    Code code = Code.get(rc);
    if (isSuccess(code)) {
      // the following owner check completes verification in case the lock znode
      // creation was retried
      if (stat.getEphemeralOwner() == zkClient.getSessionId()) {
        // we own the lock znode. so we are the leader
        if (!becomeActive()) {
          reJoinElectionAfterFailureToBecomeActive();
        }
      } else {
        // we dont own the lock znode. so we are a standby.
        becomeStandby();
      }
      // the watch set by us will notify about changes
      return;
    }
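
To make the stuck state concrete, here is a minimal in-memory sketch of the owner check above,
written in plain Java without the ZooKeeper client. All names in it (LockModel, ElectorSketch,
elect, decide) are hypothetical and invented for illustration, the second session ID is made up,
and the final delete step is only one reading of the proposal in the description, not the actual
Hadoop behavior or fix.

```java
import java.util.OptionalLong;

// Simplified in-memory model of the lock-owner check; no ZooKeeper client is involved.
// Every name here (LockModel, ElectorSketch, elect, decide) is hypothetical.
final class LockModel {
    private OptionalLong owner = OptionalLong.empty();

    /** Creates the lock for sessionId if it is free; returns the current owner either way. */
    long createOrGetOwner(long sessionId) {
        if (!owner.isPresent()) {
            owner = OptionalLong.of(sessionId);
        }
        return owner.getAsLong();
    }

    /** Removes the lock regardless of which session created it. */
    void delete() {
        owner = OptionalLong.empty();
    }
}

final class ElectorSketch {
    enum State { ACTIVE, STANDBY }

    /** Same comparison as in processResult: we lead only if we own the lock znode. */
    static State decide(long ephemeralOwner, long currentSessionId) {
        return ephemeralOwner == currentSessionId ? State.ACTIVE : State.STANDBY;
    }

    /** One election attempt: try to take the lock, then run the owner check. */
    static State elect(LockModel lock, long sessionId) {
        return decide(lock.createOrGetOwner(sessionId), sessionId);
    }

    public static void main(String[] args) {
        long sessionA = 0x164cb2b3e4b36ae4L; // ephemeralOwner value from the report
        long sessionB = sessionA + 1;        // made-up ID standing in for the changed session

        LockModel lock = new LockModel();
        System.out.println(elect(lock, sessionA)); // ACTIVE: NN2 owns the lock
        System.out.println(elect(lock, sessionB)); // STANDBY: lock still names session A
        System.out.println(elect(lock, sessionB)); // STANDBY: nothing frees the lock

        // Proposed behavior, as read from the description: drop the stale lock
        // so that the next election attempt can succeed.
        lock.delete();
        System.out.println(elect(lock, sessionB)); // ACTIVE
    }
}
```

The sketch leaves out the question of whether deleting the lock is safe. A real change would
need to be sure that the recorded owner is NN2's own earlier session and not another live node.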

ActiveStandbyElectorLock content
==================
[zk: 160.149.0.114:24002(CONNECTED) 1] get /hadoop-ha/hacluster/ActiveStandbyElectorLock

160-149-0-117 (followed by non-printable bytes)
cZxid = 0x2000a38d9
ctime = Thu Apr 16 11:32:54 CST 2015
mZxid = 0x2000a38d9
mtime = Thu Apr 16 11:32:54 CST 2015
pZxid = 0x2000a38d9
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x164cb2b3e4b36ae4
dataLength = 38
numChildren = 0
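
The output above prints ephemeralOwner in hexadecimal, while zkClient.getSessionId() returns a
long, so the two values are easy to misread when comparing them by hand against log lines. The
helper below is a small sketch using only the JDK. The class and method names (SessionIds,
parseHex, toHex) are hypothetical and not part of Hadoop or ZooKeeper.

```java
// Hypothetical helper for comparing session IDs copied from zkCli output or logs.
final class SessionIds {
    /** Parses "0x164cb2b3e4b36ae4" or "164cb2b3e4b36ae4" into a long. */
    static long parseHex(String text) {
        String digits = text.trim();
        if (digits.startsWith("0x") || digits.startsWith("0X")) {
            digits = digits.substring(2);
        }
        // parseUnsignedLong accepts values with the top bit set, which parseLong rejects.
        return Long.parseUnsignedLong(digits, 16);
    }

    /** Formats a session ID the way it appears in the output above. */
    static String toHex(long sessionId) {
        return "0x" + Long.toHexString(sessionId);
    }

    public static void main(String[] args) {
        long owner = parseHex("0x164cb2b3e4b36ae4"); // ephemeralOwner from the output above
        System.out.println(toHex(owner));            // prints 0x164cb2b3e4b36ae4
    }
}
```

With both values as long, the comparison is the same == test that processResult performs.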


> HDFS have two Standby NNs because ActiveStandbyElectorLock ephemeralOwner in ZK is different
with the sessionId stored in ZKFC
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8221
>                 URL: https://issues.apache.org/jira/browse/HDFS-8221
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1
>            Reporter: shenxingfeng
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
