ambari-dev mailing list archives

From "Jayush Luniya (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-13396) RU: Handle Namenode being down scenarios
Date Tue, 13 Oct 2015 01:31:05 GMT
Jayush Luniya created AMBARI-13396:
--------------------------------------

             Summary: RU: Handle Namenode being down scenarios
                 Key: AMBARI-13396
                 URL: https://issues.apache.org/jira/browse/AMBARI-13396
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 2.1.2
            Reporter: Jayush Luniya
            Assignee: Jayush Luniya
             Fix For: 2.1.3


There are two scenarios that need to be handled during RU.

*Setup:*
* host1: namenode1, host2: namenode2
* namenode1 on host1 is down

*Scenario 1: During RU, namenode1 on host1 is going to be upgraded before namenode2 on host2*
Since namenode1 on host1 is already down, namenode2 is the active namenode. We should
fix the logic to simply restart namenode1, as namenode2 will remain active.
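The restart decision for the two scenarios could be sketched as follows. This is a hypothetical helper, not Ambari's actual upgrade code; the function name and both arguments are illustrative:

```python
def plan_namenode_restart(target_is_active, peer_is_healthy):
    """Decide how to handle a NameNode during a rolling upgrade.

    target_is_active: the NameNode about to be upgraded is currently active.
    peer_is_healthy:  the other NameNode instance is up and can take over.
    """
    if not target_is_active:
        # Scenario 1: target is down/standby, the peer stays active,
        # so a plain restart of the target is safe.
        return "restart"
    if peer_is_healthy:
        # Normal RU path: gracefully fail over to the peer, then restart.
        return "failover-then-restart"
    # Scenario 2: target is active and there is no healthy failover
    # target, so the upgrade step must fail instead of killing ZKFC.
    return "fail"
```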

*Scenario 2: During RU, namenode2 on host2 is going to be upgraded before namenode1 on host1*
Since namenode2 on host2 is active and there is no other namenode instance that can
become active, we should fail. However, today we do the following:
# Call "hdfs haadmin -failover nn2 nn1" which will fail since nn1 is not healthy.
# When this command fails, we kill ZKFC on this host and then wait for this instance to
come back as standby, which will never happen because it will come back as active.


We should simply fail when the "haadmin -failover" command fails, instead of killing ZKFC.
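A minimal sketch of the proposed fix, assuming a plain subprocess call (this is not Ambari's real resource-management code; the function name and the `hdfs_cmd` parameter are illustrative, the latter only a seam for testing):

```python
import subprocess

def initiate_failover(from_nn, to_nn, hdfs_cmd="hdfs"):
    """Run a graceful failover and fail fast if it is rejected.

    If "hdfs haadmin -failover" returns non-zero (e.g. the target
    NameNode is not healthy), raise immediately instead of falling
    back to killing ZKFC and waiting for a standby that never comes.
    """
    result = subprocess.run(
        [hdfs_cmd, "haadmin", "-failover", from_nn, to_nn],
        capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            "haadmin -failover %s %s failed (rc=%d): %s"
            % (from_nn, to_nn, result.returncode, result.stderr.strip()))
```

With this behavior, the log below would stop right after "failover command returned 255" rather than proceeding to kill ZKFC and loop on the standby check.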

{noformat}
2015-10-12 22:35:15,307 - Rolling Upgrade - Initiating a ZKFC failover on active NameNode
host jay-ams-2.c.pramod-thangali.internal.
2015-10-12 22:35:15,308 - call['hdfs haadmin -failover nn2 nn1'] {'logoutput': True, 'user':
'hdfs'}
Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not
currently healthy. Cannot be failover target
	at org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)
	at org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)
	at org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)
	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)
	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)
	at org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)
	at org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)
	at org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)

2015-10-12 22:35:17,748 - call returned (255, 'Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020
is not currently healthy. Cannot be failover target\n\tat org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)\n\tat
org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)\n\tat
org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)\n\tat
org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)\n\tat java.security.AccessController.doPrivileged(Native
Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat
org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)\n\tat
org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)\n\tat org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)\n\tat
org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)\n\tat
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)\n\tat
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)\n\tat java.security.AccessController.doPrivileged(Native
Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)')
2015-10-12 22:35:17,748 - Rolling Upgrade - failover command returned 255
2015-10-12 22:35:17,749 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ls /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid
> /dev/null 2>&1 && ps -p `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`
> /dev/null 2>&1''] {}
2015-10-12 22:35:17,777 - call returned (0, '')
2015-10-12 22:35:17,778 - Execute['kill -15 `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`']
{'user': 'hdfs'}
2015-10-12 22:35:17,803 - File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid'] {'action': ['delete']}
2015-10-12 22:35:17,803 - Deleting File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid']
2015-10-12 22:35:17,803 - call['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput':
True, 'user': 'hdfs'}
2015-10-12 22:35:20,922 - call returned (1, '')
2015-10-12 22:35:20,923 - Rolling Upgrade - check for standby returned 1
2015-10-12 22:35:20,923 - Waiting for this NameNode to become the standby one.
2015-10-12 22:35:20,923 - Execute['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput':
True, 'tries': 50, 'user': 'hdfs', 'try_sleep': 6}
2015-10-12 22:35:23,135 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState
nn2 | grep standby' returned 1. 
2015-10-12 22:35:31,388 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState
nn2 | grep standby' returned 1. 
2015-10-12 22:35:39,709 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState
nn2 | grep standby' returned 1. 
2015-10-12 22:35:47,992 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState
nn2 | grep standby' returned 1. 
2015-10-12 22:35:56,289 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState
nn2 | grep standby' returned 1. 
2015-10-12 22:36:04,627 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState
nn2 | grep standby' returned 1. 
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
