ambari-dev mailing list archives

From "Jayush Luniya (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-13396) RU: Handle Namenode being down scenarios
Date Wed, 14 Oct 2015 22:46:05 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957950#comment-14957950 ]

Jayush Luniya commented on AMBARI-13396:
----------------------------------------

Trunk:
commit c00908495953e7c725bd49b7a124883d12621324
Author: Jayush Luniya <jluniya@hortonworks.com>
Date:   Wed Oct 14 15:43:34 2015 -0700

    AMBARI-13396: RU: Handle Namenode being down scenarios (jluniya)

Branch-2.1
commit dddc760f30557a552f0fc8e25c9941d9717ece7c
Author: Jayush Luniya <jluniya@hortonworks.com>
Date:   Wed Oct 14 15:43:34 2015 -0700

    AMBARI-13396: RU: Handle Namenode being down scenarios (jluniya)

> RU: Handle Namenode being down scenarios
> ----------------------------------------
>
>                 Key: AMBARI-13396
>                 URL: https://issues.apache.org/jira/browse/AMBARI-13396
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.1.2
>            Reporter: Jayush Luniya
>            Assignee: Jayush Luniya
>             Fix For: 2.1.3
>
>         Attachments: AMBARI-13396.patch
>
>
> There are two scenarios that need to be handled during RU.
> *Setup:*
> * host1: namenode1, host2: namenode2
> * namenode1 on host1 is down
> *Scenario 1: During RU, namenode1 on host1 is going to be upgraded before namenode2 on host2*
> Since namenode1 on host1 is already down, namenode2 is the active namenode. The logic should simply restart namenode1, as namenode2 will remain active.
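> A minimal sketch of that check (assuming the {{resource_management}} shell.call helper that appears in the log below; the "nn2" service id comes from the setup above and the restart step is only named for illustration):
> {code}
> # Sketch only: skip the failover when the peer NameNode is already active.
> from resource_management.core import shell
>
> def peer_namenode_is_active(peer_service_id, hdfs_user='hdfs'):
>     # "hdfs haadmin -getServiceState <id>" prints "active" or "standby".
>     code, out = shell.call(
>         "hdfs haadmin -getServiceState %s" % peer_service_id,
>         user=hdfs_user, logoutput=True)
>     return code == 0 and 'active' in out
>
> # Scenario 1: namenode1 is already down and namenode2 is active, so no
> # failover is needed; go straight to restarting namenode1.
> if peer_namenode_is_active('nn2'):
>     restart_local_namenode()  # hypothetical restart step
> {code}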
> *Scenario 2: During RU, namenode2 on host2 is going to be upgraded before namenode1 on host1*
> Since namenode2 on host2 is the active namenode and there is no other namenode instance that can become active, we should fail. However, today we do the following:
> # Call "hdfs haadmin -failover nn2 nn1", which fails since nn1 is not healthy.
> # When this command fails, we kill ZKFC on this host and then wait for this instance to come back as standby, which will never happen because it comes back as active.

> We should simply fail when the "hdfs haadmin -failover" command fails instead of killing ZKFC.
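> A minimal sketch of that change (again assuming the {{resource_management}} helpers; the {{Fail}} exception and the function name are illustrative):
> {code}
> # Sketch only: abort the upgrade step when the graceful failover fails,
> # instead of killing ZKFC and waiting for a standby that never appears.
> from resource_management.core import shell
> from resource_management.core.exceptions import Fail
>
> def failover_or_fail(from_id, to_id, hdfs_user='hdfs'):
>     code, out = shell.call(
>         "hdfs haadmin -failover %s %s" % (from_id, to_id),
>         user=hdfs_user, logoutput=True)
>     if code != 0:
>         # Scenario 2: nn1 is not healthy, so it cannot become the failover
>         # target and the command returns 255. Fail here; do not kill ZKFC.
>         raise Fail("Rolling Upgrade - failover from %s to %s returned %d"
>                    % (from_id, to_id, code))
> {code}
> The log below shows today's behavior: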
> {noformat}
> 2015-10-12 22:35:15,307 - Rolling Upgrade - Initiating a ZKFC failover on active NameNode host jay-ams-2.c.pramod-thangali.internal.
> 2015-10-12 22:35:15,308 - call['hdfs haadmin -failover nn2 nn1'] {'logoutput': True, 'user': 'hdfs'}
> Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently healthy. Cannot be failover target
> 	at org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)
> 	at org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)
> 	at org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)
> 	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)
> 	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)
> 	at org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)
> 	at org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)
> 	at org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> 2015-10-12 22:35:17,748 - call returned (255, 'Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently healthy. Cannot be failover target\n\tat org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)\n\tat org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)\n\tat org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)\n\tat org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)\n\tat org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)\n\tat org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)')
> 2015-10-12 22:35:17,748 - Rolling Upgrade - failover command returned 255
> 2015-10-12 22:35:17,749 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ls /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid > /dev/null 2>&1 && ps -p `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid` > /dev/null 2>&1''] {}
> 2015-10-12 22:35:17,777 - call returned (0, '')
> 2015-10-12 22:35:17,778 - Execute['kill -15 `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`'] {'user': 'hdfs'}
> 2015-10-12 22:35:17,803 - File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid'] {'action': ['delete']}
> 2015-10-12 22:35:17,803 - Deleting File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid']
> 2015-10-12 22:35:17,803 - call['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput': True, 'user': 'hdfs'}
> 2015-10-12 22:35:20,922 - call returned (1, '')
> 2015-10-12 22:35:20,923 - Rolling Upgrade - check for standby returned 1
> 2015-10-12 22:35:20,923 - Waiting for this NameNode to become the standby one.
> 2015-10-12 22:35:20,923 - Execute['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput': True, 'tries': 50, 'user': 'hdfs', 'try_sleep': 6}
> 2015-10-12 22:35:23,135 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:31,388 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:39,709 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:47,992 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:56,289 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:36:04,627 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
