ambari-dev mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-13396) RU: Handle Namenode being down scenarios
Date Tue, 13 Oct 2015 03:54:06 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954327#comment-14954327 ]

Hadoop QA commented on AMBARI-13396:
------------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12766244/AMBARI-13396.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified
tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of
javac compiler warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in ambari-server.

Test results: https://builds.apache.org/job/Ambari-trunk-test-patch/3951//testReport/
Console output: https://builds.apache.org/job/Ambari-trunk-test-patch/3951//console

This message is automatically generated.

> RU: Handle Namenode being down scenarios
> ----------------------------------------
>
>                 Key: AMBARI-13396
>                 URL: https://issues.apache.org/jira/browse/AMBARI-13396
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.1.2
>            Reporter: Jayush Luniya
>            Assignee: Jayush Luniya
>             Fix For: 2.1.3
>
>         Attachments: AMBARI-13396.patch
>
>
> There are two scenarios that need to be handled during RU.
> *Setup:*
> * host1: namenode1, host2: namenode2
> * namenode1 on host1 is down
> *Scenario 1: During RU, namenode1 on host1 is going to be upgraded before namenode2 on host2*
> Since namenode1 on host1 is already down, namenode2 is the active namenode. The logic should be fixed to simply restart namenode1, as namenode2 will remain active.
> *Scenario 2: During RU, namenode2 on host2 is going to be upgraded before namenode1 on host1*
> Since namenode2 on host2 is active, we should fail, because there isn't another namenode instance that can become active. However, today we do the following:
> # Call "hdfs haadmin -failover nn2 nn1" which will fail since nn1 is not healthy.
> # When this command fails, we kill ZKFC on this host and then wait for this instance to come back as standby, which will never happen because it will come back as active.

> We should simply fail when the "haadmin -failover" command fails, instead of killing ZKFC.
> {noformat}
> 2015-10-12 22:35:15,307 - Rolling Upgrade - Initiating a ZKFC failover on active NameNode
host jay-ams-2.c.pramod-thangali.internal.
> 2015-10-12 22:35:15,308 - call['hdfs haadmin -failover nn2 nn1'] {'logoutput': True,
'user': 'hdfs'}
> Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020
is not currently healthy. Cannot be failover target
> 	at org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)
> 	at org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)
> 	at org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)
> 	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)
> 	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)
> 	at org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)
> 	at org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)
> 	at org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> 2015-10-12 22:35:17,748 - call returned (255, 'Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020
is not currently healthy. Cannot be failover target\n\tat org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)\n\tat
org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)\n\tat
org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)\n\tat
org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)\n\tat java.security.AccessController.doPrivileged(Native
Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat
org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)\n\tat
org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)\n\tat org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)\n\tat
org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)\n\tat
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)\n\tat
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)\n\tat java.security.AccessController.doPrivileged(Native
Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)')
> 2015-10-12 22:35:17,748 - Rolling Upgrade - failover command returned 255
> 2015-10-12 22:35:17,749 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ls /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid
> /dev/null 2>&1 && ps -p `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`
> /dev/null 2>&1''] {}
> 2015-10-12 22:35:17,777 - call returned (0, '')
> 2015-10-12 22:35:17,778 - Execute['kill -15 `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`']
{'user': 'hdfs'}
> 2015-10-12 22:35:17,803 - File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid'] {'action':
['delete']}
> 2015-10-12 22:35:17,803 - Deleting File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid']
> 2015-10-12 22:35:17,803 - call['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput':
True, 'user': 'hdfs'}
> 2015-10-12 22:35:20,922 - call returned (1, '')
> 2015-10-12 22:35:20,923 - Rolling Upgrade - check for standby returned 1
> 2015-10-12 22:35:20,923 - Waiting for this NameNode to become the standby one.
> 2015-10-12 22:35:20,923 - Execute['hdfs haadmin -getServiceState nn2 | grep standby']
{'logoutput': True, 'tries': 50, 'user': 'hdfs', 'try_sleep': 6}
> 2015-10-12 22:35:23,135 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin
-getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:35:31,388 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin
-getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:35:39,709 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin
-getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:35:47,992 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin
-getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:35:56,289 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin
-getServiceState nn2 | grep standby' returned 1. 
> 2015-10-12 22:36:04,627 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin
-getServiceState nn2 | grep standby' returned 1. 
> {noformat}
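The decision logic proposed in the description (plain restart when the NameNode being upgraded is not active, fail fast when the graceful failover command fails) could be sketched roughly as follows. This is an illustrative Python sketch, not the actual patch; the function and parameter names are hypothetical and do not reflect Ambari's real API.

```python
def prepare_namenode_for_upgrade(failover_cmd_rc, this_nn_is_active):
    """Decide what to do before restarting a NameNode during RU.

    failover_cmd_rc: return code of 'hdfs haadmin -failover <other> <this>'
    this_nn_is_active: whether the NameNode being upgraded is currently active
    """
    if not this_nn_is_active:
        # Scenario 1: this NameNode is down or standby, so the peer is
        # (or will stay) active; a plain restart is safe.
        return "restart"
    if failover_cmd_rc != 0:
        # Scenario 2: graceful failover failed, so no other NameNode can
        # take over. Fail fast instead of killing ZKFC and waiting for a
        # standby state that will never come.
        raise RuntimeError(
            "haadmin -failover failed (rc=%d); aborting upgrade "
            "instead of killing ZKFC" % failover_cmd_rc)
    # Failover succeeded; this NameNode is now standby and can be restarted.
    return "restart"
```

In the log above, the failover call returned 255, so under this sketch the upgrade step would abort immediately rather than kill ZKFC and retry `hdfs haadmin -getServiceState nn2 | grep standby` indefinitely.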



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
