hadoop-hdfs-issues mailing list archives

From "Vinayakumar B (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6827) Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the target's status changing sometimes
Date Thu, 21 Aug 2014 12:10:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105330#comment-14105330 ]

Vinayakumar B commented on HDFS-6827:

Hi [~wuzesheng],

Please check {{ZKFailoverController#verifyChangedServiceState(..)}}; it is called on every
health-check callback, which happens every 1 second by default.
In your case:
1. The service is HEALTHY, even though the NN was restarted within this 1-second interval.
2. After the NN restart, the first health-check callback detects the state change of the ANN
in {{ZKFailoverController#verifyChangedServiceState(..)}}, and the ZKFC quits the election,
marking {{quitElectionOnBadState}} true. Now the other STANDBY ZKFC has a chance to
become ACTIVE.
3. The next health-check callback calls {{recheckElectability()}}, which in turn makes the
ZKFC rejoin the election if the service is still HEALTHY. Meanwhile, the other ZKFC will
have won the leader election and become ACTIVE.
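The quit/rejoin flow in the steps above can be sketched roughly as below. This is a simplified, hypothetical illustration, not the real {{ZKFailoverController}} code; the field and method names only mirror the ones mentioned in this comment:

```java
// Simplified sketch of the ZKFC callback flow described above.
// Assumption: one ZKFC per NN; the peer ZKFC wins the election while we are out.
enum ServiceState { ACTIVE, STANDBY }

public class ZkfcFlowSketch {
    static ServiceState lastServiceState = ServiceState.ACTIVE; // ZKFC still remembers ANN
    static boolean quitElectionOnBadState = false;
    static boolean inElection = true;

    // Step 2: called on every health-check callback (every 1 s by default).
    static void verifyChangedServiceState(ServiceState current) {
        if (current != lastServiceState) {
            // State changed underneath us: quit the election so the
            // other (STANDBY) ZKFC can win leadership and go ACTIVE.
            quitElectionOnBadState = true;
            inElection = false;
            lastServiceState = current;
        }
    }

    // Step 3: the next callback rechecks electability; rejoin if still HEALTHY.
    static void recheckElectability(boolean healthy) {
        if (quitElectionOnBadState && healthy) {
            quitElectionOnBadState = false;
            inElection = true; // rejoin; the peer ZKFC has likely already won
        }
    }

    public static void main(String[] args) {
        // NN restarted and came back as STANDBY within the 1 s interval.
        verifyChangedServiceState(ServiceState.STANDBY);
        System.out.println("quit election: " + !inElection);
        recheckElectability(true); // next callback: service still HEALTHY
        System.out.println("rejoined: " + inElection);
    }
}
```

The key point the sketch shows is that quitting and rejoining happen on two successive callbacks, which is exactly the window in which the other ZKFC can take over.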

So after HADOOP-10251, I feel your problem will also be solved.
Have you tried the same scenario on the latest trunk code?

> Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the target's
status changing sometimes
> --------------------------------------------------------------------------------------------------------------
>                 Key: HDFS-6827
>                 URL: https://issues.apache.org/jira/browse/HDFS-6827
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.1
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>            Priority: Critical
>         Attachments: HDFS-6827.1.patch
> In our production cluster, we encountered a scenario like this: the ANN crashed due to a
write-journal timeout and was restarted automatically by the watchdog, but after the restart
both NNs were standby.
> Following is the logs of the scenario:
> # NN1 is down due to write journal timeout:
> {color:red}2014-08-03,23:02:02,219{color} INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> # ZKFC1 detected "connection reset by peer"
> {color:red}2014-08-03,23:02:02,560{color} ERROR org.apache.hadoop.security.UserGroupInformation:
PriviledgedActionException as:xx@xx.HADOOP (auth:KERBEROS) cause:java.io.IOException: {color:red}Connection
reset by peer{color}
> # NN1 was restarted successfully by the watchdog:
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server
up at: xx:13201
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
> {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 13200: starting
> 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean thread started!
> 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Registered
DFSClientInformation MBean
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode
up at: xx/xx:13200
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting
services required for standby state
> # ZKFC1 retried the connection and considered NN1 healthy
> {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: xx/xx:13200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
sleepTime=1 SECONDS)
> # ZKFC1 still considered NN1 a healthy active NN and didn't trigger a failover; as a
result, both NNs were standby.
> The root cause of this bug is that the NN is restarted so quickly that the ZKFC health
monitor never notices it went down.

This message was sent by Atlassian JIRA
