hadoop-hdfs-dev mailing list archives

From "Zesheng Wu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-6827) NameNode double standby
Date Wed, 06 Aug 2014 05:37:12 GMT
Zesheng Wu created HDFS-6827:

             Summary: NameNode double standby
                 Key: HDFS-6827
                 URL: https://issues.apache.org/jira/browse/HDFS-6827
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ha
    Affects Versions: 2.4.1
            Reporter: Zesheng Wu
            Assignee: Zesheng Wu

In our production cluster we encountered the following scenario: the active NameNode (ANN)
crashed due to a journal write timeout and was restarted automatically by the watchdog, but
after the restart both NNs were in standby state.

The following logs show the scenario:
# NN1 went down due to a journal write timeout:
{color:red}2014-08-03,23:02:02,219{color} INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
# ZKFC1 detected "connection reset by peer"
{color:red}2014-08-03,23:02:02,560{color} ERROR org.apache.hadoop.security.UserGroupInformation:
PriviledgedActionException as:xx@xx.HADOOP (auth:KERBEROS) cause:java.io.IOException: {color:red}Connection
reset by peer{color}
# NN1 was restarted successfully by the watchdog:
2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up
at: xx:13201
2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
{color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 13200: starting
2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean thread started!
2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Registered DFSClientInformation
2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode up
at: xx/xx:13200
2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting
services required for standby state
# ZKFC1 retried the connection and considered NN1 healthy:
{color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xx/xx:13200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
sleepTime=1 SECONDS)
# ZKFC1 still considered NN1 a healthy active NN and did not trigger a failover. As a
result, both NNs remained in standby state.

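The root cause of this bug is that the NN restarted so quickly that the ZKFC health monitor
never noticed it had gone down: the retried connection succeeded against the new NN process,
which had come up in standby state.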
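
To make the race concrete, here is a minimal Python sketch of a monitor that only checks
reachability versus one that also tracks which process it is talking to. Everything in it
(FakeNameNode, NaiveMonitor, RestartAwareMonitor, the incarnation counter) is a hypothetical
toy model, not Hadoop code and not necessarily how this issue should be fixed. It assumes
the crash and restart both complete between two health checks.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical counter that gives every process start a fresh identity.
_incarnations = count(1)


@dataclass
class FakeNameNode:
    """Toy stand-in for a NameNode process (not a Hadoop API)."""
    state: str = "active"
    up: bool = True
    incarnation: int = field(default_factory=lambda: next(_incarnations))

    def crash(self):
        self.up = False

    def restart(self):
        # A restarted NameNode comes back in standby state.
        self.up = True
        self.state = "standby"
        self.incarnation = next(_incarnations)


class NaiveMonitor:
    """Only asks whether the health check succeeds."""

    def __init__(self, nn):
        self.nn = nn

    def check(self):
        return "healthy" if self.nn.up else "unhealthy"


class RestartAwareMonitor:
    """Also remembers which process incarnation it last talked to."""

    def __init__(self, nn):
        self.nn = nn
        self.last_incarnation = nn.incarnation

    def check(self):
        if not self.nn.up:
            return "unhealthy"
        if self.nn.incarnation != self.last_incarnation:
            self.last_incarnation = self.nn.incarnation
            return "restarted"
        return "healthy"


def simulate(monitor_cls):
    """Run one check, then crash and restart the NN before the next check."""
    nn = FakeNameNode()
    monitor = monitor_cls(nn)
    results = [monitor.check()]
    nn.crash()
    nn.restart()  # completes before the monitor looks again
    results.append(monitor.check())
    return results, nn.state


if __name__ == "__main__":
    print(simulate(NaiveMonitor))
    print(simulate(RestartAwareMonitor))
```

In this toy model the naive monitor reports "healthy" twice even though the NN is now in
standby, while the restart-aware monitor reports "restarted" on the second check, which a
failover controller could treat as a reason to re-run the election.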

This message was sent by Atlassian JIRA
