hadoop-hdfs-issues mailing list archives

From "Vinayakumar B (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6827) Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the target's status changing sometimes
Date Thu, 21 Aug 2014 12:16:11 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105331#comment-14105331 ]

Vinayakumar B commented on HDFS-6827:

Update: In point #2, quitting the election will happen on the second healthcheck callback. Two
consecutive callbacks will be checked before quitting the election, to make sure that the state
mismatch is not caused by a parallel transition. On the third callback, {{recheckElectability(..)}}
will be called.
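
The callback sequence described above could be sketched roughly as follows. This is only an illustration and not the code from the attached patch: the class {{MismatchGuard}}, its {{Action}} enum, and the string state names are hypothetical, and the sketch assumes each healthcheck callback can compare the state the ZKFC expects with the state the NameNode reports.

```java
// Hypothetical sketch of the "two consecutive mismatches" rule; not Hadoop code.
class MismatchGuard {
    enum Action { NONE, QUIT_ELECTION, RECHECK_ELECTABILITY }

    private int consecutiveMismatches = 0;
    private boolean quitOnPreviousCallback = false;

    /**
     * Called once per healthcheck callback.
     * expectedState: the state the failover controller believes the target is in.
     * reportedState: the state the target reports in this callback.
     */
    Action onHealthCallback(String expectedState, String reportedState) {
        if (quitOnPreviousCallback) {
            // Callback following the quit: re-evaluate whether to rejoin the election.
            quitOnPreviousCallback = false;
            consecutiveMismatches = 0;
            return Action.RECHECK_ELECTABILITY;
        }
        if (expectedState.equals(reportedState)) {
            // States agree, so any earlier mismatch was transient.
            consecutiveMismatches = 0;
            return Action.NONE;
        }
        consecutiveMismatches++;
        if (consecutiveMismatches >= 2) {
            // Second mismatch in a row: unlikely to be a parallel transition.
            quitOnPreviousCallback = true;
            return Action.QUIT_ELECTION;
        }
        // First mismatch: wait for one more callback before acting.
        return Action.NONE;
    }
}
```

With this sketch, three callbacks that each see expected ACTIVE but reported STANDBY yield NONE, QUIT_ELECTION, and RECHECK_ELECTABILITY in that order, while a single mismatch followed by a matching callback resets the counter.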

> Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the target's
status changing sometimes
> --------------------------------------------------------------------------------------------------------------
>                 Key: HDFS-6827
>                 URL: https://issues.apache.org/jira/browse/HDFS-6827
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.1
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>            Priority: Critical
>         Attachments: HDFS-6827.1.patch
> In our production cluster, we encountered the following scenario: the ANN crashed due to a write
journal timeout and was restarted automatically by the watchdog, but after the restart both
of the NNs were standby.
> Following are the logs of the scenario:
> # NN1 is down due to write journal timeout:
> {color:red}2014-08-03,23:02:02,219{color} INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> # ZKFC1 detected "connection reset by peer"
> {color:red}2014-08-03,23:02:02,560{color} ERROR org.apache.hadoop.security.UserGroupInformation:
PriviledgedActionException as:xx@xx.HADOOP (auth:KERBEROS) cause:java.io.IOException: {color:red}Connection
reset by peer{color}
> # NN1 was restarted successfully by the watchdog:
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server
up at: xx:13201
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
> {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 13200: starting
> 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean thread started!
> 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Registered
DFSClientInformation MBean
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode
up at: xx/xx:13200
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting
services required for standby state
> # ZKFC1 retried the connection and considered NN1 healthy
> {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: xx/xx:13200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
sleepTime=1 SECONDS)
> # ZKFC1 still considered NN1 a healthy Active NN and didn't trigger the failover.
As a result, both NNs were standby.
> The root cause of this bug is that the NN was restarted too quickly and the ZKFC health monitor
didn't notice the restart.

This message was sent by Atlassian JIRA
