hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7714) Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully with only one NameNode.
Date Fri, 30 Jan 2015 21:05:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299223#comment-14299223 ]

Chris Nauroth commented on HDFS-7714:
-------------------------------------

Here are more details on what I've observed.  The main {{BPServiceActor#run}} loop was still
active for one NameNode, but the actor for the other NameNode had reported the fatal "Initialization
failed" error from this part of the code:

{code}
      while (true) {
        // init stuff
        try {
          // setup storage
          connectToNNAndHandshake();
          break;
        } catch (IOException ioe) {
          // Initial handshake, storage recovery or registration failed
          runningState = RunningState.INIT_FAILED;
          if (shouldRetryInit()) {
            // Retry until all namenode's of BPOS failed initialization
            LOG.error("Initialization failed for " + this + " "
                + ioe.getLocalizedMessage());
            sleepAndLogInterrupts(5000, "initializing");
          } else {
            runningState = RunningState.FAILED;
            LOG.fatal("Initialization failed for " + this + ". Exiting. ", ioe);
            return;
          }
        }
      }
{code}

The {{ioe}} was an {{EOFException}} while trying the {{registerDatanode}} RPC.  Lining up
timestamps from NN and DN logs, I could see that the NN had restarted at the same time, causing
it to abandon this RPC connection, ultimately triggering the {{EOFException}} on the DataNode
side.
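
To make that failure mode concrete, here is a toy, self-contained snippet (plain sockets, not
the Hadoop IPC client; the class and names are made up for illustration) showing why a peer
that accepts a connection and then goes away before answering surfaces as an {{EOFException}}
on the caller's blocking read:

{code}
import java.io.DataInputStream;
import java.io.EOFException;
import java.net.ServerSocket;
import java.net.Socket;

public class EofOnAbandonedConnection {
  public static void main(String[] args) throws Exception {
    ServerSocket server = new ServerSocket(0);

    // "NameNode": accept the connection, then go away without writing a response,
    // the same way a restarting NN abandons an in-flight registerDatanode call.
    Thread nn = new Thread(() -> {
      try (Socket accepted = server.accept()) {
        // close immediately, no response written
      } catch (Exception ignored) {
      }
    });
    nn.start();

    // "DataNode": connect and block waiting for a response that never arrives.
    try (Socket client = new Socket("localhost", server.getLocalPort());
         DataInputStream in = new DataInputStream(client.getInputStream())) {
      in.readInt();
    } catch (EOFException eofe) {
      System.out.println("registration attempt would see: " + eofe);
    } finally {
      server.close();
    }
  }
}
{code}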

Most importantly, the fact that it was on the code path with the fatal-level logging means
that it would never reattempt registration with this NameNode.  {{shouldRetryInit()}} must
have returned {{false}}.  {{BPOfferService#shouldRetryInit}} is implemented so that it only
allows a retry if the other NameNode has already registered successfully:

{code}
  /*
   * Let the actor retry for initialization until all namenodes of cluster have
   * failed.
   */
  boolean shouldRetryInit() {
    if (hasBlockPoolId()) {
      // One of the namenode registered successfully. lets continue retry for
      // other.
      return true;
    }
    return isAlive();
  }
{code}

Tying that all together, this bug happens when the first attempted NameNode registration fails
but the second succeeds.  The DataNode process remains running, but with only one live {{BPServiceActor}}.
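
For what it's worth, the unlucky interleaving is easy to model outside of Hadoop.  The toy
program below (names and structure are mine, not the real {{BPOfferService}}, and it deliberately
ignores {{isAlive()}} and real RPC) uses a latch to force the failing actor to make its retry
decision before the other actor has negotiated a block pool ID, which is exactly the ordering
that leaves the DataNode half-alive:

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the race described above; illustrative only. */
public class HalfAliveRace {
  static final AtomicReference<String> blockPoolId = new AtomicReference<>();

  // Simplified stand-in for shouldRetryInit(): retry only if some actor
  // has already registered and learned the block pool ID.
  static boolean shouldRetryInit() {
    return blockPoolId.get() != null;
  }

  public static void main(String[] args) throws Exception {
    CountDownLatch actorOneDecided = new CountDownLatch(1);

    Thread actorOne = new Thread(() -> {
      // First handshake fails (the simulated EOFException from the restarting NN).
      if (shouldRetryInit()) {
        System.out.println("actor 1: would retry registration");
      } else {
        System.out.println("actor 1: fatal, exiting actor thread");
      }
      actorOneDecided.countDown();
    });

    Thread actorTwo = new Thread(() -> {
      try {
        actorOneDecided.await();     // force the bad ordering deterministically
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      blockPoolId.set("BP-1");       // this registration succeeds
      System.out.println("actor 2: registered, heartbeating normally");
    });

    actorOne.start();
    actorTwo.start();
    actorOne.join();
    actorTwo.join();
    System.out.println("DataNode is now half-alive: one actor gone, one running");
  }
}
{code}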

HDFS-2882 had a lot of discussion of DataNode startup failure scenarios.  I think the summary
of that discussion is that the DataNode should in general retry its NameNode registrations,
but it should abort right away if there is no possibility of registration ever succeeding
(i.e., a misconfiguration or a hardware failure).  I think the change we need here is to keep
retrying the {{registerDatanode}} RPC when the failure is NameNode downtime or a transient
connectivity problem.  Other failure causes should still abort.
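
As a rough sketch (not a patch, and not existing Hadoop code), the {{IOException}} caught in
the init loop quoted above could be classified along these lines; failures that look like
downtime or a transient network problem are worth retrying, while anything else, e.g. a
misconfiguration detected during the handshake, should still abort the actor:

{code}
import java.io.EOFException;
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

/** Illustrative classifier only; names and class are hypothetical. */
final class TransientInitFailures {
  static boolean isRetryable(IOException ioe) {
    return ioe instanceof EOFException            // NN dropped the connection mid-RPC
        || ioe instanceof ConnectException        // NN not accepting connections yet
        || ioe instanceof NoRouteToHostException  // host briefly unreachable
        || ioe instanceof SocketTimeoutException; // slow or restarting NN
  }
}
{code}

With something along those lines, the catch block in the first snippet could retry when
{{shouldRetryInit() || TransientInitFailures.isRetryable(ioe)}} instead of falling through to
the fatal log and {{return}}.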


> Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully with only one NameNode.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7714
>                 URL: https://issues.apache.org/jira/browse/HDFS-7714
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Chris Nauroth
>
> In an HA deployment, DataNodes must register with both NameNodes and send periodic heartbeats
> and block reports to both.  However, if NameNodes and DataNodes are restarted simultaneously,
> then this can trigger a race condition in registration.  The end result is that the {{BPServiceActor}}
> for one NameNode terminates, but the {{BPServiceActor}} for the other NameNode remains alive.
> The DataNode process is then in a "half-alive" state where it only heartbeats and sends block
> reports to one of the NameNodes.  This could cause a loss of storage capacity after an HA
> failover.  The DataNode process would have to be restarted to resolve this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
