hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7714) Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully with only one NameNode.
Date Fri, 30 Jan 2015 23:03:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299338#comment-14299338
] 

Kihwal Lee commented on HDFS-7714:
----------------------------------

On a related note, I've seen similar symptoms when the two namenodes' ctimes in their storage
are different. After a datanode registers with one nn, it won't be able to register with the
other, which causes the actor thread to die. Depending on which namenode each datanode talks to
first, the datanodes will be divided into two sets, each talking to only one namenode, thus
creating a split-brain situation.  Of course, running two namenodes with different storage
versions is a mistake, but I've seen people make this kind of mistake multiple times. Whenever it
happened, I wished for a way to start the actor thread back up. The refreshNamenodes dfsadmin
command does not work for HA configurations.
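
The split described above can be illustrated with a toy model. This is only a sketch of the
behaviour as described in this comment, not Hadoop code: the function name register_all, the
ctime values, and the random contact order are all assumptions made for illustration.

```python
import random


def register_all(datanodes, namenode_ctimes, seed=0):
    """Toy model (hypothetical, not Hadoop's implementation).

    Each datanode contacts the namenodes in a random order. The first
    registration fixes the ctime the datanode accepts; a namenode that
    reports a different ctime is rejected and its actor is treated as dead.
    Returns a dict mapping each datanode to the namenodes it still talks to.
    """
    rng = random.Random(seed)
    live_actors = {}
    for dn in datanodes:
        order = list(namenode_ctimes)
        rng.shuffle(order)  # models "whichever namenode it talks to first"
        accepted_ctime = None
        live = []
        for nn in order:
            ctime = namenode_ctimes[nn]
            if accepted_ctime is None:
                accepted_ctime = ctime  # first successful registration
            if ctime == accepted_ctime:
                live.append(nn)
            # otherwise the actor for this namenode terminates
        live_actors[dn] = sorted(live)
    return live_actors


if __name__ == "__main__":
    # Mismatched ctimes: every datanode ends up with exactly one namenode.
    print(register_all(["dn1", "dn2", "dn3", "dn4"], {"nn1": 100, "nn2": 200}))
    # Matching ctimes: every datanode keeps both namenodes.
    print(register_all(["dn1", "dn2"], {"nn1": 100, "nn2": 100}))
```

In this model, which namenode a given datanode keeps depends only on the contact order, which
is why the datanodes end up divided into two sets.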

> Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully
> with only one NameNode.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7714
>                 URL: https://issues.apache.org/jira/browse/HDFS-7714
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Chris Nauroth
>
> In an HA deployment, DataNodes must register with both NameNodes and send periodic heartbeats
> and block reports to both.  However, if NameNodes and DataNodes are restarted simultaneously,
> then this can trigger a race condition in registration.  The end result is that the {{BPServiceActor}}
> for one NameNode terminates, but the {{BPServiceActor}} for the other NameNode remains alive.
> The DataNode process is then in a "half-alive" state where it only heartbeats and sends block
> reports to one of the NameNodes.  This could cause a loss of storage capacity after an HA
> failover.  The DataNode process would have to be restarted to resolve this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
