hadoop-hdfs-issues mailing list archives

From "Jitendra Nath Pandey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
Date Fri, 16 Oct 2015 23:41:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961540#comment-14961540 ]

Jitendra Nath Pandey commented on HDFS-9239:

bq. .. Well before node liveness is affected by inundation of IBRs and FBRs, the namenode
performance will degrade to unacceptable level...

  Yes, indeed. But if datanodes are marked as dead in that situation, it completely destabilizes
the system. At that point, even if we kill certain offending jobs, it takes a while before
the NN can come back to an acceptable service level. This proposal should help prevent datanodes
from being marked dead, so that the NN can recover once it is past the overload.

  I think ZKFC health checks should also be separated into a different queue or port so that
they are not blocked by other messages in the NN's call queue. A failover triggered only because
the NN is busy is not very helpful: the other NN also gets busy, and we end up seeing
active-standby flip-flop between the namenodes.
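
  To illustrate the idea of a separate queue, here is a minimal sketch. It is not Hadoop code
and not the design in the attached document: the class `SeparateQueueSketch` and its methods
are hypothetical stand-ins, and two single-threaded executors play the role of the NN's main
call queue and a dedicated health-check queue. It only shows that a liveness call on its own
queue can still be answered while the main queue is stuck.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical toy model, not a Hadoop class.
class SeparateQueueSketch {
    // Stand-in for the NN's main call queue (block reports, client RPCs).
    private final ExecutorService mainQueue = Executors.newSingleThreadExecutor();
    // Stand-in for a dedicated queue or port that serves only liveness checks.
    private final ExecutorService healthQueue = Executors.newSingleThreadExecutor();

    Future<?> submitBlockReport(Runnable work) {
        return mainQueue.submit(work);
    }

    Future<String> submitHealthCheck() {
        return healthQueue.submit(() -> "HEALTHY");
    }

    void shutdown() {
        mainQueue.shutdownNow();
        healthQueue.shutdownNow();
    }

    // Returns true if a health check is answered while the main queue is blocked.
    static boolean healthCheckSurvivesBusyMainQueue() {
        SeparateQueueSketch nn = new SeparateQueueSketch();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Simulate overload: the only main-queue worker is stuck on a long call.
            nn.submitBlockReport(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // The health check runs on its own queue, so it does not wait behind it.
            String status = nn.submitHealthCheck().get(2, TimeUnit.SECONDS);
            return "HEALTHY".equals(status);
        } catch (Exception e) {
            return false;
        } finally {
            release.countDown();
            nn.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("health check answered while busy: "
                + healthCheckSurvivesBusyMainQueue());
    }
}
```

  If both kinds of calls shared `mainQueue`, the health check would sit behind the stuck call
and time out, which is the situation that leads to an unnecessary failover.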

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
> -----------------------------------------------------------------------------------
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: DataNode-Lifeline-Protocol.pdf
> This issue proposes introduction of a new feature: the DataNode Lifeline Protocol.  This
> is an RPC protocol that is responsible for reporting liveness and basic health information
> about a DataNode to a NameNode.  Compared to the existing heartbeat messages, it is lightweight
> and not prone to resource contention problems that can harm accurate tracking of DataNode
> liveness currently.  The attached design document contains more details.

This message was sent by Atlassian JIRA
