hadoop-hdfs-issues mailing list archives

From "Arpit Agarwal (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-9305) Delayed heartbeat processing causes storm of subsequent heartbeats
Date Mon, 26 Oct 2015 04:43:27 GMT
Arpit Agarwal created HDFS-9305:

             Summary: Delayed heartbeat processing causes storm of subsequent heartbeats
                 Key: HDFS-9305
                 URL: https://issues.apache.org/jira/browse/HDFS-9305
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.7.1
            Reporter: Arpit Agarwal
            Assignee: Arpit Agarwal

A DataNode typically sends a heartbeat to the NameNode every 3 seconds.  We expect heartbeat
handling to complete relatively quickly.  However, if something unexpected causes heartbeat
processing to get blocked, such as a long GC or heavy lock contention within the NameNode,
then heartbeat processing would be delayed.  After recovering from this delay, the DataNode
then starts sending a storm of heartbeat messages in a tight loop.  In a large cluster with
many DataNodes, this storm of heartbeat messages could cause harmful load on the NameNode
and make overall cluster recovery more difficult.

The bug appears to be caused by incorrect timekeeping inside {{BPServiceActor}}.  The next
heartbeat time is always calculated as a delta from the previous heartbeat time, without any
compensation for possible long latency on an individual heartbeat RPC.  The only mitigation
would be restarting all DataNodes to force a reset of the heartbeat schedule, or simply waiting
out the storm until the scheduling catches up and corrects itself.
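
The scheduling behavior described above can be illustrated with a small, self-contained model.
This is only a sketch, not the actual {{BPServiceActor}} code: the class name, method names, and
the assumption that each catch-up heartbeat completes instantly are all hypothetical and chosen
for illustration.  It compares scheduling the next heartbeat as a delta from the previous
scheduled time against one possible compensation, scheduling it as a delta from the current time.

```java
// Hypothetical model of heartbeat scheduling; not taken from the Hadoop source.
class HeartbeatScheduleSketch {
    // Heartbeat interval of 3 seconds, as described in the issue.
    static final long INTERVAL_MS = 3_000;

    // Uncompensated: next time is a delta from the previous scheduled time.
    static long nextFromPrevious(long previousScheduledMs) {
        return previousScheduledMs + INTERVAL_MS;
    }

    // One possible compensation (an assumption): delta from the current time.
    static long nextFromNow(long nowMs) {
        return nowMs + INTERVAL_MS;
    }

    // Counts heartbeats sent back-to-back after one heartbeat RPC, issued at
    // t=0, stalls for stallMs.  Simplification: catch-up heartbeats take 0 ms.
    static int backToBackHeartbeats(long stallMs, boolean compensate) {
        long scheduled = 0;      // the stalled heartbeat was due at t=0
        long now = stallMs;      // its RPC only returns after the stall
        scheduled = compensate ? nextFromNow(now) : nextFromPrevious(scheduled);

        int count = 0;
        while (scheduled <= now) {
            // The next heartbeat is already overdue, so it is sent immediately.
            count++;
            scheduled = compensate ? nextFromNow(now) : nextFromPrevious(scheduled);
        }
        return count;
    }

    public static void main(String[] args) {
        // A 30-second stall with the uncompensated schedule.
        System.out.println(backToBackHeartbeats(30_000, false)); // prints 10
        // The same stall when the schedule is based on the current time.
        System.out.println(backToBackHeartbeats(30_000, true));  // prints 0
    }
}
```

In this model, the number of immediate heartbeats per DataNode grows linearly with the length of
the stall, which is consistent with the storm described above when multiplied across a large cluster.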

This problem would not manifest after a NameNode restart.  In that case, the NameNode would
respond to the first heartbeat by telling the DataNode to re-register, and {{BPServiceActor#reRegister}}
would reset the heartbeat schedule to the current time.  I believe the problem would only
manifest if the NameNode process stayed alive but processed heartbeats unexpectedly slowly.
