hadoop-hdfs-dev mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-7150) MissingBlocks > 0 when all replicas are on decomm-in-progress nodes
Date Fri, 26 Sep 2014 06:11:33 GMT
Ming Ma created HDFS-7150:

             Summary: MissingBlocks > 0 when all replicas are on decomm-in-progress nodes
                 Key: HDFS-7150
                 URL: https://issues.apache.org/jira/browse/HDFS-7150
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Ming Ma

Our clusters recently raised a false alert: the NN metric MissingBlocks was > 0 while all
replicas of the affected blocks were on decomm-in-progress nodes. Normally, blocks whose replicas
are only on decomm-in-progress nodes are not counted as missing. It turns out this can happen
if decomm-in-progress nodes lose their heartbeat and then reconnect to the NN. The scenario
is the following.

1. Kick off decomm on several nodes across different racks.
2. NN loses heartbeats from 3 decomm-in-progress nodes around the same time. BM's neededReplications
is updated as part of the BM.removeStoredBlock process. If block A's 3 replicas happen to
be on these 3 nodes, block A is moved to BM's neededReplications.QUEUE_WITH_CORRUPT_BLOCKS
queue. At this point, block A is counted as missing.
3. These 3 nodes reconnect to the NN. However, block A remains in BM's neededReplications.QUEUE_WITH_CORRUPT_BLOCKS
queue until it has been replicated to other live nodes.
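
The three steps above can be replayed with a toy model. This is only a sketch for illustration, not the Hadoop code: ToyBlockManager, its methods and the node names dn1..dn3 are hypothetical, and only two of the neededReplications queues are modelled. It assumes, as described above, that a reconnect restores the replica location but does not re-evaluate the block's queue.

```python
from dataclasses import dataclass, field

# Only two queues are modelled; the names mirror the ones used in this report.
QUEUE_HIGHEST_PRIORITY = "QUEUE_HIGHEST_PRIORITY"
QUEUE_WITH_CORRUPT_BLOCKS = "QUEUE_WITH_CORRUPT_BLOCKS"


@dataclass
class ToyBlockManager:
    """Hypothetical, heavily simplified stand-in for BM."""
    # block id -> decomm-in-progress nodes currently known to hold a replica
    replicas: dict = field(default_factory=dict)
    # block id -> name of the neededReplications queue the block sits in
    needed_replications: dict = field(default_factory=dict)

    def add_replica(self, block, node):
        self.replicas.setdefault(block, set()).add(node)
        # Replicas exist only on decomm-in-progress nodes: the block needs
        # replication but is not missing.
        self.needed_replications.setdefault(block, QUEUE_HIGHEST_PRIORITY)

    def remove_stored_block(self, block, node):
        # Step 2: heartbeat lost, the replica location is dropped.
        self.replicas[block].discard(node)
        if not self.replicas[block]:
            self.needed_replications[block] = QUEUE_WITH_CORRUPT_BLOCKS

    def node_reconnects(self, block, node):
        # Step 3 (assumed behaviour): the replica is known again, but the
        # block's queue is left untouched.
        self.replicas[block].add(node)

    def missing_blocks(self):
        return sum(1 for queue in self.needed_replications.values()
                   if queue == QUEUE_WITH_CORRUPT_BLOCKS)


def simulate():
    """Return MissingBlocks after each of the three steps."""
    bm = ToyBlockManager()
    nodes = ["dn1", "dn2", "dn3"]
    counts = []
    for node in nodes:                      # step 1: decomm in progress
        bm.add_replica("A", node)
    counts.append(bm.missing_blocks())
    for node in nodes:                      # step 2: heartbeats lost
        bm.remove_stored_block("A", node)
    counts.append(bm.missing_blocks())
    for node in nodes:                      # step 3: nodes reconnect
        bm.node_reconnects("A", node)
    counts.append(bm.missing_blocks())
    return counts
```

In this model simulate() reports a non-zero count after step 3 even though all three replicas are reachable again, which is the false alert.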

HDFS-7128 will mitigate the issue by making decommission faster, but it is better to fix
the correctness issue itself. When decomm-in-progress nodes reconnect to the NN, their blocks
should be moved out of BM's neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. This will also
give replication of these blocks higher priority.
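
One way to picture the proposed behaviour is to recompute a block's queue whenever a decomm-in-progress node reports it again. The sketch below is an assumption about the shape of such a fix, not the actual patch: choose_queue, on_decomm_node_reconnect, the default expected replica count of 3 and the three-queue model are all hypothetical simplifications.

```python
from typing import Dict, Optional

# Simplified queue names; the real NN has more queues than these three.
QUEUE_HIGHEST_PRIORITY = "QUEUE_HIGHEST_PRIORITY"
QUEUE_UNDER_REPLICATED = "QUEUE_UNDER_REPLICATED"
QUEUE_WITH_CORRUPT_BLOCKS = "QUEUE_WITH_CORRUPT_BLOCKS"


def choose_queue(live: int, decomm: int, expected: int) -> Optional[str]:
    """Pick a queue from replica counts (hypothetical, simplified policy)."""
    if live >= expected:
        return None                        # fully replicated, nothing to do
    if live == 0 and decomm == 0:
        return QUEUE_WITH_CORRUPT_BLOCKS   # no replica anywhere: missing
    if live == 0:
        return QUEUE_HIGHEST_PRIORITY      # only on decomm-in-progress nodes
    return QUEUE_UNDER_REPLICATED


def on_decomm_node_reconnect(needed: Dict[str, str], block: str,
                             live: int, decomm: int,
                             expected: int = 3) -> Dict[str, str]:
    """Re-evaluate the block's queue once its replicas are reported again."""
    new_queue = choose_queue(live, decomm, expected)
    if new_queue is None:
        needed.pop(block, None)
    else:
        needed[block] = new_queue
    return needed
```

For block A from the scenario, calling on_decomm_node_reconnect(needed, "A", live=0, decomm=3) moves the block out of QUEUE_WITH_CORRUPT_BLOCKS, so it would no longer count as missing and would be replicated with high priority.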
