hadoop-hdfs-issues mailing list archives

From "Tsz Wo Nicholas Sze (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails
Date Wed, 15 Oct 2014 21:05:33 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172940#comment-14172940 ]

Tsz Wo Nicholas Sze commented on HDFS-7208:

> The latest patch addresses all your comments, except for the allAlive one. The reason is that the patch handles the dead node separately from the failedStorage.

We need to change allAlive. Otherwise, the while loop won't work if there is only a failed storage. Of course, we also need to update the if-condition for the dead datanode. Here is my suggested change:
{noformat}
    while (!allAlive) {
      allAlive = dead == null && failedStorage == null;
      if (dead != null) {
{noformat}

We should also call namesystem.checkSafeMode() in removeBlocksAssociatedTo(..).
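To make the termination condition concrete, here is a minimal, runnable sketch of the suggested loop shape. The queues, counters, and String ids are hypothetical stand-ins, not the actual HeartbeatManager code (which operates on DatanodeDescriptor and DatanodeStorageInfo); the point is that the loop now only stops when neither a dead node nor a failed storage remains, so a failed storage alone still drives another pass:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class HeartbeatCheckSketch {
    // Hypothetical stand-ins: failures found by each scan pass
    // (at most one of each kind per pass, as in the real loop).
    static final Queue<String> deadNodes = new ArrayDeque<>();
    static final Queue<String> failedStorages = new ArrayDeque<>();

    static int removedNodes = 0;     // stands in for removing the datanode
    static int removedStorages = 0;  // stands in for removeBlocksAssociatedTo(..)

    static void heartbeatCheck() {
        boolean allAlive = false;
        while (!allAlive) {
            String dead = deadNodes.poll();
            String failedStorage = failedStorages.poll();

            // Key change: terminate only when there is neither a dead node
            // NOR a failed storage left; checking dead alone would exit
            // before a lone failed storage is processed.
            allAlive = dead == null && failedStorage == null;

            if (dead != null) {
                removedNodes++;      // remove the datanode and its replicas
            }
            if (failedStorage != null) {
                removedStorages++;   // remove replicas on the failed storage
            }
        }
    }

    public static void main(String[] args) {
        // Only a failed storage, no dead node: the revised condition still
        // runs a pass for it instead of exiting immediately.
        failedStorages.add("dn1:/data/disk6");
        heartbeatCheck();
        System.out.println(removedNodes + " dead, "
            + removedStorages + " failed storage(s) handled");
    }
}
```

With the original condition (dead-only), the example above would exit without ever touching the failed storage; with the combined condition it processes it on the first pass and exits on the second.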

> NN doesn't schedule replication when a DN storage fails
> -------------------------------------------------------
>                 Key: HDFS-7208
>                 URL: https://issues.apache.org/jira/browse/HDFS-7208
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7208-2.patch, HDFS-7208.patch
> We found the following problem. When a storage device on a DN fails, the NN continues
> to believe the replicas of the blocks on that storage are valid and doesn't schedule replication.
> A DN has 12 storage disks, so there is one blockReport for each storage. When a disk
> fails, the number of blockReports from that DN is reduced from 12 to 11. Given that
> dfs.datanode.failed.volumes.tolerated is configured to be > 0, the NN still considers that DN healthy.
> 1. A disk failed. All blocks on that disk are removed from the DN's dataset.
> {noformat}
> 2014-10-04 02:11:12,626 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume /data/disk6/dfs/current
> {noformat}
> 2. The NN receives DatanodeProtocol.DISK_ERROR, but that isn't enough for the NN to remove
> the DN and its replicas from the BlocksMap. In addition, blockReport doesn't provide the
> diff, given that it is done per storage.
> {noformat}
> 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Disk error on DatanodeRegistration(xx.xx.xx.xxx, datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939): DataNode failed volumes:/data/disk6/dfs/current
> {noformat}
> 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica.

This message was sent by Atlassian JIRA
