hadoop-hdfs-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails
Date Wed, 08 Oct 2014 06:48:33 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163169#comment-14163169 ]

Ming Ma commented on HDFS-7208:
-------------------------------

We can work around this by setting dfs.datanode.failed.volumes.tolerated to zero, so that
a single disk failure is enough for NN to remove that DN. For the fix, there are several
possible approaches.

1. Have DN notify NN via DatanodeProtocol.reportBadBlocks for these blocks.
2. Modify DatanodeProtocol.errorReport so that DN can pass the storage ID to NN.
3. Have DN send a blockReport for the failed storage so that NN can detect the failure.

Appreciate any suggestions.
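
To make the problem and approach 3 concrete, here is a minimal toy model, not HDFS code. It assumes NN keeps one set of replicas per storage and that a blockReport for a storage simply replaces that storage's set. The class and method names (StorageFailureSketch, ToyNameNode, processReport, believedReplicas) and the storage names disk5/disk6 are made up for illustration, and treating the report for the failed storage as an empty report is only one possible reading of approach 3.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class StorageFailureSketch {

    /** Toy stand-in for NN's per-storage replica bookkeeping (hypothetical, not HDFS code). */
    static class ToyNameNode {
        // storage ID -> block IDs NN believes are stored on that storage
        private final Map<String, Set<Long>> replicasByStorage = new HashMap<>();

        /** Assumed semantics: a per-storage report replaces what NN knew about that storage only. */
        void processReport(String storageId, long... blockIds) {
            Set<Long> reported = new HashSet<>();
            for (long id : blockIds) {
                reported.add(id);
            }
            replicasByStorage.put(storageId, reported);
        }

        /** Number of replicas of blockId that NN currently believes exist on this DN. */
        int believedReplicas(long blockId) {
            int count = 0;
            for (Set<Long> blocks : replicasByStorage.values()) {
                if (blocks.contains(blockId)) {
                    count++;
                }
            }
            return count;
        }
    }

    /** The failed storage just stops reporting, as described in the issue. */
    static int silentFailure() {
        ToyNameNode nn = new ToyNameNode();
        nn.processReport("disk5", 1L, 2L);
        nn.processReport("disk6", 3L);
        // disk6 fails: the next round contains a report for disk5 only.
        nn.processReport("disk5", 1L, 2L);
        return nn.believedReplicas(3L);
    }

    /** DN sends one more (empty) report for the failed storage, in the spirit of approach 3. */
    static int reportedFailure() {
        ToyNameNode nn = new ToyNameNode();
        nn.processReport("disk5", 1L, 2L);
        nn.processReport("disk6", 3L);
        // disk6 fails: DN reports the surviving storage and an empty report for disk6.
        nn.processReport("disk5", 1L, 2L);
        nn.processReport("disk6");
        return nn.believedReplicas(3L);
    }

    public static void main(String[] args) {
        System.out.println("silent failure, believed replicas of block 3:   " + silentFailure());
        System.out.println("reported failure, believed replicas of block 3: " + reportedFailure());
    }
}
```

In this model the stale replica survives only because nothing ever overwrites the entry for the failed storage; any of the three approaches amounts to giving NN an explicit signal for that storage.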

> NN doesn't schedule replication when a DN storage fails
> -------------------------------------------------------
>
>                 Key: HDFS-7208
>                 URL: https://issues.apache.org/jira/browse/HDFS-7208
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>
> We found the following problem. When a storage device on a DN fails, NN continues to believe the replicas on that storage are valid and doesn't schedule replication.
> A DN has 12 storage disks, so there is one blockReport for each storage. When a disk fails, the number of blockReports from that DN drops from 12 to 11. Because dfs.datanode.failed.volumes.tolerated is configured to be > 0, NN still considers that DN healthy.
> 1. A disk failed. All blocks of that disk are removed from DN dataset.
>  
> {noformat}
> 2014-10-04 02:11:12,626 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume /data/disk6/dfs/current
> {noformat}
> 2. NN receives DatanodeProtocol.DISK_ERROR, but that isn't enough to make NN remove the DN and its replicas from the BlocksMap. In addition, blockReport doesn't provide the diff, given that it is done per storage.
> {noformat}
> 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Disk error on DatanodeRegistration(xx.xx.xx.xxx, datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939): DataNode failed volumes:/data/disk6/dfs/current
> {noformat}
> 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
