hadoop-hdfs-issues mailing list archives

From "Bharath Mundlapudi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1848) Datanodes should shutdown when a critical volume fails
Date Thu, 21 Apr 2011 20:35:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022905#comment-13022905 ]

Bharath Mundlapudi commented on HDFS-1848:

Thanks Eli for explaining the use case. I briefly talked to Koji about this Jira.

Some more thoughts on this. 

1. If fs.data.dir.critical is not defined, then the implementation should fall back to the
existing behavior of tolerating volume failures.

2. If fs.data.dir.critical is defined, then fail-fast and fail-stop, as you described.
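
The two cases above could be sketched roughly as follows. This is only an illustration, not the
actual DataNode code: the class and method names are hypothetical, fs.data.dir.critical is the
key proposed in this thread, and the threshold parameter stands in for the existing
volume-failure tolerance (HDFS-1161). It also assumes that non-critical volumes stay subject to
that threshold even when critical volumes are configured.

```java
import java.util.Set;

/** Hypothetical sketch of the proposed policy; not taken from the Hadoop source. */
public final class VolumeFailurePolicy {
    private VolumeFailurePolicy() {}

    /**
     * @param criticalVolumes   volumes listed under the proposed fs.data.dir.critical key
     *                          (empty when the key is not defined)
     * @param failedVolumes     volumes currently detected as failed
     * @param toleratedFailures existing threshold of tolerated volume failures
     * @return true if the datanode should shut down
     */
    public static boolean shouldShutdown(Set<String> criticalVolumes,
                                         Set<String> failedVolumes,
                                         int toleratedFailures) {
        // Case 2: any failed critical volume means fail-fast, regardless of the threshold.
        for (String volume : failedVolumes) {
            if (criticalVolumes.contains(volume)) {
                return true;
            }
        }
        // Case 1 (and assumed fallback for non-critical volumes): existing threshold behavior.
        return failedVolumes.size() > toleratedFailures;
    }
}
```

With an empty critical set this reduces to the existing threshold check, which is the fallback
described in case 1.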

Case 2 you mentioned is interesting too. Today, the datanode is not aware of this case, since the
volume may not be part of the dfs.data.dir config.

I see the key benefit of this Jira as fail-fast. Meaning, if any of the critical volume(s)
fail, we let the namenode know immediately and the datanode exits. So replication will be
taken care of, and cluster/datanode restarts might see fewer issues with missing blocks.

W.r.t. case 2 you mentioned, these are the possibilities for failures, right?

1. Data is stored on the root partition disk, say /root/hadoop (binaries, conf, log), /root/data0.
Failures: /root read-only filesystem or failure, /root/data0 read-only filesystem or failure,
complete disk0 failure.

2. Data is NOT stored on the root partition disk: /root (disk1), /data0 (disk2).
Failures: /root read-only filesystem or failure, /data0 (disk2) read-only filesystem or failure.

3. Swap partition failure
How will this be detected?

I am wondering whether the datanode should worry about all of these health issues itself, or
whether a configurable health check script, like the one in TaskTracker, which would let the
datanode know about disk issues, network issues, etc., is a better option?
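
For illustration, here is a minimal sketch of the kind of probe either approach might run
against a directory to catch the read-only or failed filesystem cases listed above. The class
and method names are hypothetical, and this is an assumption about how such a check could look,
not how the datanode or the TaskTracker health script does it. It would not detect a swap
partition failure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical directory probe; not taken from the Hadoop source. */
public final class VolumeProbe {
    private VolumeProbe() {}

    /**
     * Returns true if a temporary file can be created and deleted in dir.
     * A missing directory, a read-only filesystem, or an I/O error yields false.
     */
    public static boolean isWritable(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try {
            // Creating and removing a file exercises the write path of the volume.
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }
}
```

A caller could run this periodically for each configured directory and report any path for
which it returns false.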


> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>                 Key: HDFS-1848
>                 URL: https://issues.apache.org/jira/browse/HDFS-1848
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
> A DN should shutdown when a critical volume (eg the volume that hosts the OS, logs, pid,
> tmp dir etc.) fails. The admin should be able to specify which volumes are critical, eg they
> might specify the volume that lives on the boot disk. A failure in one of these volumes would
> not be subject to the threshold (HDFS-1161) or result in host decommissioning (HDFS-1847)
> as the decommissioning process would likely fail.
