Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Thu, 21 Apr 2011 20:35:05 +0000 (UTC)
From: "Bharath Mundlapudi (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: 
 <1123736417.74483.1303418105875.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <411101893.68583.1303257725837.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-1848) Datanodes should shutdown when a
 critical volume fails
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022905#comment-13022905 ] 

Bharath Mundlapudi commented on HDFS-1848:
------------------------------------------

Thanks, Eli, for explaining the use case. I briefly talked to Koji about this Jira.

Some more thoughts on this. 

1. If fs.data.dir.critical is not defined, then the implementation should fall back to the existing behavior of tolerating volume failures.

2. If fs.data.dir.critical is defined, then fail fast and fail stop, as you described.
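
The two rules above could be sketched roughly as follows. This is only an illustration, not actual DataNode code: the class CriticalVolumePolicy, its Action enum, and the method onVolumeFailure are hypothetical names, the value of fs.data.dir.critical is assumed to be a comma-separated list of directories, and failuresTolerated stands in for the threshold from HDFS-1161.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the proposed policy; not the actual DataNode implementation. */
final class CriticalVolumePolicy {
    enum Action { CONTINUE, SHUTDOWN }

    private final Set<String> criticalDirs = new HashSet<>();
    private final int failuresTolerated;

    /**
     * @param criticalDirsValue assumed comma-separated value of fs.data.dir.critical,
     *                          or null when the key is not defined
     * @param failuresTolerated stand-in for the volume failure threshold (HDFS-1161)
     */
    CriticalVolumePolicy(String criticalDirsValue, int failuresTolerated) {
        if (criticalDirsValue != null) {
            for (String dir : criticalDirsValue.split(",")) {
                String trimmed = dir.trim();
                if (!trimmed.isEmpty()) {
                    criticalDirs.add(trimmed);
                }
            }
        }
        this.failuresTolerated = failuresTolerated;
    }

    /** Decide what to do after failedDir fails, given the total number of failed volumes. */
    Action onVolumeFailure(String failedDir, int failedVolumeCount) {
        // Rule 2: a critical volume failure is fail-fast and not subject to the threshold.
        if (criticalDirs.contains(failedDir)) {
            return Action.SHUTDOWN;
        }
        // Rule 1: otherwise fall back to tolerating failures up to the threshold.
        return failedVolumeCount > failuresTolerated ? Action.SHUTDOWN : Action.CONTINUE;
    }
}
```

When the key is not defined, the set of critical directories is empty, so every failure goes through the threshold check and behavior is unchanged.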

Case 2 you mentioned is interesting too. Today, the datanode is not aware of this case, since the volume may not be part of the dfs.data.dir config.

I see the key benefit of this Jira as fail-fast: if any critical volume fails, we let the namenode know immediately and the datanode exits. Re-replication is then taken care of, and cluster/datanode restarts might see fewer issues with missing blocks.

W.r.t. case 2 you mentioned, these are the possible failures, right?

1. Data is stored on the root partition disk, say /root/hadoop (binaries, conf, log) and /root/data0.
Failures: /root read-only filesystem or failure, /root/data0 read-only filesystem or failure, complete disk0 failure.

2. Data is NOT stored on the root partition disk: /root (disk1), /data0 (disk2).
Failures: /root read-only filesystem or failure, /data0 (disk2) read-only filesystem or failure.

3. Swap partition failure
How will this be detected?

I am wondering whether the datanode should worry about all of these health issues itself, or whether
a configurable health check script, like the one in TaskTracker, that tells the datanode about disk issues,
network issues, etc. is a better option?
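
To make the second option concrete, here is a rough sketch of how a datanode could consume such a script. It assumes the same convention the TaskTracker health check uses, where the script marks the node unhealthy by printing a line that begins with ERROR. The class HealthScriptCheck and its methods are hypothetical names, and a real version would also need a timeout and periodic scheduling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a script-driven health check; not existing DataNode code. */
final class HealthScriptCheck {

    /** Returns true if any output line begins with ERROR (assumed TaskTracker-style convention). */
    static boolean reportsUnhealthy(List<String> outputLines) {
        for (String line : outputLines) {
            if (line.startsWith("ERROR")) {
                return true;
            }
        }
        return false;
    }

    /** Runs the script at scriptPath and collects its combined stdout/stderr lines. */
    static List<String> runScript(String scriptPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(scriptPath)
                .redirectErrorStream(true) // merge stderr into stdout so one reader drains both
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }
}
```

With this split, checks for a read-only root filesystem, swap, or network problems live in the script, and the datanode only has to act on the verdict.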


> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
>                 Key: HDFS-1848
>                 URL: https://issues.apache.org/jira/browse/HDFS-1848
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>
> A DN should shutdown when a critical volume (eg the volume that hosts the OS, logs, pid, tmp dir etc.) fails. The admin should be able to specify which volumes are critical, eg they might specify the volume that lives on the boot disk. A failure in one of these volumes would not be subject to the threshold (HDFS-1161) or result in host decommissioning (HDFS-1847) as the decommissioning process would likely fail.
