Date: Thu, 21 Apr 2011 20:35:05 +0000 (UTC)
From: "Bharath Mundlapudi (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID: <1123736417.74483.1303418105875.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <411101893.68583.1303257725837.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-1848) Datanodes should shutdown when a critical volume fails

[
https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022905#comment-13022905 ]

Bharath Mundlapudi commented on HDFS-1848:
------------------------------------------

Thanks, Eli, for explaining the use case. I briefly talked to Koji about this Jira. Some more thoughts:

1. If fs.data.dir.critical is not defined, the implementation should fall back to the existing behavior of tolerating volume failures.
2. If fs.data.dir.critical is defined, then fail-fast and fail-stop as you described.

Case 2 you mentioned is interesting too. Today, the datanode is not aware of this case, since the volume may not be part of the dfs.data.dir config.

I see that the key benefit of this Jira is fail-fast: if any critical volume fails, we let the namenode know immediately and the datanode exits. Replication is then taken care of, and cluster/datanode restarts might see fewer issues with missing blocks.

W.r.t. case 2 you mentioned, these are the possibilities of failures, right?

1. Data is stored on the root partition disk, say /root/hadoop (binaries, conf, log) and /root/data0.
   Failures: /root read-only filesystem or failure, /root/data0 read-only filesystem or failure, complete disk0 failure.
2. Data is NOT stored on the root partition disk: /root (disk1), /data0 (disk2).
   Failures: /root read-only filesystem or failure, /data0 (disk2) read-only filesystem or failure.
3. Swap partition failure. How will this be detected?

I am wondering whether the datanode should worry about all of these health issues itself, or whether a configuration like the TaskTracker's health check script, which would let the datanode know about disk issues, network issues, etc., is a better option.

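To make the two configuration cases in the comment above concrete, here is a minimal Python sketch of the decision a datanode might take when volumes fail. It is only an illustration, not Hadoop code: the function names are hypothetical, `fs.data.dir.critical` is the key exactly as written in the comment, and `dfs.datanode.failed.volumes.tolerated` is assumed to be the name of the HDFS-1161 threshold. Directory matching is a plain string comparison, which a real implementation would likely need to refine.

```python
# Key name as written in the comment above (proposal only, not an existing setting).
CRITICAL_KEY = "fs.data.dir.critical"
# Assumed name of the tolerated-failures threshold from HDFS-1161.
TOLERATED_KEY = "dfs.datanode.failed.volumes.tolerated"


def parse_dirs(value):
    """Split a comma-separated directory list, ignoring blank entries."""
    return [d.strip() for d in (value or "").split(",") if d.strip()]


def on_volume_failure(conf, failed_dirs):
    """Return 'shutdown' or 'continue' for the given failed directories.

    conf is a plain dict standing in for the datanode configuration.
    """
    critical = set(parse_dirs(conf.get(CRITICAL_KEY)))
    failed = set(failed_dirs)

    # Case 2: critical volumes are configured and at least one failed,
    # so fail fast regardless of the threshold.
    if critical & failed:
        return "shutdown"

    # Case 1 (key undefined) and non-critical failures: fall back to the
    # existing threshold behavior, simplified here to a count comparison.
    tolerated = int(conf.get(TOLERATED_KEY, 0))
    return "shutdown" if len(failed) > tolerated else "continue"
```

With `fs.data.dir.critical` set to `/root/hadoop` and a threshold of 1, a failure of `/root/hadoop` stops the datanode immediately, while a single failure of `/data0` is tolerated.
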
> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
>                 Key: HDFS-1848
>                 URL: https://issues.apache.org/jira/browse/HDFS-1848
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>
> A DN should shutdown when a critical volume (eg the volume that hosts the OS, logs, pid, tmp dir etc.) fails. The admin should be able to specify which volumes are critical, eg they might specify the volume that lives on the boot disk. A failure in one of these volumes would not be subject to the threshold (HDFS-1161) or result in host decommissioning (HDFS-1847) as the decommissioning process would likely fail.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira