Return-Path:
X-Original-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id EF7F4976F
    for ; Mon, 21 Nov 2011 02:28:13 +0000 (UTC)
Received: (qmail 21783 invoked by uid 500); 21 Nov 2011 02:28:13 -0000
Delivered-To: apmail-hadoop-mapreduce-issues-archive@hadoop.apache.org
Received: (qmail 21742 invoked by uid 500); 21 Nov 2011 02:28:13 -0000
Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: mapreduce-issues@hadoop.apache.org
Delivered-To: mailing list mapreduce-issues@hadoop.apache.org
Received: (qmail 21734 invoked by uid 99); 21 Nov 2011 02:28:13 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
    by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 21 Nov 2011 02:28:13 +0000
X-ASF-Spam-Status: No, hits=-2001.2 required=5.0
    tests=ALL_TRUSTED,RP_MATCHES_RCVD
X-Spam-Check-By: apache.org
Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116)
    by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 21 Nov 2011 02:28:12 +0000
Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116])
    by hel.zones.apache.org (Postfix) with ESMTP id 4501393BE4
    for ; Mon, 21 Nov 2011 02:27:52 +0000 (UTC)
Date: Mon, 21 Nov 2011 02:27:52 +0000 (UTC)
From: "Eli Collins (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1018926997.49971.1321842472284.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <276905956.7649.1317306885647.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13153950#comment-13153950 ]

Eli Collins commented on MAPREDUCE-3121:
----------------------------------------

Looks like the change makes similar assumptions to MR1, e.g. the boot disk is either RAIDed or we're using a health checker script to stop the services if the boot disk fails. Worth mentioning this in the docs.

I think it would make more sense to name the classes LocalDir* instead of Disk*, since we're checking local dirs and not disks. For example, we only check the given dirs, so if there's a failure on another sector of the disk it won't be noticed. The NM won't handle boot disk failures even if it detects a failure on a dir hosted on the boot disk, because it's dir-centric (i.e. it doesn't know that the disk has failed, just that a dir has). Similarly, the local dirs and log dirs may of course reside on the same disk, so if we were checking disks we wouldn't need to check them independently. The DN calls this "volume checking" for the same rationale; something similar here would make sense as well. I'd call it LocalDirChecker and have it live in common next to LocalDirAllocator. This way HDFS could re-use the code.

5% seems pretty low. How did you arrive at that? Are you sure you want a 12 disk host with only 1 working disk to keep running?

> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 3121.v2.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact of transient/permanent disk failures on containers.
> With a larger number of disks per node, the ability to continue to run containers on other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira