Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Mon, 21 Nov 2011 02:27:52 +0000 (UTC)
From: "Eli Collins (Commented) (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: 
 <1018926997.49971.1321842472284.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <276905956.7649.1317306885647.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3121) NodeManager should handle
 disk-failures
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13153950#comment-13153950 ] 

Eli Collins commented on MAPREDUCE-3121:
----------------------------------------

Looks like the change makes the same assumptions as MR1, e.g. that the boot disk is either RAIDed or that a health checker script stops the services if the boot disk fails. Worth mentioning this in the docs.

I think it would make more sense to name the classes LocalDir* instead of Disk*, since we're checking local dirs and not disks. For example, we only check the given dirs, so the checker won't notice a failure on another sector of the disk. The NM won't handle boot disk failures even if it detects a failure on a dir hosted on the boot disk, because it's dir-centric (i.e. it doesn't know that the disk has failed, just that a dir has). Similarly, the local dirs and log dirs may of course reside on the same disk, so if we were checking disks we wouldn't need to check them independently. The DN calls this "volume checking" for the same reason; something similar here would make sense as well. I'd call it LocalDirChecker and have it live in common next to LocalDirAllocator, so that HDFS could re-use the code.
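
To make the dir-centric point concrete, here is a minimal sketch of what such a check amounts to. It is not the code in the patch: the class and method names are hypothetical and it uses only plain JDK calls. It shows that the check looks at each configured dir and nothing else, so it can report that a dir has failed but has no notion of which disk the dir lives on.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a dir-level check; not the class from the patch. */
class LocalDirCheckSketch {

    /** A dir counts as usable if it exists, is a directory, and is readable, writable and traversable. */
    static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /**
     * Returns the configured dirs that fail the check.
     * Only the given dirs are examined; two dirs on the same disk are still checked separately.
     */
    static List<File> failedDirs(List<File> configuredDirs) {
        List<File> failed = new ArrayList<>();
        for (File dir : configuredDirs) {
            if (!isUsable(dir)) {
                failed.add(dir);
            }
        }
        return failed;
    }
}
```

Passing both the local dirs and the log dirs through failedDirs treats them independently, which is exactly why a Disk* name would be misleading here.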

5% seems pretty low. How did you arrive at that? Are you sure you want a 12-disk host with only 1 working disk to keep running?
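
To spell out the arithmetic behind that question: assuming the 5% is the minimum fraction of healthy dirs a node needs in order to stay in service (my reading of the patch; the names below are hypothetical), a host with one dir per disk and 12 disks still passes with a single working disk, since 1/12 is roughly 8.3%. A rough sketch:

```java
/** Hypothetical sketch of a minimum-healthy-fraction test; not the code from the patch. */
class MinHealthyDirsSketch {

    /**
     * Returns true if the fraction of healthy dirs meets the minimum.
     * minHealthyFraction is an assumed parameter, e.g. 0.05 for 5%.
     */
    static boolean nodeIsHealthy(int healthyDirs, int totalDirs, double minHealthyFraction) {
        if (totalDirs <= 0) {
            return false; // no dirs configured: nothing to run on
        }
        return (double) healthyDirs / totalDirs >= minHealthyFraction;
    }
}
```

With a minimum of 0.05, nodeIsHealthy(1, 12, 0.05) is true and the node only drops out once every dir has failed. A higher minimum such as 0.25 (picked here purely for illustration) would already take the node out of service at 2 of 12.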
                
> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 3121.v2.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact of transient/permanent disk failures on containers. With a larger number of disks per node, the ability to continue to run containers on other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira