hadoop-mapreduce-issues mailing list archives

From "Eli Collins (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures
Date Mon, 21 Nov 2011 02:27:52 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13153950#comment-13153950 ]

Eli Collins commented on MAPREDUCE-3121:

Looks like the change makes the same assumptions as MR1, e.g. that the boot disk is either RAIDed or
that a health checker script stops the services if the boot disk fails. This is worth mentioning
in the docs.

I think it would make more sense to name the classes LocalDir* instead of Disk*, since we're
checking local dirs and not disks. For example, we only check the given dirs, so a failure
on another sector of the disk goes unnoticed. The NM won't handle boot disk failures
even if it detects a failure on a dir hosted on the boot disk, because it's dir-centric (i.e.
it doesn't know that the disk has failed, just that a dir has). Similarly, the local dirs and
log dirs may of course reside on the same disk, so if we were checking disks we wouldn't need
to check them independently. The DN calls this "volume checking" for the same reason; something
similar here would make sense as well. I'd call it LocalDirChecker and have it live in common
next to LocalDirAllocator. This way HDFS could reuse the code.
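
To make the dir-centric point concrete, here is a minimal sketch of what such a checker could look like. It is not the code in the attached patches: the class name LocalDirCheckerSketch, its method names, and the probe-file approach are all assumptions made for illustration. It only uses standard java.nio.file calls and looks at exactly the directories it is given, which is why it cannot tell a failed disk from a failed dir.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not the class from the MAPREDUCE-3121 patches. */
final class LocalDirCheckerSketch {

    /**
     * Returns true if the path is an existing, readable, searchable directory
     * in which a small probe file can be created and deleted again.
     */
    static boolean isUsable(Path dir) {
        if (!Files.isDirectory(dir) || !Files.isReadable(dir) || !Files.isExecutable(dir)) {
            return false;
        }
        try {
            // Write probe: a permission bit alone does not prove the dir accepts writes.
            Path probe = Files.createTempFile(dir, "dircheck", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    /** Filters the configured dirs down to the ones that pass isUsable. */
    static List<Path> usableDirs(List<Path> configured) {
        List<Path> good = new ArrayList<>();
        for (Path dir : configured) {
            if (isUsable(dir)) {
                good.add(dir);
            }
        }
        return good;
    }
}
```

Because the check is per directory, local dirs and log dirs that share one disk are still checked independently, and nothing outside the listed dirs (including the rest of the boot disk) is ever examined.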

5% seems pretty low. How did you arrive at that? Are you sure you want a 12 disk host with
only 1 working disk to keep running?
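
The arithmetic behind that question can be sketched as follows, assuming the 5% is a minimum fraction of healthy dirs and that each disk hosts exactly one configured dir. The class and method names are hypothetical and are not taken from the patch.

```java
/** Hypothetical sketch of a minimum-healthy-fraction test; names are illustrative. */
final class MinHealthyFractionSketch {

    /** Returns true if healthy/total meets or exceeds the required minimum fraction. */
    static boolean hasEnoughHealthyDirs(int healthy, int total, double minFraction) {
        if (total <= 0) {
            return false; // nothing configured, so nothing can be healthy
        }
        return (double) healthy / total >= minFraction;
    }
}
```

Under those assumptions, hasEnoughHealthyDirs(1, 12, 0.05) is true, because 1/12 is roughly 8.3% and that is above 5%, so the node would keep running on a single working disk. With a stricter, purely illustrative threshold of 0.25, the same host would need at least 3 healthy dirs.
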
> NodeManager should handle disk-failures
> ---------------------------------------
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>            Priority: Blocker
>             Fix For: 0.23.1
>         Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 3121.v2.patch
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact
> of transient/permanent disk failures on containers. With a larger number of disks per node,
> the ability to continue to run containers on other disks is crucial.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

