hadoop-mapreduce-issues mailing list archives

From "Ravi Gummadi (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures
Date Fri, 04 Nov 2011 09:47:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143873#comment-13143873
] 

Ravi Gummadi commented on MAPREDUCE-3121:
-----------------------------------------

Code level summary of the patch attached:

(1) The NodeManager launches a DiskHealthCheckerService that periodically executes the
disk-health-check code. A new configuration property yarn.nodemanager.disk-health-checker.interval-ms
is added to control the frequency with which this code is executed, with a default value of
120*1000 ms (i.e. 2 minutes).
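
A minimal sketch of the periodic scheduling described above, using java.util.concurrent directly; the class and method names here are illustrative assumptions, not the patch's actual YARN classes:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a service that runs disk-health-check code on a
// configurable interval. Not the actual DiskHealthCheckerService code.
public class DiskHealthCheckSketch {
    // Mirrors the 120*1000 ms default of
    // yarn.nodemanager.disk-health-checker.interval-ms.
    public static final long DEFAULT_INTERVAL_MS = 120 * 1000L;

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public void start(long intervalMs, Runnable diskCheck) {
        // Run the disk check immediately, then repeat after each interval.
        scheduler.scheduleWithFixedDelay(diskCheck, 0, intervalMs,
                                         TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```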

(2) LocalStorage is a new class that manages a list of local file system directories and provides
an API for checking the health of those directories. It is mostly similar to the TaskTracker.LocalStorage
class of 0.20, except that this class' checkDirs() doesn't throw DiskErrorException when all
directories fail; instead, it returns true if a new disk failure is seen.
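
The checkDirs() behavior can be sketched as below; the class and method names are assumptions modeled on the description, and the health test is simplified to basic read/write checks:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of checkDirs(): rather than throwing when every
// directory fails, it reports whether any *new* failure appeared since the
// previous check. Not the patch's actual LocalStorage code.
public class LocalStorageSketch {
    private final List<String> allDirs;
    private final Set<String> failedDirs = new HashSet<>();

    public LocalStorageSketch(List<String> dirs) {
        this.allDirs = new ArrayList<>(dirs);
    }

    /** Returns true iff a directory failed that was healthy last time. */
    public boolean checkDirs() {
        boolean newFailure = false;
        for (String dir : allDirs) {
            File f = new File(dir);
            boolean healthy = f.isDirectory() && f.canRead() && f.canWrite();
            if (!healthy && failedDirs.add(dir)) {
                newFailure = true;          // first time this dir failed
            } else if (healthy) {
                failedDirs.remove(dir);     // disk recovered
            }
        }
        return newFailure;
    }

    public List<String> getGoodDirs() {
        List<String> good = new ArrayList<>(allDirs);
        good.removeAll(failedDirs);
        return good;
    }
}
```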

(3) DiskHealthCheckerService maintains 2 LocalStorage objects: one for nm-local-dirs and
one for nm-log-dirs.

(4) ContainerExecutor is initialized with the DiskHealthCheckerService object, so both DefaultContainerExecutor.java
and LinuxContainerExecutor.java always get the good nm-local-dirs and nm-log-dirs from the
DiskHealthChecker.

(5) The container-executor binary gets the good nm-local-dirs and good nm-log-dirs as parameters
and uses only these good dirs. They are therefore removed from the configuration file (i.e. they
no longer need to be configured in container-executor.cfg).
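
One way this could look is sketched below; the exact argument order and comma-joined format are assumptions for illustration, not the binary's real interface:

```java
import java.util.List;

// Hypothetical sketch of passing the good dir lists on the
// container-executor command line instead of via container-executor.cfg.
public class ExecutorArgsSketch {
    // Join a dir list into a single comma-separated argument.
    public static String joinDirs(List<String> goodDirs) {
        return String.join(",", goodDirs);
    }

    public static String[] buildCommand(String binary,
                                        List<String> goodLocalDirs,
                                        List<String> goodLogDirs) {
        return new String[] {
            binary,
            joinDirs(goodLocalDirs),   // good nm-local-dirs
            joinDirs(goodLogDirs)      // good nm-log-dirs
        };
    }
}
```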

(6) Whenever a new container gets launched, the good nm-local-dirs and good nm-log-dirs are
updated in the configuration so that containers won't access bad disks. Everybody (localizer,
webserver) goes through DiskHealthChecker to access nm-local-dirs and nm-log-dirs.

(7) On the NodeManager web UI, NodeHealthReport shows the list of good nm-local-dirs
and nm-log-dirs in addition to true/false about the health of the node.
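
A minimal sketch of such a report string; the format and field names here are invented for illustration and are not what the web UI actually renders:

```java
import java.util.List;

// Hypothetical sketch of a node health report that lists the good dirs
// alongside the boolean health status.
public class HealthReportSketch {
    public static String buildReport(boolean healthy,
                                     List<String> goodLocalDirs,
                                     List<String> goodLogDirs) {
        return "healthy=" + healthy
            + "; good-local-dirs=" + String.join(",", goodLocalDirs)
            + "; good-log-dirs=" + String.join(",", goodLogDirs);
    }
}
```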

(8) A new unit test, TestDiskFailures, is added that makes disks (both nm-local-dirs and nm-log-dirs)
fail and validates that the NodeManager/DiskHealthChecker identifies these disk failures.
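
The core idea of such a test can be sketched in isolation as below; the real test would drive the NodeManager itself, while this self-contained version just simulates a failure (by deleting a directory) and checks that a simple health predicate notices:

```java
import java.io.File;
import java.nio.file.Files;

// Self-contained sketch of the TestDiskFailures idea: make a directory
// "fail" and verify the failure is detected. Names are illustrative.
public class DiskFailureTestSketch {
    static boolean isHealthy(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite();
    }

    /** Returns true iff the simulated failure was detected. */
    public static boolean simulateFailureDetected() {
        try {
            File dir = Files.createTempDirectory("nm-local-dir").toFile();
            boolean healthyBefore = isHealthy(dir);
            // Simulate a disk failure by removing the directory.
            if (!dir.delete()) {
                return false;
            }
            boolean healthyAfter = isHealthy(dir);
            return healthyBefore && !healthyAfter;
        } catch (Exception e) {
            return false;
        }
    }
}
```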

Tested the patch with (1) DefaultContainerExecutor and (2) LinuxContainerExecutor on my single-node
cluster. The functionality seems to be working fine, with disk failures getting identified
by the NodeManager and the bad nm-local-dirs and bad nm-log-dirs being avoided for new containers.
                
> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.1
>
>         Attachments: 3121.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact
of transient/permanent disk failures on containers. With larger number of disks per node,
the ability to continue to run containers on other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
