hadoop-mapreduce-issues mailing list archives

From "Hitesh Shah (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures
Date Fri, 04 Nov 2011 21:45:51 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144374#comment-13144374 ]

Hitesh Shah commented on MAPREDUCE-3121:

I did a brief review of the patch. A couple of points so far; I will go through the patch
in more detail later.

1. I believe the early return in ContainersLaunch#handle when no good disks are available
needs to trigger a ContainerExitedWithFailure event, so that the container state machine can
handle the failure and clean up appropriately. Once that is done, I think the cleanup path
needs a null check on rContainerDatum. Please add test cases to verify that the container
state machine works correctly when all disks have failed.
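To illustrate point 1, here is a minimal sketch, not the patch's code. LaunchSketch, handleLaunch, handleCleanup and the EventType enum are hypothetical stand-ins, and the "running" map only plays the role that rContainerDatum's lookup might play. It assumes cleanup can be reached for a container that was never launched, which is why the null check is there.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the launch handler; none of these names come from the patch. */
final class LaunchSketch {
    enum EventType { CONTAINER_LAUNCHED, CONTAINER_EXITED_WITH_FAILURE }

    /** Events the container state machine would receive, recorded here for inspection. */
    final List<String> dispatched = new ArrayList<>();

    /** Per-container bookkeeping; an entry exists only if a launch was actually started. */
    final Map<String, Object> running = new HashMap<>();

    void handleLaunch(String containerId, List<String> goodDirs) {
        if (goodDirs.isEmpty()) {
            // Report the failure instead of returning silently, so the
            // state machine can move the container on and clean up.
            dispatched.add(containerId + ":" + EventType.CONTAINER_EXITED_WITH_FAILURE);
            return;
        }
        running.put(containerId, new Object());
        dispatched.add(containerId + ":" + EventType.CONTAINER_LAUNCHED);
    }

    /** Returns true if there was launch state to clean up. */
    boolean handleCleanup(String containerId) {
        Object datum = running.remove(containerId);
        if (datum == null) {
            // Assumed case: the launch bailed out early, so nothing was registered.
            return false;
        }
        return true;
    }
}
```

A test for the all-disks-failed case would call handleLaunch with an empty directory list, then check that the failure event was dispatched and that handleCleanup does not throw.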

2. A general comment about how the diskchecker object is passed around and how the
getLocalDirs()/getLogDirs() functions are called. Given that a timer task updates the
values in the background, there may be various inconsistencies depending on when the
functions are called.

For example, in the Linux Executor, the final command being constructed uses 2 calls: 
   - checker.getLocalDirsString()
   - checker.getLocalPaths()

Modifications made in the time that elapses between the 2 calls may create issues. Also,
the executor now needs to check whether there is at least one good disk.
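The inconsistency can be reproduced with a small stand-in. This is only a sketch: MutableDirChecker and markFailed are hypothetical, and only the two getter names are taken from the comment above. Each getter is synchronized on its own, yet the pair of calls is still not atomic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checker whose list of good dirs can change between calls. */
final class MutableDirChecker {
    private final List<String> goodDirs = new ArrayList<>();

    MutableDirChecker(List<String> initial) {
        goodDirs.addAll(initial);
    }

    synchronized String getLocalDirsString() {
        return String.join(",", goodDirs);
    }

    synchronized List<String> getLocalPaths() {
        return new ArrayList<>(goodDirs);
    }

    /** Stands in for the background timer task marking a disk as failed. */
    synchronized void markFailed(String dir) {
        goodDirs.remove(dir);
    }
}
```

If markFailed("/d2") runs between getLocalDirsString() and getLocalPaths(), the string still names /d2 while the path list no longer contains it, so a command built from both values would disagree with itself.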

IMO, one approach we could take is to do the check at the top level. For example, ContainersLaunch
as per the patch does a sanity check on available disks and bails if there is an issue. It
could create a snapshot of the available disks at this point and pass it to the ContainerLaunch,
which in turn would pass it on to the executor's launch container call. This would probably
also simplify the code, since the low-level components would no longer need to check whether
any good disks are available.
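One way the snapshot could look, as a rough sketch only: DirSnapshot and its methods are hypothetical names, not part of the patch. The idea is to copy the good directories once at the top level and hand the immutable copy down, so every value derived from it is consistent.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical immutable snapshot of the good dirs, taken once at launch time. */
final class DirSnapshot {
    private final List<String> localDirs;
    private final List<String> logDirs;

    DirSnapshot(List<String> localDirs, List<String> logDirs) {
        // Defensive copies: later changes to the live lists do not leak in.
        this.localDirs = Collections.unmodifiableList(new ArrayList<>(localDirs));
        this.logDirs = Collections.unmodifiableList(new ArrayList<>(logDirs));
    }

    List<String> getLocalDirs() {
        return localDirs;
    }

    List<String> getLogDirs() {
        return logDirs;
    }

    /** The single top-level sanity check; lower layers would not repeat it. */
    boolean hasGoodDisks() {
        return !localDirs.isEmpty() && !logDirs.isEmpty();
    }

    /** Derived from the same copy as getLocalDirs(), so the two always agree. */
    String localDirsString() {
        return String.join(",", localDirs);
    }
}
```

With this shape, the top-level handler would build the snapshot, bail out if hasGoodDisks() is false, and otherwise pass the same object through to the executor.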


> NodeManager should handle disk-failures
> ---------------------------------------
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.1
>         Attachments: 3121.patch
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact
> of transient/permanent disk failures on containers. With a larger number of disks per node,
> the ability to continue to run containers on other disks is crucial.

