hadoop-mapreduce-issues mailing list archives

From "Hitesh Shah (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures
Date Mon, 14 Nov 2011 21:04:54 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149914#comment-13149914 ]

Hitesh Shah commented on MAPREDUCE-3121:

Some comments:
  - Using 144 for DISKS_FAILED is probably not a good idea, since it clashes with SIGUSR1. We
could use EIO or another exit code related to file system errors. Another option is a
non-clashing exit code along the lines of container aborted ( -100 ). Does anyone have a
preference between the two approaches? The latter would be a clearer indicator of what the
failure was, and would make it easy to blacklist this node and re-schedule on other nodes.
  - Should the failed-disk error information be propagated into the app/container diagnostics?
  - Should ResourceLocalizationService check whether any good dirs are left before it starts
localizing resources?
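
To make the second option and the last two questions concrete, here is a minimal sketch. It is not taken from the attached patches: the class name, the method names, and the value -101 are all assumptions chosen for illustration, and the "good dir" test is a simplified stand-in for whatever disk health check the NodeManager actually uses.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names come from the MAPREDUCE-3121 patches.
final class DiskFailureSketch {

    // Assumed value: a negative code in the style of "container aborted" (-100),
    // so it cannot be mistaken for a 128+signal exit status.
    static final int DISKS_FAILED = -101;

    // Returns the subset of dirs that exist and are readable, writable and searchable.
    // This is a simplified notion of a "good" dir.
    static List<String> goodDirs(List<String> dirs) {
        List<String> good = new ArrayList<>();
        for (String dir : dirs) {
            File f = new File(dir);
            if (f.isDirectory() && f.canRead() && f.canWrite() && f.canExecute()) {
                good.add(dir);
            }
        }
        return good;
    }

    // Check to run before localization starts. Returns null when at least one
    // good dir remains; otherwise returns a diagnostics string that could be
    // attached to the container status together with DISKS_FAILED.
    static String diagnosticsIfNoGoodDirs(List<String> localDirs) {
        if (!goodDirs(localDirs).isEmpty()) {
            return null;
        }
        return "No usable local dirs left; failed dirs: " + localDirs
                + " (exit code " + DISKS_FAILED + ")";
    }
}
```

A caller would run diagnosticsIfNoGoodDirs on the configured local dirs before localizing, and on a non-null result fail the container with DISKS_FAILED and the returned text as its diagnostics.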
> NodeManager should handle disk-failures
> ---------------------------------------
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.1
>         Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact
> of transient/permanent disk failures on containers. With a larger number of disks per node,
> the ability to continue to run containers on other disks is crucial.
