hadoop-mapreduce-issues mailing list archives

From "Hitesh Shah (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures
Date Mon, 14 Nov 2011 21:04:54 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149914#comment-13149914 ]

Hitesh Shah commented on MAPREDUCE-3121:
----------------------------------------

Some comments: 
  - DISKS_FAILED 144 is probably not a good idea. It clashes with SIGUSR1. We could use EIO
or any other relevant exit code related to file system errors. Another option is to use a
non-clashing exit code along the lines of container aborted ( -100 ). Does anyone have a preference
for which approach to use? The latter would obviously be a clearer indicator of what
the failure was and would allow easy blacklisting of this node/re-scheduling on other nodes.
  - Should the failed-disk error information be propagated into the app/container diagnostics?
  - Should ResourceLocalizationService check whether any good dirs are left
before it starts localizing the resources?
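
To illustrate the exit-code concern in the first point, here is a minimal sketch. It is not taken from the patch. The class name, the method names, and the value -101 are assumptions made for illustration only; -100 echoes the "container aborted" value mentioned above. The sketch relies on the common shell convention that a process killed by signal N is reported with exit status 128 + N, so 144 reads as signal 16. Which signal has number 16 depends on the platform.

```java
/**
 * Illustrative sketch only. These names and values are hypothetical
 * and are not claimed to match the constants in the attached patch.
 */
final class ContainerExitCodes {
    // Echoes the "container aborted ( -100 )" value from the comment.
    static final int ABORTED = -100;
    // Hypothetical non-clashing value for a disk-failure exit (an assumption).
    static final int DISKS_FAILED = -101;

    private ContainerExitCodes() {}

    /** Shells conventionally report death by signal N as status 128 + N. */
    static boolean looksLikeSignalExit(int exitCode) {
        return exitCode > 128 && exitCode < 256;
    }

    /** Returns the signal number implied by a 128 + N style exit status. */
    static int signalNumber(int exitCode) {
        if (!looksLikeSignalExit(exitCode)) {
            throw new IllegalArgumentException("not a signal-style exit code: " + exitCode);
        }
        return exitCode - 128;
    }

    /** Builds a short diagnostic string for an exit code. */
    static String describe(int exitCode) {
        if (exitCode == DISKS_FAILED) {
            return "disks failed";
        }
        if (exitCode == ABORTED) {
            return "container aborted";
        }
        if (looksLikeSignalExit(exitCode)) {
            return "possibly killed by signal " + signalNumber(exitCode);
        }
        return "exit code " + exitCode;
    }
}
```

With these definitions, describe(144) cannot tell a disk failure from a signal-16 death, while a negative code such as -101 falls outside the 129..255 range and stays unambiguous.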
 
                
> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.1
>
>         Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to minimize the impact
> of transient/permanent disk failures on containers. With a larger number of disks per node,
> the ability to continue to run containers on other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
