hadoop-yarn-issues mailing list archives

From "Ray Chiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3607) Allow users to choose between failing the daemons vs failing the apps/containers
Date Tue, 23 Feb 2016 19:26:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159488#comment-15159488 ]

Ray Chiang commented on YARN-3607:

Two suggestions:

1) Since this is a setting that affects all daemons, it makes sense to have one setting per
daemon type, such as yarn.resourcemanager.fail-fast and yarn.nodemanager.fail-fast.

2) There are going to be a lot of places in the YARN code where this variable could be checked.
I'm thinking the first task/subtask would be to just add the variable definitions now and
then let the functionality be added where it's appropriate.
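
To make the two suggestions concrete, here is a rough sketch of what "just the variable definitions" might look like. Only the two key names come from the suggestion above. The class name FailFastSettings, the default of false, and the use of java.util.Properties in place of the Hadoop configuration classes are all assumptions for illustration, not actual YARN code.

```java
import java.util.Properties;

/** Hypothetical holder for the per-daemon fail-fast keys; not actual YARN code. */
final class FailFastSettings {
    // Key names taken from suggestion 1.
    static final String RM_FAIL_FAST = "yarn.resourcemanager.fail-fast";
    static final String NM_FAIL_FAST = "yarn.nodemanager.fail-fast";

    // Assumed default: keep the daemon running unless fail-fast is requested.
    static final boolean DEFAULT_FAIL_FAST = false;

    private FailFastSettings() {}

    /** Returns the fail-fast setting stored under the given key, or the default if unset. */
    static boolean isFailFast(Properties conf, String key) {
        String value = conf.getProperty(key);
        return value == null ? DEFAULT_FAIL_FAST : Boolean.parseBoolean(value.trim());
    }
}
```

With definitions like these in place, each call site could later check isFailFast(conf, RM_FAIL_FAST) or isFailFast(conf, NM_FAIL_FAST) as the functionality gets added, which is the split proposed in suggestion 2.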

> Allow users to choose between failing the daemons vs failing the apps/containers
> --------------------------------------------------------------------------------
>                 Key: YARN-3607
>                 URL: https://issues.apache.org/jira/browse/YARN-3607
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager, resourcemanager, scheduler
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Ray Chiang
> We often run into cases where we are faced with the option of failing the daemon (fail-fast)
vs failing the user's work and keeping the cluster running. There is no clear right way to handle
these situations - some users would like to be conservative and let the daemons run, while
others would like to fail-fast. 
> Today, we handle these case-by-case and go by what the people working on it feel is the
right way to handle things. Examples include how we handle app recovery failures and queue changes
on RM restart.
> Users should be able to choose between these two extremes, and have all these situations
handled the same way. 

This message was sent by Atlassian JIRA
