hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4790) Per user blacklist node for user specific error for container launch failure.
Date Fri, 11 Mar 2016 17:01:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191217#comment-15191217 ]

Vinod Kumar Vavilapalli commented on YARN-4790:

I agree with the problem statement, but not necessarily with the proposal. Please edit the
title so that it describes only the problem; we can then work out the right solution.

What we need is to *not* penalize applications for system-related issues. When YARN finds
a node with configuration / permission issues, it should itself take action to (a) avoid
scheduling on that node, (b) alert administrators, etc.
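
To make the idea concrete, here is a minimal sketch, not YARN code. It assumes a hypothetical `NodeHealthTracker` that classifies a launch failure from its diagnostics text, does not charge system-side failures to the application, and after a made-up threshold excludes the node and records an alert. The pattern `User <name> not found` mirrors the log quoted in the issue; every other name and the threshold are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical pattern for a platform-side problem, modelled on the
# "User jdu not found" diagnostic quoted in the issue description.
SYSTEM_FAILURE_PATTERN = re.compile(r"user \S+ not found", re.IGNORECASE)


def is_system_failure(diagnostics):
    """Return True if the diagnostics text looks like a node/platform problem."""
    return SYSTEM_FAILURE_PATTERN.search(diagnostics) is not None


class NodeHealthTracker:
    """Illustrative tracker; the threshold and all names are assumptions."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.system_failures = defaultdict(int)  # node -> system failure count
        self.excluded_nodes = set()              # nodes to avoid scheduling on
        self.alerts = []                         # messages for administrators

    def record_launch_failure(self, node, diagnostics):
        """Record a failure; return True only if it counts against the app."""
        if not is_system_failure(diagnostics):
            return True  # application-level failure: charge it to the app
        self.system_failures[node] += 1
        if (self.system_failures[node] >= self.threshold
                and node not in self.excluded_nodes):
            self.excluded_nodes.add(node)  # (a) avoid scheduling on that node
            self.alerts.append(f"node {node} excluded: {diagnostics}")  # (b) alert
        return False  # system failure: do not penalize the application

    def schedulable(self, node):
        return node not in self.excluded_nodes
```

With `threshold=2`, two `User jdu not found` failures on the same node would exclude that node and queue one alert, while neither failure would count against the application.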

Implementing heuristics for app- or user-level blacklisting to work around platform problems
should be a last-ditch effort. We did that in Hadoop 1 MapReduce because we didn't have a clear
demarcation between app and system failures. But that isn't the case with YARN, which is part
of the reason why we never implemented heuristics-based per-app blacklisting *in YARN*: we left
that completely up to applications.

> Per user blacklist node for user specific error for container launch failure.
> -----------------------------------------------------------------------------
>                 Key: YARN-4790
>                 URL: https://issues.apache.org/jira/browse/YARN-4790
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>            Reporter: Junping Du
>            Assignee: Junping Du
> There are some user-specific errors that cause container launch failures. For example,
> when LinuxContainerExecutor is enabled but the user does not exist on some node, the
> container launch fails with the following information:
> {noformat}
> 2016-02-14 15:37:03,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1434045496283_0036_000002 State change from LAUNCHED to FAILED
> 2016-02-14 15:37:03,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1434045496283_0036 failed 2 times due to AM Container for
> appattempt_1434045496283_0036_000002 exited with exitCode: -1000 due to:
> Application application_1434045496283_0036 initialization failed (exitCode=255) with output: User jdu not found
> {noformat}
> Obviously, this node is not suitable for launching containers for this user's other applications.
> We need a per-user blacklist tracking mechanism rather than the current per-application one.
