hadoop-yarn-issues mailing list archives

From "Yuqi Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6959) RM may allocate wrong AM Container for new attempt
Date Thu, 10 Aug 2017 06:36:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121146#comment-16121146 ]

Yuqi Wang commented on YARN-6959:
---------------------------------

Yes, it is very rare. This is the first time I have seen it in our large cluster.

The log is from our production cluster.
We have a very large cluster (>50k nodes) which serves daily batch jobs and long-running
services from our customers in Microsoft.

Our customer complains that their job simply failed without any effective retry attempts.
As the log shows, the AM container size decreased from 20GB to 5GB, so the new attempt
will definitely fail, since the pmem limit is enabled.

As I said in the Description of this JIRA:
Concerns:
The name of the getApplicationAttempt function in AbstractYarnScheduler is confusing; we should
rename it to getCurrentApplicationAttempt, and reconsider whether there are any other bugs
related to getApplicationAttempt.
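
To make the concern concrete, here is a minimal sketch of how a lookup that ignores the attempt number can let a late ask from the old AM land in the new attempt. Everything in it is a hypothetical simplification (the StaleAttemptDemo, Attempt and Scheduler classes, the integer attempt ids, and the single 5 GB ask are assumptions for illustration), not the real YARN code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins; these are NOT the real YARN classes.
class StaleAttemptDemo {
    /** Holds the resource requests (memory in GB) recorded for one attempt. */
    static final class Attempt {
        final int attemptId;
        final List<Integer> requestedGb = new ArrayList<>();

        Attempt(int attemptId) {
            this.attemptId = attemptId;
        }
    }

    /** Tracks only the current attempt of each application. */
    static final class Scheduler {
        private final Map<String, Attempt> currentByApp = new HashMap<>();

        void startAttempt(String appId, int attemptId) {
            currentByApp.put(appId, new Attempt(attemptId));
        }

        // The confusing part: the attemptId argument is ignored, so the caller
        // always gets the current attempt of the application.
        Attempt getApplicationAttempt(String appId, int attemptId) {
            return currentByApp.get(appId);
        }
    }

    /** Returns what attempt 2 has recorded after a late ask sent by attempt 1. */
    static List<Integer> requestsSeenByNewAttempt() {
        Scheduler scheduler = new Scheduler();
        scheduler.startAttempt("app_1", 1);
        scheduler.startAttempt("app_1", 2); // attempt 1 failed, attempt 2 is current

        // A late allocate() from the old AM arrives carrying attempt id 1.
        Attempt target = scheduler.getApplicationAttempt("app_1", 1);
        target.requestedGb.add(5); // assumed 5 GB ask from the old AM

        return scheduler.getApplicationAttempt("app_1", 2).requestedGb;
    }

    public static void main(String[] args) {
        // The new attempt now holds a request it never made.
        System.out.println("attempt 2 requests: " + requestsSeenByNewAttempt());
    }
}
```

In this sketch the name getApplicationAttempt suggests a lookup by attempt id, while the behaviour is "get the current attempt", which is why the rename is proposed.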



> RM may allocate wrong AM Container for new attempt
> --------------------------------------------------
>
>                 Key: YARN-6959
>                 URL: https://issues.apache.org/jira/browse/YARN-6959
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, fairscheduler, scheduler
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>              Labels: patch
>             Fix For: 2.7.1, 3.0.0-alpha4
>
>         Attachments: YARN-6959.001.patch, YARN-6959.002.patch, YARN-6959.003.patch, YARN-6959.004.patch,
YARN-6959.005.patch, YARN-6959-branch-2.7.001.patch, YARN-6959.yarn_nm.log.zip, YARN-6959.yarn_rm.log.zip
>
>
> *Issue Summary:*
> A ResourceRequest from the previous attempt may be recorded into the current attempt's
> ResourceRequests. These mis-recorded ResourceRequests may confuse the AM Container request
> and allocation for the current attempt.
> *Issue Pipeline:*
> {code:java}
> // Executing precondition check for the incoming attempt id.
> ApplicationMasterService.allocate() ->
> scheduler.allocate(attemptId, ask, ...) ->
> // The precondition check above may be outdated by this point,
> // i.e. currentAttempt may not be the attempt that attemptId refers to,
> // for example when attemptId belongs to the previous attempt.
> currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
> // ResourceRequests from the previous attempt may be recorded into the current attempt's ResourceRequests.
> currentAttempt.updateResourceRequests(ask) ->
> // RM may allocate the wrong AM Container for the current attempt, because its ResourceRequests
> // may come from the previous attempt (any ResourceRequests the previous AM asked for),
> // and there is no logic matching the original AM Container ResourceRequest against
> // the returned amContainerAllocation below.
> AMContainerAllocatedTransition.transition(...) ->
> amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
> {code}
> *Patch Correctness:*
> After this patch, RM always records ResourceRequests from different attempts into
> different SchedulerApplicationAttempt.AppSchedulingInfo objects.
> So even if RM still records ResourceRequests from an old attempt at some point, they are
> recorded in the old AppSchedulingInfo object, which does not affect the current attempt's
> resource requests and allocation.
> *Concerns:*
> The name of the getApplicationAttempt function in AbstractYarnScheduler is confusing; we
> should rename it to getCurrentApplicationAttempt, and reconsider whether there are any other
> bugs related to getApplicationAttempt.
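
The Patch Correctness note in the quoted description can be illustrated with a small sketch: when each attempt owns its own scheduling-info object, a stale ask from the old attempt is still recorded, but only in the old object. The class and method names here (PerAttemptRequestsDemo, SchedulingInfo, requestsOf) and the 20 GB / 5 GB values are assumptions for illustration; this is not the actual patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simplification; the names below are NOT the real YARN classes.
class PerAttemptRequestsDemo {
    /** Stand-in for per-attempt scheduling info: asks recorded in GB. */
    static final class SchedulingInfo {
        final List<Integer> requestedGb = new ArrayList<>();
    }

    // One SchedulingInfo object per attempt id, never shared between attempts.
    private final Map<Integer, SchedulingInfo> infoByAttempt = new HashMap<>();

    void startAttempt(int attemptId) {
        infoByAttempt.put(attemptId, new SchedulingInfo());
    }

    /** Records an ask against the attempt that actually sent it. */
    void updateResourceRequests(int attemptId, int askGb) {
        SchedulingInfo info = infoByAttempt.get(attemptId);
        if (info != null) {
            info.requestedGb.add(askGb);
        }
    }

    List<Integer> requestsOf(int attemptId) {
        return infoByAttempt.get(attemptId).requestedGb;
    }

    public static void main(String[] args) {
        PerAttemptRequestsDemo rm = new PerAttemptRequestsDemo();
        rm.startAttempt(1);
        rm.startAttempt(2);
        rm.updateResourceRequests(2, 20); // AM container ask of the new attempt
        rm.updateResourceRequests(1, 5);  // late ask from the old attempt's AM
        System.out.println("attempt 1: " + rm.requestsOf(1));
        System.out.println("attempt 2: " + rm.requestsOf(2));
    }
}
```

Here the late 5 GB ask stays with attempt 1, so the 20 GB AM container request of attempt 2 is left untouched.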



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

