hadoop-yarn-issues mailing list archives

From "Yuqi Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6959) RM may allocate wrong AM Container for new attempt
Date Wed, 09 Aug 2017 02:44:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119334#comment-16119334
] 

Yuqi Wang commented on YARN-6959:
---------------------------------

[~jianhe]
The race condition can be reproduced in the below segment of a single AM-RM RPC call pipeline:
{code:java}
// One AM RM RPC call
ApplicationMasterService.allocate() {
  AllocateResponseLock lock = responseMap.get(appAttemptId);
  // MARK1: At this time, the appAttemptId is still the current attempt,
  // so lock is not null and the RPC call continues past this check.
  if (lock == null) {
    ...
    throw new ApplicationAttemptNotFoundException();
  }
  synchronized (lock) { // MARK2: The RPC call may be blocked here for a long time
    ...
    // MARK3: Between MARK1 and here, the RM may have switched to the new attempt, so the
    // previous attempt's ResourceRequests may be recorded into the current attempt's ResourceRequests.
    scheduler.allocate(attemptId, ask, ...) -> scheduler.getApplicationAttempt(attemptId)
    ...
  }
}
{code}
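To make the interleaving concrete, here is a hypothetical, self-contained Java sketch (class and identifier names such as StaleAllocateSimulation are made up for illustration; this is not the actual ApplicationMasterService or scheduler code). It models how an allocate() call holding a stale attempt id passes MARK1, blocks at MARK2 while the RM switches attempts, and then records its asks against the new current attempt:
{code:java}
// Hypothetical simulation, not YARN code: models the MARK1 / MARK2 / MARK3 interleaving.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class StaleAllocateSimulation {
  // Stands in for ApplicationMasterService.responseMap.
  static final Map<String, Object> responseMap = new ConcurrentHashMap<>();
  // Stands in for the scheduler's notion of the current attempt and its recorded asks.
  static volatile String currentAttemptId = "attempt_000001";
  static final List<String> currentAttemptAsks = new ArrayList<>();

  public static void main(String[] args) throws Exception {
    Object lock = new Object();
    responseMap.put("attempt_000001", lock);
    CountDownLatch passedMark1 = new CountDownLatch(1);

    Thread amRpc = new Thread(() -> {
      String appAttemptId = "attempt_000001";      // caller is the previous attempt's AM
      Object l = responseMap.get(appAttemptId);
      if (l == null) {
        throw new IllegalStateException("ApplicationAttemptNotFound");
      }
      passedMark1.countDown();                     // MARK1 passed: attempt 1 is still current
      synchronized (l) {                           // MARK2: blocks while another caller holds the lock
        // MARK3: the RM switched to attempt_000002 meanwhile, but the ask is
        // recorded against whatever the scheduler now considers "current".
        currentAttemptAsks.add("ask-from-" + appAttemptId);
      }
    });

    // The main thread stands in for the contending lock holder and the RM: hold the
    // lock so the RPC parks at MARK2, switch attempts, then release the lock.
    synchronized (lock) {
      amRpc.start();
      passedMark1.await();
      currentAttemptId = "attempt_000002";         // attempt switch while the RPC is blocked
    }
    amRpc.join();

    System.out.println("current attempt = " + currentAttemptId);
    System.out.println("asks recorded for it = " + currentAttemptAsks);
    // Prints an ask that actually came from attempt_000001.
  }
}
{code}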


I saw the log you mentioned. It shows that after the RM switched to the new attempt, allocate() calls
from the previous attempt still reached the scheduler.
For details, I have attached the full log; the relevant excerpt is below.
{code:java}
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e71_1500967702061_2512_01_000361 Container Transitioned from RUNNING to COMPLETED
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
Completed container: container_e71_1500967702061_2512_01_000361 in state: COMPLETED event:FINISHED
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=hadoop	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1500967702061_2512
CONTAINERID=container_e71_1500967702061_2512_01_000361
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
prod-new used=<memory:0, vCores:0, ports:null> numContainers=9349 user=hadoop user-resources=<memory:0,
vCores:0, ports:null>
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId: container_e71_1500967702061_2512_01_000361,
NodeId: BN1APS0A410B91:10025, NodeHttpAddress: Proxy5.Yarn-Prod-Bn2.BN2.ap.gbl:81/proxy/nodemanager/BN1APS0A410B91/8042,
Resource: <memory:5120, vCores:1, ports:null>, Priority: 1, Token: Token { kind: ContainerToken,
service: 10.65.11.145:10025 }, ] queue=prod-new: capacity=0.7, absoluteCapacity=0.7, usedResources=<memory:0,
vCores:0, ports:null>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=6, numContainers=9349
cluster=<memory:261614761, vCores:79088, ports:null>
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0,
vCores:0, ports:null> cluster=<memory:261614761, vCores:79088, ports:null>
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting completed queue: root.prod-new stats: prod-new: capacity=0.7, absoluteCapacity=0.7,
usedResources=<memory:0, vCores:0, ports:null>, usedCapacity=0.0, absoluteUsedCapacity=0.0,
numApps=6, numContainers=9349
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application attempt appattempt_1500967702061_2512_000001 released container container_e71_1500967702061_2512_01_000361
on node: host: BN1APS0A410B91:10025 #containers=3 available=<memory:30977, vCores:23, ports:null>
used=<memory:23552, vCores:3, ports:null> with event: FINISHED
2017-07-31 21:29:38,353 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Unregistering app attempt : appattempt_1500967702061_2512_000001
2017-07-31 21:29:38,353 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
Application finished, removing password for appattempt_1500967702061_2512_000001
2017-07-31 21:29:38,353 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1500967702061_2512_000001 State change from FINAL_SAVING to FAILED
2017-07-31 21:29:38,353 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
The number of failed attempts is 1. The max attempts is 3
2017-07-31 21:29:38,354 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1500967702061_2512 State change from RUNNING to ACCEPTED
2017-07-31 21:29:38,354 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application Attempt appattempt_1500967702061_2512_000001 is done. finalState=FAILED
2017-07-31 21:29:38,354 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Registering app attempt : appattempt_1500967702061_2512_000002
2017-07-31 21:29:38,354 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1500967702061_2512_000002 State change from NEW to SUBMITTED
2017-07-31 21:29:38,354 INFO [ApplicationMasterLauncher #49] org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
Cleaning master appattempt_1500967702061_2512_000001
{code}



> RM may allocate wrong AM Container for new attempt
> --------------------------------------------------
>
>                 Key: YARN-6959
>                 URL: https://issues.apache.org/jira/browse/YARN-6959
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, fairscheduler, scheduler
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>              Labels: patch
>             Fix For: 2.7.1, 3.0.0-alpha4
>
>         Attachments: YARN-6959.001.patch, YARN-6959.002.patch, YARN-6959.003.patch, YARN-6959.004.patch,
YARN-6959.005.patch, YARN-6959-branch-2.7.001.patch
>
>
> *Issue Summary:*
> A previous attempt's ResourceRequests may be recorded into the current attempt's ResourceRequests.
> These mis-recorded ResourceRequests may confuse the AM Container request and allocation for the
> current attempt.
> *Issue Pipeline:*
> {code:java}
> // Executing precondition check for the incoming attempt id.
> ApplicationMasterService.allocate() ->
> scheduler.allocate(attemptId, ask, ...) ->
> // Previous precondition check for the attempt id may be outdated here, 
> // i.e. the currentAttempt may not be the corresponding attempt of the attemptId.
> // Such as the attempt id is corresponding to the previous attempt.
> currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
> // Previous attempt ResourceRequest may be recorded into current attempt ResourceRequests
> currentAttempt.updateResourceRequests(ask) ->
> // RM may allocate a wrong AM Container for the current attempt, because its ResourceRequests
> // may come from the previous attempt (they can be any ResourceRequests the previous AM asked for),
> // and there is no matching logic between the original AM Container ResourceRequest and
> // the returned amContainerAllocation below.
> AMContainerAllocatedTransition.transition(...) ->
> amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
> {code}
> *Patch Correctness:*
> After this patch, RM will always record ResourceRequests from different attempts into different
> SchedulerApplicationAttempt.AppSchedulingInfo objects.
> So, even if RM still records ResourceRequests from the old attempt at any time, these ResourceRequests
> will be recorded in the old AppSchedulingInfo object and will not impact the current attempt's
> resource requests and allocation.
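> To illustrate why this isolates attempts, here is a simplified, hypothetical sketch (the class below is a made-up stand-in, not the real AppSchedulingInfo API and not the actual patch): with one object per attempt id, a late allocate() from the old attempt only ever touches the old attempt's object.
> {code:java}
> // Simplified stand-in, not YARN code: one scheduling-info object per attempt id.
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> class PerAttemptSchedulingInfo {
>   // attempt id -> that attempt's recorded asks (stands in for per-attempt AppSchedulingInfo)
>   private final Map<String, List<String>> asksByAttempt = new ConcurrentHashMap<>();
>
>   void updateResourceRequests(String attemptId, String ask) {
>     // An ask arriving late from attempt _000001 lands in _000001's list,
>     // never in _000002's, no matter when the call is made.
>     asksByAttempt.computeIfAbsent(attemptId,
>         k -> Collections.synchronizedList(new ArrayList<>())).add(ask);
>   }
>
>   List<String> asksFor(String attemptId) {
>     return asksByAttempt.getOrDefault(attemptId, Collections.emptyList());
>   }
> }
> {code}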
> *Concerns:*
> The getApplicationAttempt function in AbstractYarnScheduler is confusing: it returns the current
> attempt regardless of the attempt id passed in. We had better rename it to
> getCurrentApplicationAttempt, and reconsider whether there are any other bugs related to
> getApplicationAttempt.
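> As an illustration of the suggested rename, a hypothetical sketch (the types, the appId parsing, and the warning below are simplified stand-ins, not the real AbstractYarnScheduler code):
> {code:java}
> // Hypothetical sketch: the name makes explicit that the *current* attempt is
> // returned, which may differ from the attempt id the caller passed in.
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> class SchedulerSketch {
>   static class Attempt {
>     final String attemptId;
>     Attempt(String attemptId) { this.attemptId = attemptId; }
>   }
>
>   // application id -> its current attempt
>   private final Map<String, Attempt> currentAttemptByApp = new ConcurrentHashMap<>();
>
>   Attempt getCurrentApplicationAttempt(String attemptId) {
>     // Simplified parsing: drop the trailing attempt number to get the application id.
>     String appId = attemptId.substring(0, attemptId.lastIndexOf('_'));
>     Attempt current = currentAttemptByApp.get(appId);
>     if (current != null && !current.attemptId.equals(attemptId)) {
>       // Surface the stale-id case instead of silently acting on the new attempt.
>       System.err.println("Stale attempt id " + attemptId
>           + "; current attempt is " + current.attemptId);
>     }
>     return current;
>   }
> }
> {code}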



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


