hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6901) A CapacityScheduler app->LeafQueue deadlock found in branch-2.8
Date Mon, 31 Jul 2017 23:01:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108126#comment-16108126
] 

Jason Lowe commented on YARN-6901:
----------------------------------

As I understand the stacktraces, we have the preemption monitor thread who has locked a leaf
queue and is in the process of trying to lock the application within the queue.  Then we have
the scheduling thread which is calling LeafQueue's assignContainers and ultimately locks an
application then tries to lock the leaf queue for that application.  Is that correct?  If
so I'm wondering how we're deadlocking since the thread calling LeafQueue#assignContainers
should already have the lock on the leaf queue.  Maybe I missed a spot since I couldn't line
up the line numbers in the stack trace with branch-2.8 source, but everywhere I saw LeafQueue's
assignContainers calling the application's assignContainers method it was within the context
of a {{synchronized(this)}} block.

Now I'm wondering how we did _not_ have the lock when trying to get the leaf queue path. 
Are we somehow trying to examine an allocation for some other queue, or does the queue think
this application is part of this queue but the app thinks it is part of some other queue?
 I'm probably missing something basic, but I'm not sure why we don't already have the leaf
queue locked when FiCaSchedulerApp#assignContainers is called.

On a related note, I found it odd that LeafQueue#assignContainers explicitly locks the app
before callling assignContainers on it when it is trying to assign a reserved container but
I didn't see a similar lock on the application before calling assignContainers for the non-reserved
case.  Seems like we're either missing a lock we need or grabbing a lock unnecessarily.

> A CapacityScheduler app->LeafQueue deadlock found in branch-2.8 
> ----------------------------------------------------------------
>
>                 Key: YARN-6901
>                 URL: https://issues.apache.org/jira/browse/YARN-6901
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.0
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Blocker
>         Attachments: YARN-6901.branch-2.8.001.patch
>
>
> Stacktrace:
> {code}
> Thread 22068: (state = BLOCKED)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.getParent()
@bci=0, line=185 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueuePath()
@bci=8, line=262 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator.getCSAssignmentFromAllocateResult(org.apache.hadoop.yarn.api.records.Resource,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocation,
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer) @bci=183, line=80 (Compiled
frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(org.apache.hadoop.yarn.api.records.Resource,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode, org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.SchedulingMode,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceLimits, org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer)
@bci=204, line=747 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(org.apache.hadoop.yarn.api.records.Resource,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode, org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.SchedulingMode,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceLimits, org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer)
@bci=16, line=49 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(org.apache.hadoop.yarn.api.records.Resource,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode, org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceLimits,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.SchedulingMode, org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer)
@bci=61, line=468 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(org.apache.hadoop.yarn.api.records.Resource,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode, org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceLimits,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.SchedulingMode) @bci=148,
line=876 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode)
@bci=157, line=1149 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEvent)
@bci=266, line=1277 (Compiled frame)
> ================
>  Thread 22124: (state = BLOCKED)
>  - org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getReservedContainers()
@bci=0, line=336 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.preemptFrom(org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp,
org.apache.hadoop.yarn.api.records.Resource, java.util.Map, java.util.List, org.apache.hadoop.yarn.api.records.Resource,
java.util.Map, org.apache.hadoop.yarn.api.records.Resource) @bci=61, line=277 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(java.util.Map,
org.apache.hadoop.yarn.api.records.Resource, org.apache.hadoop.yarn.api.records.Resource)
@bci=374, line=138 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue,
org.apache.hadoop.yarn.api.records.Resource) @bci=264, line=342 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule()
@bci=34, line=202 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy()
@bci=4, line=81 (Compiled frame)
>  - org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run()
@bci=23, line=92 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code} 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message