hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved container
Date Wed, 28 Dec 2016 00:11:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781593#comment-15781593 ]

Wangda Tan commented on YARN-6029:
----------------------------------

Thanks [~Tao Yang] for reporting this issue.

[~Naganarasimha], branch-2/trunk no longer has this problem after YARN-5706.

However, backporting YARN-5706 to fix this issue would take a huge effort. I don't think it is
even a plan.

We can make some changes to LeafQueue:

1. Remove the synchronized modifier from assignContainers
2. Restructure the method body as follows:

{code}
# BEGINNING of LeafQueue#assignContainers
synchronized {
   // do the first part of the work
}

call-complete-containers   // locks the parent; the LeafQueue lock is not held here

synchronized {
   // do the rest of the work
}
# END of LeafQueue#assignContainers
{code}

Simply removing synchronized will cause data inconsistency when the data is fetched, and there
are possibly other methods with the same pattern that need to change as well (they grab the
LeafQueue lock while holding the ParentQueue lock, without grabbing the CapacityScheduler's lock).
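
A minimal Java sketch of the restructuring in item 2 may make the locking clearer. `SketchParentQueue`, `SketchLeafQueue`, their fields, and the `trace` list are hypothetical stand-ins invented for illustration; they are not the real CapacityScheduler classes, and only the lock pattern is meant to match.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for ParentQueue and LeafQueue; not the real YARN classes.
class SketchParentQueue {
    private int used = 10;

    // Takes the parent lock, in the role of ParentQueue#internalReleaseResource.
    synchronized void releaseResource(int amount) {
        used -= amount;
    }

    synchronized int getUsed() {
        return used;
    }
}

class SketchLeafQueue {
    private final SketchParentQueue parent;
    private final List<Integer> reserved = new ArrayList<>();
    // Records whether the leaf lock was held at each step (illustration only).
    final List<String> trace = new ArrayList<>();

    SketchLeafQueue(SketchParentQueue parent, int reservedAmount) {
        this.parent = parent;
        reserved.add(reservedAmount);
    }

    // Deliberately NOT a synchronized method.
    void assignContainers() {
        int toRelease;
        synchronized (this) {
            // First critical section: pick the reserved container to release.
            toRelease = reserved.isEmpty() ? 0 : reserved.remove(0);
            trace.add("leaf-locked=" + Thread.holdsLock(this));
        }

        // Call into the parent with the leaf lock released, so this thread
        // never waits for the parent lock while holding the leaf lock.
        trace.add("leaf-locked=" + Thread.holdsLock(this));
        parent.releaseResource(toRelease);

        synchronized (this) {
            // Second critical section: the rest of the allocation work.
            trace.add("leaf-locked=" + Thread.holdsLock(this));
        }
    }
}
```

Because the leaf lock is dropped between the two blocks, other threads can change the leaf's state in that window, so the second block has to re-check anything it read in the first one.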

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved container
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6029
>                 URL: https://issues.apache.org/jira/browse/YARN-6029
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Blocker
>         Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls YarnClient#getQueueAclsInfo) just at the moment that LeafQueue#assignContainers is running and has not yet notified the parent queue to release resources (it should release a reserved container), the ResourceManager can deadlock. I found this problem in our testing environment for Hadoop 2.8.
> Steps to reproduce the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls synchronized LeafQueue#assignContainers (acquiring the LeafQueue instance lock of queue root.a).
> * 2. Thread B (IPC Server handler) calls synchronized ParentQueue#getQueueUserAclInfo (acquiring the ParentQueue instance lock of queue root), iterates over the child queues' ACLs, and is blocked when calling synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of queue root.a is held by Thread A).
> * 3. Thread A wants to inform the parent queue that a container is being completed and is blocked when invoking the synchronized ParentQueue#internalReleaseResource method (the ParentQueue instance lock of queue root is held by Thread B).
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be removed to solve this problem, since this method does not appear to modify any fields of the LeafQueue instance.
> Patch with a unit test attached for review.
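
The three-step interleaving in the quoted description can be reproduced outside YARN with two plain Java monitors. The sketch below is an illustration under stated assumptions, not the attached unit test: `DemoParent`, `DemoLeaf`, and `DeadlockDemo` are hypothetical stand-ins, the latches exist only to force the ordering from the report, and `ThreadMXBean#findMonitorDeadlockedThreads` is used to confirm that the JVM sees the cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-ins; only the locking pattern mirrors the report.
class DemoParent {
    final DemoLeaf leaf = new DemoLeaf(this);

    // In the role of ParentQueue#getQueueUserAclInfo: parent lock, then leaf lock.
    synchronized void getAclInfo(CountDownLatch parentLocked, CountDownLatch leafLocked)
            throws InterruptedException {
        parentLocked.countDown();
        leafLocked.await();   // wait until Thread A owns the leaf lock
        leaf.getAclInfo();    // blocks: the leaf lock is held by Thread A
    }

    // In the role of ParentQueue#internalReleaseResource.
    synchronized void releaseResource() { }
}

class DemoLeaf {
    private final DemoParent parent;

    DemoLeaf(DemoParent parent) { this.parent = parent; }

    // In the role of LeafQueue#assignContainers: leaf lock, then parent lock.
    synchronized void assignContainers(CountDownLatch leafLocked, CountDownLatch parentLocked)
            throws InterruptedException {
        leafLocked.countDown();
        parentLocked.await();      // wait until Thread B owns the parent lock
        parent.releaseResource();  // blocks: the parent lock is held by Thread B
    }

    synchronized void getAclInfo() { }
}

class DeadlockDemo {
    // Returns true if the JVM reports both demo threads as monitor-deadlocked
    // within roughly five seconds.
    static boolean run() {
        DemoParent root = new DemoParent();
        CountDownLatch leafLocked = new CountDownLatch(1);
        CountDownLatch parentLocked = new CountDownLatch(1);

        Thread a = new Thread(() -> {
            try { root.leaf.assignContainers(leafLocked, parentLocked); }
            catch (InterruptedException ignored) { }
        }, "Thread-A");
        Thread b = new Thread(() -> {
            try { root.getAclInfo(parentLocked, leafLocked); }
            catch (InterruptedException ignored) { }
        }, "Thread-B");
        a.setDaemon(true);  // daemon threads, so the JVM can still exit
        b.setDaemon(true);
        a.start();
        b.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                long[] ids = mx.findMonitorDeadlockedThreads();
                int found = 0;
                if (ids != null) {
                    for (long id : ids) {
                        if (id == a.getId() || id == b.getId()) found++;
                    }
                }
                if (found == 2) return true;
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + run());
    }
}
```

The two threads take the same pair of locks in opposite order, which is the whole problem. Making either side stop holding its first lock while it asks for the second breaks the cycle.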



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


