Date: Wed, 28 Dec 2016 03:51:58 +0000 (UTC)
From: "Tao Yang (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved container

[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781969#comment-15781969 ]

Tao Yang commented on YARN-6029:
--------------------------------

Thanks [~Naganarasimha] [~djp] [~leftnoteasy] for your suggestions.

[~Naganarasimha] I think there may be a problem if childQueues is being iterated while ParentQueue#setChildQueues is called at the same time.

[~leftnoteasy] I agree that your solution solves the problem, but I still think the synchronized modifier on LeafQueue#getQueueUserAclInfo is not required. In my opinion, this method doesn't modify the state of the LeafQueue instance (it checks the permissions of the given user, creates a new QueueUserACLInfo instance and returns it), and it is only called by ParentQueue#getQueueUserAclInfo. By the way, taking FairScheduler as a reference, FSLeafQueue#getQueueUserAclInfo is not synchronized. Maybe I have missed a potential problem; please correct me if I am wrong.

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved container
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6029
>                 URL: https://issues.apache.org/jira/browse/YARN-6029
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Blocker
>         Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls YarnClient#getQueueAclsInfo) just after LeafQueue#assignContainers has been entered but before it notifies the parent queue to release a resource (it should release a reserved container), the ResourceManager can deadlock. I found this problem in our testing environment for Hadoop 2.8.
> Steps to reproduce the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls the synchronized LeafQueue#assignContainers and acquires the LeafQueue instance lock of queue root.a.
> * 2. Thread B (IPC Server handler) calls the synchronized ParentQueue#getQueueUserAclInfo and acquires the ParentQueue instance lock of queue root. It iterates over the child queues' ACLs and blocks when calling the synchronized LeafQueue#getQueueUserAclInfo, because the LeafQueue instance lock of queue root.a is held by Thread A.
> * 3. Thread A wants to inform the parent queue that a container is being completed and blocks when invoking the synchronized ParentQueue#internalReleaseResource, because the ParentQueue instance lock of queue root is held by Thread B.
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be removed to solve this problem, since this method does not appear to modify any fields of the LeafQueue instance.
> I have attached a patch with a unit test (UT) for review.
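
The lock ordering described in steps 1 to 3 can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: Parent, Leaf, QueueLockSketch, the latches and the returned ACL string are hypothetical stand-ins, not the real CapacityScheduler classes or the attached patch. The sketch forces the same interleaving (Thread A holds the leaf lock, then Thread B queries ACLs through the parent) and leaves the leaf's getQueueUserAclInfo unsynchronized, as proposed, so both threads should be able to finish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for ParentQueue and LeafQueue; not the real YARN classes.
class QueueLockSketch {

    static class Parent {
        private final List<Leaf> children = new ArrayList<>();
        private int released = 0;

        synchronized void addChild(Leaf leaf) { children.add(leaf); }

        // Holds the parent lock while visiting each child (parent -> leaf call order).
        synchronized List<String> getQueueUserAclInfo(String user) {
            List<String> acls = new ArrayList<>();
            for (Leaf leaf : children) {
                acls.add(leaf.getQueueUserAclInfo(user));
            }
            return acls;
        }

        // Needs the parent lock, like ParentQueue#internalReleaseResource.
        synchronized void internalReleaseResource() { released++; }

        synchronized int releasedCount() { return released; }
    }

    static class Leaf {
        private final String name;
        private final Parent parent;

        Leaf(String name, Parent parent) { this.name = name; this.parent = parent; }

        // Deliberately NOT synchronized: reads only final fields and builds a new value.
        String getQueueUserAclInfo(String user) {
            return name + ":" + user + ":SUBMIT_APPLICATIONS";
        }

        // Holds the leaf lock, then calls up into the parent (leaf -> parent call order).
        synchronized void assignContainers(CountDownLatch leafLocked, CountDownLatch aclDone)
                throws InterruptedException {
            leafLocked.countDown();
            // Keep the leaf lock held until the ACL query has finished (or 5 s pass).
            aclDone.await(5, TimeUnit.SECONDS);
            parent.internalReleaseResource();
        }
    }

    /** Runs the interleaving from the report; returns true if both threads finish. */
    static boolean runScenario() throws InterruptedException {
        Parent root = new Parent();
        Leaf a = new Leaf("root.a", root);
        root.addChild(a);
        CountDownLatch leafLocked = new CountDownLatch(1);
        CountDownLatch aclDone = new CountDownLatch(1);

        Thread threadA = new Thread(() -> {
            try {
                a.assignContainers(leafLocked, aclDone);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread threadB = new Thread(() -> {
            try {
                leafLocked.await();                 // wait until A holds the leaf lock
                root.getQueueUserAclInfo("alice");  // takes the parent lock, visits the leaf
                aclDone.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Daemon threads so the JVM can still exit if the threads were ever stuck.
        threadA.setDaemon(true);
        threadB.setDaemon(true);
        threadA.start();
        threadB.start();
        threadA.join(10_000);
        threadB.join(10_000);
        return !threadA.isAlive() && !threadB.isAlive() && root.releasedCount() == 1;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed without deadlock: " + runScenario());
    }
}
```

If the synchronized modifier were added back to Leaf#getQueueUserAclInfo in this sketch, Thread B would block on the leaf lock while holding the parent lock, and Thread A would then block on the parent lock while holding the leaf lock, which is the cycle described in the report.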