hadoop-yarn-issues mailing list archives

From "Hu Ziqian (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues
Date Mon, 07 Jan 2019 03:45:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735424#comment-16735424 ]

Hu Ziqian commented on YARN-9163:

[~cheersyang], actually we use an internal version of Hadoop based on 2.8, with the
global scheduler backported into it. I'm not sure which community version it corresponds to.

> Deadlock when use yarn rmadmin -refreshQueues
> ---------------------------------------------
>                 Key: YARN-9163
>                 URL: https://issues.apache.org/jira/browse/YARN-9163
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Blocker
>         Attachments: YARN-9163.001.patch, rm.jstack.ziqian.log
> We have a cluster with 4000+ nodes and 100,000+ apps per day in our production environment.
When we run the CLI command yarn rmadmin -refreshQueues, the active RM process hangs and HA failover doesn't
happen, which means the whole cluster stops serving and we can only fix it by restarting the active
RM. We can reproduce this on our production cluster every time but can't reproduce it in our test
environment, which has only 100+ nodes and few apps. Both our production and test environments
use CapacityScheduler with asynchronous scheduling and preemption enabled.
> Analyzing the jstack of the active RM, we found a deadlock in it:
> thread one (refreshQueues thread):
>  * holds the write lock of the capacity scheduler
>  * holds the write lock of PreemptionManager
>  * waits for the read lock of the root queue
> thread two (asyncScheduleThread):
>  * holds the read lock of the root queue
>  * waits for the write lock of PreemptionManager
> thread three (IPC handler on port 8030, which handles allocate requests):
>  * waits for the write lock of the root queue
> These three threads together form a deadlock.
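
The lock behaviour behind this cycle can be reproduced in isolation. The following is a minimal sketch, not taken from the YARN code or the attached patch: the class, method, and variable names are hypothetical, and it assumes the root queue lock behaves like a default (non-fair) java.util.concurrent.locks.ReentrantReadWriteLock. One thread holds the read lock (like thread two), a second thread queues for the write lock (like thread three), and the calling thread then asks for the read lock with a timeout (like thread one).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReaderBehindWriterDemo {

    /**
     * Returns true if a new reader obtained the read lock within 200 ms while
     * another reader held it and a writer was already queued.
     */
    static boolean newReaderAcquiresWhileWriterQueued() {
        // Hypothetical stand-in for the root queue lock; default constructor = non-fair.
        ReentrantReadWriteLock rootQueueLock = new ReentrantReadWriteLock();
        CountDownLatch readHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Plays the role of thread two: holds the read lock until released.
        Thread holder = new Thread(() -> {
            rootQueueLock.readLock().lock();
            try {
                readHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                rootQueueLock.readLock().unlock();
            }
        });

        // Plays the role of thread three: queues for the write lock.
        Thread writer = new Thread(() -> {
            rootQueueLock.writeLock().lock();
            rootQueueLock.writeLock().unlock();
        });

        try {
            holder.start();
            readHeld.await();
            writer.start();
            // Wait until the writer is actually parked in the lock's wait queue.
            while (!rootQueueLock.hasQueuedThread(writer)) {
                Thread.sleep(1);
            }
            // Plays the role of thread one: a new read request behind the queued writer.
            boolean acquired = rootQueueLock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                rootQueueLock.readLock().unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Let the holder finish so the writer can proceed and both threads exit.
            release.countDown();
            try {
                holder.join();
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("new reader acquired: " + newReaderAcquiresWhileWriterQueued());
    }
}
```

In this sketch the timed read request fails even though only a reader currently holds the lock, because the queued writer is ahead of it. In the reported jstack the equivalent read request has no timeout, so thread one waits indefinitely while still holding the PreemptionManager write lock that thread two needs.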
> The deadlock happens because of a "bug" in ReadWriteLock: a pending writeLock request blocks
future readLock requests even under the unfair policy (https://bugs.openjdk.java.net/browse/JDK-6893626). To
solve this problem, we changed the logic of the refreshQueues thread: it gets a copy of the queue info
first, so that the thread never holds the write lock of PreemptionManager and requests the read lock
of the root queue at the same time.
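
The "copy first" idea can be sketched as follows. This is only an illustration of the lock ordering described above, not the contents of YARN-9163.001.patch: the class, fields, and methods are hypothetical, and the queue info is reduced to a simple map of queue name to capacity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class CopyFirstRefreshSketch {
    // Hypothetical stand-ins for the root queue lock and the PreemptionManager lock.
    final ReentrantReadWriteLock rootQueueLock = new ReentrantReadWriteLock();
    final ReentrantReadWriteLock preemptionLock = new ReentrantReadWriteLock();

    // Hypothetical queue state guarded by rootQueueLock.
    final Map<String, Float> queueCapacities = new HashMap<>();
    // Hypothetical view of the queues guarded by preemptionLock.
    Map<String, Float> preemptionView = new HashMap<>();

    // Recorded only so the sketch can show that no read lock is held during the update.
    int readHoldsSeenDuringUpdate = -1;

    /** Step 1: take the root queue read lock, copy the queue info, and release the lock. */
    Map<String, Float> snapshotQueues() {
        rootQueueLock.readLock().lock();
        try {
            return new HashMap<>(queueCapacities);
        } finally {
            rootQueueLock.readLock().unlock();
        }
    }

    /** Step 2: update the preemption view from the copy while holding only its write lock. */
    void refreshPreemptionView() {
        Map<String, Float> copy = snapshotQueues();
        preemptionLock.writeLock().lock();
        try {
            // The current thread holds no read lock on the root queue at this point.
            readHoldsSeenDuringUpdate = rootQueueLock.getReadHoldCount();
            preemptionView = copy;
        } finally {
            preemptionLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        CopyFirstRefreshSketch sketch = new CopyFirstRefreshSketch();
        sketch.queueCapacities.put("root.a", 0.6f);
        sketch.refreshPreemptionView();
        System.out.println(sketch.preemptionView);
    }
}
```

Because the read lock is released before the write lock is requested, the refresh thread never waits for the root queue while holding the PreemptionManager lock, which removes one edge of the cycle. The trade-off is that the copy may be slightly stale by the time it is applied.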
> We tested our new code in our production environment and the refresh queue command works

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
