Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Wed, 24 Jun 2015 16:28:05 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12838048.1434430195000.3776.1435163285728@Atlassian.JIRA>
In-Reply-To: <JIRA.12838048.1434430195000@Atlassian.JIRA>
References: <JIRA.12838048.1434430195000@Atlassian.JIRA>
 <JIRA.12838048.1434430195454@arcas>
Subject: [jira] [Updated] (YARN-3809) Failed to launch new attempts because
 ApplicationMasterLauncher's threads all hang
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-3809:
-----------------------------
    Hadoop Flags: Reviewed
      Issue Type: Bug  (was: Improvement)

> Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-3809
>                 URL: https://issues.apache.org/jira/browse/YARN-3809
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch
>
>
> ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP).
> In our cluster, there were many NMs with 10+ AMs running on each, and one of them shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all of its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because they timed out.
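
The starvation described above can be reproduced in miniature with a plain java.util.concurrent fixed-size pool. The sketch below is not the YARN code: the class LauncherPoolDemo, the method launchDelayMillis, and the Thread.sleep standing in for the blocked stopContainers call are all hypothetical, and the pool size and hang time are scaled down from 10 threads and 15 minutes so that it runs quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class LauncherPoolDemo {

    /**
     * Submits cleanupEvents blocking tasks to a fixed pool of poolSize threads,
     * then submits one LAUNCH-like task and returns how long (in ms) that task
     * waited between submission and the moment it started running.
     */
    public static long launchDelayMillis(int poolSize, int cleanupEvents, long hangMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            for (int i = 0; i < cleanupEvents; i++) {
                pool.execute(() -> {
                    try {
                        // Hypothetical stand-in for an RPC that blocks on a dead NM.
                        Thread.sleep(hangMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            AtomicLong startedAt = new AtomicLong();
            CountDownLatch launched = new CountDownLatch(1);
            long submittedAt = System.nanoTime();

            // The LAUNCH-like task does no work; it only records when it got a thread.
            pool.execute(() -> {
                startedAt.set(System.nanoTime());
                launched.countDown();
            });

            launched.await();
            return TimeUnit.NANOSECONDS.toMillis(startedAt.get() - submittedAt);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Every thread is busy with a blocked CLEANUP: LAUNCH waits for one to finish.
        System.out.println("saturated pool: " + launchDelayMillis(2, 2, 400) + " ms");
        // One thread is still free: LAUNCH starts almost immediately.
        System.out.println("spare thread:   " + launchDelayMillis(3, 2, 400) + " ms");
    }
}
```

When the number of blocked CLEANUP tasks reaches the pool size, the LAUNCH task sits in the queue for roughly the full hang time. With a single spare thread it starts right away. This only illustrates the symptom and does not show what the attached patches change.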

