Date: Fri, 6 Jan 2017 01:41:59 +0000 (UTC)
From: "Junping Du (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

     [ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated YARN-3809:
-----------------------------
    Fix Version/s: 2.8.0

> Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-3809
>                 URL: https://issues.apache.org/jira/browse/YARN-3809
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>             Fix For: 2.8.0, 2.7.1, 3.0.0-alpha1
>
>         Attachments: YARN-3809.01.patch, YARN-3809.02.patch, YARN-3809.03.patch
>
>
> ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP).
> In our cluster, there were many NMs with 10+ AMs running on them, and one of them shut down for some reason. After the RM found the NM LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher needed to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all of its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down; the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of timeouts.
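
The starvation described in this report can be reproduced in miniature with a plain fixed-size thread pool. The sketch below is not the ApplicationMasterLauncher code and does not show what the attached patches change; the class name LauncherPoolStarvation, the method launchStartsWhileCleanupsBlock, and the latch standing in for the blocking stopContainers RPC are all hypothetical, chosen only for illustration. It assumes that CLEANUP and LAUNCH events share a single pool and that each CLEANUP blocks until the NM answers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LauncherPoolStarvation {

    /**
     * Submits blocking "CLEANUP" tasks to a fixed pool, then one "LAUNCH" task.
     * Returns true if the LAUNCH task started within waitMillis.
     */
    public static boolean launchStartsWhileCleanupsBlock(int poolSize, int cleanupEvents, long waitMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Hypothetical stand-in for the dead NM: nothing releases this latch
        // until the finally block, so every CLEANUP task stays blocked.
        CountDownLatch nmResponds = new CountDownLatch(1);
        CountDownLatch launchStarted = new CountDownLatch(1);
        try {
            for (int i = 0; i < cleanupEvents; i++) {
                pool.execute(() -> {
                    try {
                        nmResponds.await(); // plays the role of the hanging stopContainers call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The LAUNCH event goes to the same pool; it only runs if a thread is free.
            pool.execute(launchStarted::countDown);
            return launchStarted.await(waitMillis, TimeUnit.MILLISECONDS);
        } finally {
            nmResponds.countDown(); // release the blocked tasks
            pool.shutdownNow();     // discard anything still queued
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("10 threads, 10 blocked CLEANUPs, LAUNCH started: "
                + launchStartsWhileCleanupsBlock(10, 10, 200));
        System.out.println("10 threads, 9 blocked CLEANUPs, LAUNCH started: "
                + launchStartsWhileCleanupsBlock(10, 9, 200));
    }
}
```

With all 10 threads occupied, the LAUNCH task sits in the queue for as long as the blocked calls take to return; with one thread free, it runs right away. In the report that wait is bounded only by the 15-minute RPC timeout.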