Date: Thu, 5 Feb 2015 01:27:35 +0000 (UTC)
From: "Jun Gong (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-3094) reset timer for liveness monitors after RM recovery

     [ https://issues.apache.org/jira/browse/YARN-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Gong updated YARN-3094:
---------------------------
    Attachment: YARN-3094.5.patch

The failed test case seems unrelated. Resubmitting the same patch.

> reset timer for liveness monitors after RM recovery
> ---------------------------------------------------
>
>                 Key: YARN-3094
>                 URL: https://issues.apache.org/jira/browse/YARN-3094
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3094.2.patch, YARN-3094.3.patch, YARN-3094.4.patch, YARN-3094.5.patch, YARN-3094.patch
>
>
> When the RM restarts, it recovers RMAppAttempts and registers them with AMLivenessMonitor if they are not in a final state. If the recovery process takes a long time (e.g. because there are too many apps), the AMs time out in the RM.
> In our system, the recovery process took about 3 minutes, and all AMs timed out.
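
To illustrate the failure mode described in the issue, here is a minimal sketch in Java. It is not the attached patch and not the actual AMLivenessMonitor code: the class SimpleLivenessMonitor, its methods, and the injectable clock are all hypothetical names chosen for this illustration. The idea is only that entries registered early in a slow recovery look expired by the time recovery ends, unless their timers are reset at that point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical, simplified stand-in for a liveness monitor.
// This is NOT the real YARN AMLivenessMonitor class.
class SimpleLivenessMonitor {
    // Last time (in ms) each registered id was seen alive.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long expiryIntervalMs;
    // Injectable clock (an assumption for this sketch) so time can be simulated.
    private final LongSupplier clock;

    SimpleLivenessMonitor(long expiryIntervalMs, LongSupplier clock) {
        this.expiryIntervalMs = expiryIntervalMs;
        this.clock = clock;
    }

    // Start tracking an id; its timer starts at the current time.
    void register(String id) {
        lastSeen.put(id, clock.getAsLong());
    }

    // Record a heartbeat for an id that is already registered.
    void receivedPing(String id) {
        lastSeen.computeIfPresent(id, (key, old) -> clock.getAsLong());
    }

    // Restart every timer from "now", e.g. once recovery has finished.
    void resetTimers() {
        long now = clock.getAsLong();
        lastSeen.replaceAll((key, old) -> now);
    }

    // Return the ids whose last heartbeat is older than the expiry interval.
    List<String> findExpired() {
        long now = clock.getAsLong();
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (now - entry.getValue() > expiryIntervalMs) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

With a 1-minute expiry interval and a simulated 3-minute recovery, an attempt registered at the start of recovery is reported as expired when recovery ends. Calling resetTimers() at that point gives it a full interval in which to send its next heartbeat.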