Date: Wed, 13 May 2015 10:04:59 +0000 (UTC)
From: "Xianyin Xin (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Created] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

Xianyin Xin created YARN-3639:
---------------------------------

             Summary: It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.
                 Key: YARN-3639
                 URL: https://issues.apache.org/jira/browse/YARN-3639
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Xianyin Xin
            Assignee: Xianyin Xin

If the node running the active RM dies, and the active namenode was running on that same node, the new RM takes a long time to recover all apps.
After analysis, we found that the root cause is the renewal of HDFS tokens during the recovery process. The HDFS client created by the renewer first tries to connect to the original namenode, which times out after 10 to 20 seconds, and only then tries to connect to the new namenode. According to our test, the entire recovery costs about 15 * #apps seconds.
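
To make the 15 * #apps figure concrete, here is a rough back-of-the-envelope sketch. It is not YARN or HDFS code: the function name and parameters are hypothetical, and it assumes the token renewals run one after another, each paying the full time-out against the dead namenode (about 15 seconds, the midpoint of the observed 10 to 20 seconds) before succeeding against the new one.

```python
# Rough model of the recovery time described in this report.
# All names are hypothetical; nothing here is a YARN or HDFS API.

def estimated_recovery_seconds(num_apps, timeout_per_renewal_s=15.0, renewal_s=0.0):
    """Estimate total recovery time when token renewals run sequentially.

    Assumption: every renewal first waits for the connection attempt to the
    dead namenode to time out, then succeeds against the new namenode.
    """
    if num_apps < 0:
        raise ValueError("num_apps must be non-negative")
    return num_apps * (timeout_per_renewal_s + renewal_s)


if __name__ == "__main__":
    for apps in (10, 100, 1000):
        total = estimated_recovery_seconds(apps)
        print(f"{apps} apps -> about {total:.0f} s ({total / 60:.1f} min)")
```

Under these assumptions, a cluster with 1000 apps to recover would need about 15000 seconds, a little over four hours, which is why the per-app time-out dominates the recovery.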