Date: Fri, 8 May 2015 15:53:00 +0000 (UTC)
From: "Vinod Kumar Vavilapalli (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts

     [ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-3480:
------------------------------------------
    Summary: Recovery may get very slow with lots of services with lots of app-attempts  (was: Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable)

bq. Please see above. I think it will be OK for map-reduce jobs. But it might not be OK for service apps which have been running several months.

Tx for explaining the scenario.
Editing title to describe the problem instead, as we are still discussing a solution.

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch
>
>
> When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value. However, this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower. It might be better to limit the number of attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small value, the number of retried attempts might be very large. So we need to delete some attempts stored in RMAppImpl and RMStateStore.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)