Date: Tue, 15 Dec 2015 10:00:54 +0000 (UTC)
From: "Jun Gong (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-3480) Recovery may get very slow with lots
 of services with lots of app-attempts


     [ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Gong updated YARN-3480:
---------------------------
    Attachment: YARN-3480.07.patch

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch, YARN-3480.04.patch, YARN-3480.05.patch, YARN-3480.06.patch, YARN-3480.07.patch
>
>
> When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value. However, this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower. It might be better to cap the number of attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small value, the number of retried attempts might become very large. So we need to delete some of the attempts stored in RMStateStore.
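The two points in the description interact, so a small model may help. The sketch below keeps only the most recent failed attempts for one application, evicting the oldest (standing in for deleting them from RMStateStore), and decides whether to give up by counting failures inside a validity window. This is a hypothetical illustration, not the code in any YARN-3480 patch: the class `BoundedAttemptStore`, its methods, and the `maxStoredAttempts` cap are invented for this example and are not real YARN APIs or configuration properties. `maxAppAttempts` and `validityIntervalMs` merely play the roles of 'yarn.resourcemanager.am.max-attempts' and 'attemptFailuresValidityInterval'.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not YARN code: a per-application attempt store that
// keeps only the most recent maxStoredAttempts failed attempts and decides
// whether to give up by counting failures inside a validity window.
public class BoundedAttemptStore {

    // One stored attempt: its id and the time (ms) at which it failed.
    static final class Attempt {
        final int id;
        final long failTimeMs;

        Attempt(int id, long failTimeMs) {
            this.id = id;
            this.failTimeMs = failTimeMs;
        }
    }

    private final int maxStoredAttempts;   // hypothetical cap on attempts kept in the store
    private final int maxAppAttempts;      // stands in for yarn.resourcemanager.am.max-attempts
    private final long validityIntervalMs; // stands in for attemptFailuresValidityInterval
    private final Deque<Attempt> stored = new ArrayDeque<>(); // oldest first
    private final List<Integer> removedIds = new ArrayList<>();
    private int nextId = 1;

    public BoundedAttemptStore(int maxStoredAttempts, int maxAppAttempts, long validityIntervalMs) {
        // Failures inside the window are always the most recent ones, so keeping
        // at least maxAppAttempts attempts is enough for shouldGiveUp() to stay correct.
        if (maxStoredAttempts < maxAppAttempts) {
            throw new IllegalArgumentException("maxStoredAttempts must be >= maxAppAttempts");
        }
        this.maxStoredAttempts = maxStoredAttempts;
        this.maxAppAttempts = maxAppAttempts;
        this.validityIntervalMs = validityIntervalMs;
    }

    // Records a failed attempt and evicts the oldest stored attempts beyond the cap.
    public int recordFailedAttempt(long failTimeMs) {
        int id = nextId++;
        stored.addLast(new Attempt(id, failTimeMs));
        while (stored.size() > maxStoredAttempts) {
            removedIds.add(stored.pollFirst().id); // would be deleted from the state store
        }
        return id;
    }

    // Counts stored failures that happened within the last validityIntervalMs.
    public int failuresInWindow(long nowMs) {
        int count = 0;
        for (Attempt a : stored) {
            if (a.failTimeMs > nowMs - validityIntervalMs) {
                count++;
            }
        }
        return count;
    }

    // The app gives up only when enough failures fall inside the window.
    public boolean shouldGiveUp(long nowMs) {
        return failuresInWindow(nowMs) >= maxAppAttempts;
    }

    public int storedCount() { return stored.size(); }
    public List<Integer> removedIds() { return removedIds; }
    public int totalAttempts() { return nextId - 1; }

    public static void main(String[] args) {
        BoundedAttemptStore store = new BoundedAttemptStore(3, 2, 1000L);
        // Failures 2 s apart never reach 2 failures within a 1 s window.
        for (long t = 0; t <= 8000; t += 2000) {
            store.recordFailedAttempt(t);
            System.out.println("t=" + t + " giveUp=" + store.shouldGiveUp(t));
        }
        // A second failure 500 ms later puts 2 failures inside the window.
        store.recordFailedAttempt(8500);
        System.out.println("t=8500 giveUp=" + store.shouldGiveUp(8500));
        System.out.println("total=" + store.totalAttempts()
            + " stored=" + store.storedCount() + " removed=" + store.removedIds());
    }
}
```

Running `main` prints `giveUp=false` for the five widely spaced failures and `giveUp=true` at t=8500, ending with `total=6 stored=3 removed=[1, 2, 3]`: six attempts were made even though the limit is 2, yet only three remain stored. The constructor check reflects one possible design choice under these assumptions: because the failures that count toward the limit are always the newest ones, evicting oldest-first is safe as long as the cap is at least the max-attempts value.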


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)