Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Tue, 21 Jul 2015 10:58:05 +0000 (UTC)
From: "Junping Du (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12712078.1399075899000.242690.1437476285394@Atlassian.JIRA>
In-Reply-To: <JIRA.12712078.1399075899000@Atlassian.JIRA>
References: <JIRA.12712078.1399075899000@Atlassian.JIRA>
 <JIRA.12712078.1399075899763@arcas>
Subject: [jira] [Commented] (YARN-2019) Retrospect on decision of making RM
 crashed if any exception throw in ZKRMStateStore
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634951#comment-14634951 ] 

Junping Du commented on YARN-2019:
----------------------------------

+1 on the general idea of YARN-3607. However, users may actually have three options when facing a ZKRMStateStore error:
1. aggressive: fail the RM daemon;
2. conservative: only log the error, without failing the RM daemon or any applications;
3. relatively conservative: do not fail the RM, but fail the application in some cases (for example, when the RM gets restarted).
These choices suggest that we may not want to force the handling policy for all failures into a single configuration, although I agree we should consolidate as many of them as possible, as proposed in YARN-3607.
For this case in particular, I would prefer to add a separate configuration (maybe something like a boolean value for "yarn.resourcemanager.state-store.exit-on-error" or an enum value for "yarn.resourcemanager.state-store.policy-on-error"?) that lets the user choose how RM state store failures are handled. That way users still have other options for other failure cases.
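
To make the enum variant concrete, here is a rough, self-contained sketch of how the three options could map onto a policy value. Everything in it is hypothetical: the class, the enum constants, the accepted string values, and the default (fail the RM, matching the current behavior described in this issue) are assumptions for illustration only, not existing YARN code or a settled design.

```java
import java.util.Locale;

public class StateStoreErrorPolicySketch {

    /** Hypothetical policy values, one per option listed in the comment. */
    enum PolicyOnError { FAIL_RM, LOG_ONLY, FAIL_APPLICATION }

    /** Hypothetical outcome of handling a single state-store error. */
    enum Action { EXIT_RM, LOG_AND_CONTINUE, FAIL_APP_KEEP_RM }

    /**
     * Parses a configured value such as "log-only" into a policy.
     * An unset value falls back to FAIL_RM (assumed default); an unknown
     * value makes Enum.valueOf throw IllegalArgumentException.
     */
    static PolicyOnError parsePolicy(String value) {
        if (value == null || value.trim().isEmpty()) {
            return PolicyOnError.FAIL_RM;
        }
        String normalized = value.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        return PolicyOnError.valueOf(normalized);
    }

    /** Decides what to do with one store error under the given policy. */
    static Action onStoreError(PolicyOnError policy, boolean errorAffectsApplication) {
        switch (policy) {
            case LOG_ONLY:
                return Action.LOG_AND_CONTINUE;
            case FAIL_APPLICATION:
                // Fail only the application involved; the RM itself keeps running.
                return errorAffectsApplication
                        ? Action.FAIL_APP_KEEP_RM
                        : Action.LOG_AND_CONTINUE;
            case FAIL_RM:
            default:
                return Action.EXIT_RM;
        }
    }

    public static void main(String[] args) {
        PolicyOnError policy = parsePolicy("fail-application");
        System.out.println(onStoreError(policy, true));
    }
}
```

A boolean "exit-on-error" could only distinguish option 1 from option 2; the enum form leaves room for option 3 and for further policies later.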

> Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Jian He
>            Priority: Critical
>              Labels: ha
>         Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal exception that crashes the RM. As shown in YARN-1924, the cause could be an internal bug in RM HA itself rather than a genuinely fatal condition. We should revisit this decision, as the HA feature is designed to protect key components, not to disrupt them.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)