Date: Tue, 21 Jul 2015 10:58:05 +0000 (UTC)
From: "Junping Du (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore

[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634951#comment-14634951 ]

Junping Du commented on YARN-2019:
----------------------------------

+1 on the general idea of YARN-3607. However, users actually have three options here when ZKRMStateStore hits an error:

1. Aggressive: fail the RM daemon.
2. Conservative: only log the errors, without failing the RM daemon or any applications.
3. Relatively conservative: do not fail the RM, but fail the application in some cases (for example, when the RM gets restarted).
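To make the three options above concrete, here is a minimal Java sketch of how they might be modeled as a policy enum with a single dispatch point. Every name in it (StoreErrorPolicy, Outcome, StoreErrorHandler) is hypothetical and not taken from the Hadoop code base, and option 3 is simplified to always fail the application, whereas the comment only says "in some cases".

```java
// Hypothetical sketch: none of these types exist in Hadoop YARN.
enum StoreErrorPolicy {
    FAIL_RM,          // option 1: aggressive, the RM daemon exits
    LOG_ONLY,         // option 2: conservative, only log the error
    FAIL_APPLICATION  // option 3: keep the RM up, fail the affected application
}

// What the handler decides to do; a real RM would act on this, not just return it.
enum Outcome { RM_EXITS, ERROR_LOGGED, APPLICATION_FAILED }

class StoreErrorHandler {
    private final StoreErrorPolicy policy;

    StoreErrorHandler(StoreErrorPolicy policy) {
        this.policy = policy;
    }

    // Maps a state store failure to an outcome according to the chosen policy.
    // Simplification: option 3 always fails the application here.
    Outcome onStoreError(Exception cause) {
        switch (policy) {
            case FAIL_RM:
                return Outcome.RM_EXITS;
            case FAIL_APPLICATION:
                return Outcome.APPLICATION_FAILED;
            case LOG_ONLY:
            default:
                return Outcome.ERROR_LOGGED;
        }
    }

    public static void main(String[] args) {
        Exception zkError = new java.io.IOException("simulated ZooKeeper failure");
        for (StoreErrorPolicy p : StoreErrorPolicy.values()) {
            System.out.println(p + " -> " + new StoreErrorHandler(p).onStoreError(zkError));
        }
    }
}
```

Keeping the decision in one place like this would let the policy be chosen per failure type rather than globally, which is the flexibility the options imply.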
These choices suggest that we may not want to force the handling policy for all failures into a single configuration, although I agree we should consolidate them as much as possible, as YARN-3607 proposes. In this case in particular, I would prefer to add a separate configuration (perhaps a boolean value for "yarn.resourcemanager.state-store.exit-on-error" or an enum value for "yarn.resourcemanager.state-store.policy-on-error"?) that lets the user choose how to handle RM state store failures. That way, users still have other options for other failure cases.

> Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
> ------------------------------------------------------------------------------------
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Junping Du
> Assignee: Jian He
> Priority: Critical
> Labels: ha
> Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal exception that crashes the RM. As shown in YARN-1924, the cause could be an internal bug in RM HA itself rather than a truly fatal condition. We should revisit this decision, since the HA feature is designed to protect a key component, not to disrupt it.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)