From: "Tsuyoshi OZAWA (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Date: Mon, 5 May 2014 22:19:18 +0000 (UTC)
Subject: [jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore


    [ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990050#comment-13990050 ]

Tsuyoshi OZAWA commented on YARN-2019:
--------------------------------------

This means that all RMs can terminate when ZK cannot be accessed from the RMs. If we should instead retry until ZK comes back up, one solution is to handle STATE_STORE_OP_FAILED in RMFatalEventDispatcher and transition to standby state. Please see the attached patch.

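As a rough illustration of the idea in the comment above (this is not the attached patch and not Hadoop's actual code), the following Java sketch models a dispatcher that reacts to a state-store failure by stepping down to standby instead of exiting. Only the name STATE_STORE_OP_FAILED is taken from the comment; StandbyOnStoreFailureSketch, Dispatcher, HAState, and OTHER_FATAL_ERROR are hypothetical names invented for this example.

```java
public class StandbyOnStoreFailureSketch {

    /** Hypothetical stand-in for the RM's fatal event types. */
    enum FatalEventType { STATE_STORE_OP_FAILED, OTHER_FATAL_ERROR }

    /** Hypothetical stand-in for the RM's HA state. */
    enum HAState { ACTIVE, STANDBY, TERMINATED }

    /** Toy dispatcher: models the decision only, not the real RMFatalEventDispatcher. */
    static class Dispatcher {
        private HAState state = HAState.ACTIVE;

        void handle(FatalEventType type) {
            if (state == HAState.TERMINATED) {
                return; // nothing left to do once the process has gone down
            }
            if (type == FatalEventType.STATE_STORE_OP_FAILED) {
                // Assumption: a store failure may be transient (e.g. ZK unreachable),
                // so step down to standby rather than exit.
                state = HAState.STANDBY;
            } else {
                // Every other fatal event still terminates, as before.
                state = HAState.TERMINATED;
            }
        }

        HAState state() {
            return state;
        }
    }

    public static void main(String[] args) {
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.handle(FatalEventType.STATE_STORE_OP_FAILED);
        System.out.println(dispatcher.state()); // prints STANDBY
    }
}
```

In this model a standby RM stays alive, so it could presumably be re-elected and retry the store once ZK is reachable again, rather than every RM exiting at the same time.
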
> Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Priority: Critical
>              Labels: ha
>         Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal exception that crashes the RM. As shown in YARN-1924, the cause could be an internal bug in RM HA itself rather than a truly fatal exception. We should revisit this decision, since the HA feature is designed to protect key components, not to disrupt them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)