Date: Thu, 21 Jan 2016 22:28:40 +0000 (UTC)
From: "Jian He (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

    [ https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111507#comment-15111507 ]

Jian He commented on YARN-4497:
-------------------------------

Looks good to me. Two minor comments:

I think setRecoveredFinalState and getRecoveredFinalState do not need to acquire the lock, as they happen sequentially.

This code can be formatted into single lines like below.
{code}
if (preAttempt != null && preAttempt.getRecoveredFinalState() == null) {
  preAttempt.setRecoveredFinalState(RMAppAttemptState.FAILED);
}
{code}

> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>            Priority: Critical
>         Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Found the following problem while discussing YARN-3480.
> If RM fails to store some attempts in RMStateStore, those attempts will be missing from RMStateStore. For example, when storing attempt1, attempt2 and attempt3, RM successfully stores attempt1 and attempt3 but fails to store attempt2. When RM restarts, *RMAppImpl#recover* recovers attempts one by one, so in this case it recovers attempt1, then attempt2. When recovering attempt2, we call *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks up the attempt's ApplicationAttemptStateData. It cannot find it, so an error occurs at *assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
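
For readers of the archive, the failure mode in the quoted description and the check in the snippet above can be illustrated with a small standalone sketch. This is not the YARN-4497 patch and does not use any Hadoop classes: AttemptState, Attempt, the recover method and the map standing in for RMStateStore are all hypothetical names made up for illustration. It assumes that a missing attempt record simply yields a null recovered state, and that an earlier attempt with no recorded final state is treated as FAILED once a later attempt is seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy model of recovering attempts when one record is missing.
class MissingAttemptRecoverySketch {

    /** Hypothetical stand-in for RMAppAttemptState; only the values this sketch needs. */
    enum AttemptState { FINISHED, FAILED }

    /** Hypothetical stand-in for an application attempt. */
    static final class Attempt {
        final int id;
        private AttemptState recoveredFinalState; // null if nothing was stored

        Attempt(int id) { this.id = id; }

        AttemptState getRecoveredFinalState() { return recoveredFinalState; }

        void setRecoveredFinalState(AttemptState state) { recoveredFinalState = state; }
    }

    /**
     * Recovers attempts 1..attemptCount from a map that plays the role of the
     * state store. A missing entry is tolerated instead of asserted on.
     */
    static List<Attempt> recover(int attemptCount, Map<Integer, AttemptState> store) {
        List<Attempt> attempts = new ArrayList<>();
        Attempt preAttempt = null;
        for (int id = 1; id <= attemptCount; id++) {
            Attempt attempt = new Attempt(id);
            attempt.setRecoveredFinalState(store.get(id)); // null when the record is missing
            // Same shape as the check in the review comment: a newer attempt exists,
            // so a previous attempt without a recorded final state is marked FAILED.
            if (preAttempt != null && preAttempt.getRecoveredFinalState() == null) {
                preAttempt.setRecoveredFinalState(AttemptState.FAILED);
            }
            attempts.add(attempt);
            preAttempt = attempt;
        }
        return attempts;
    }

    public static void main(String[] args) {
        // attempt1 and attempt3 were stored; attempt2 was not.
        Map<Integer, AttemptState> store = new HashMap<>();
        store.put(1, AttemptState.FINISHED);
        store.put(3, AttemptState.FINISHED);
        for (Attempt a : recover(3, store)) {
            System.out.println("attempt" + a.id + " -> " + a.getRecoveredFinalState());
        }
    }
}
```

In this toy model the last attempt is left with a null state when its record is missing, since no later attempt exists to show that it ended. How the real recovery code should treat that case is outside what this sketch assumes.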