Date: Wed, 13 Jan 2016 07:17:39 +0000 (UTC)
From: "Rohith Sharma K S (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

    [ https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095740#comment-15095740 ]

Rohith Sharma K S commented on YARN-4497:
-----------------------------------------

As a side note: since YARN-3840 removes the attempts from RMStateStore, this issue (YARN-4584) is very likely to occur even *when RM HA is not configured and fail-fast is false*.

About the solution: during recovery it is a bit tricky to tell *whether-application-is-failed-to-store* apart from *failed-attempts-were-removed-after-interval*. So I think you can combine your solution with [~jianhe]'s idea, so that we can eliminate the *failed-attempts-were-removed-after-interval* attempts and assume that the missing attempts seen during recovery are failed-to-store cases only. Thoughts?

Regarding iterating appState.attempts: it can be sorted before iterating. If the attempts are sorted, there should be no problem with nextAttemptId.

About the patch,
# attempt.recoveredFinalStatus is always being set to FAILED. These attempts might also be KILLED/FINISHED.
# *getNumFailedAppAttempts()* is violated if an attempt failed to store, since that attempt is removed from *attempts*.

Also note that if an attempt failed to store, values such as getNumFailedAppAttempts will not be exact either, since the failure count is taken from the attempts.

> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>            Priority: Critical
>         Attachments: YARN-4497.01.patch
>
>
> Found the following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing attempts in RMStateStore. For example, when storing attempt1, attempt2 and attempt3, RM successfully stored attempt1 and attempt3 but failed to store attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one by one; in this case, we will recover attempt1, then attempt2.
> When recovering attempt2, we call *((RMAppAttemptImpl)this.currentAttempt).recover(state)*. It first tries to find its ApplicationAttemptStateData but cannot, so an error occurs at *assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
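
For illustration only, the recovery problem described in the quoted issue, together with the suggestion to sort attempts before iterating, could be sketched roughly as below. This is not the YARN-4497 patch and does not use the real RMAppImpl or RMStateStore classes. The class AttemptRecoverySketch, the FinalState enum, the Recovered holder and the recover method are all hypothetical names, and a plain map stands in for the stored attempt data. The sketch assumes a missing attempt is represented by an absent key or a null value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, self-contained model; none of these types exist in Hadoop.
final class AttemptRecoverySketch {

    /** Stand-in for the final state stored with each attempt. */
    enum FinalState { FAILED, KILLED, FINISHED }

    /** What a recovery pass produced. */
    static final class Recovered {
        final List<Integer> attemptIds = new ArrayList<>();
        int failedCount = 0;    // only counts attempts that were actually stored
        int nextAttemptId = 1;  // id to use for the next new attempt
    }

    /**
     * Recovers attempts in ascending id order and skips attempts whose
     * state is missing, instead of asserting that the state is non-null.
     */
    static Recovered recover(Map<Integer, FinalState> stored) {
        Recovered result = new Recovered();
        // Copying into a TreeMap sorts by attempt id, so nextAttemptId
        // ends up one past the highest stored id.
        Map<Integer, FinalState> sorted = new TreeMap<>(stored);
        for (Map.Entry<Integer, FinalState> entry : sorted.entrySet()) {
            FinalState state = entry.getValue();
            if (state == null) {
                continue; // attempt data missing: skip it
            }
            result.attemptIds.add(entry.getKey());
            // Keep the stored final state rather than forcing FAILED.
            if (state == FinalState.FAILED) {
                result.failedCount++;
            }
            result.nextAttemptId = entry.getKey() + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // attempt2 was never stored, as in the scenario from the issue.
        Map<Integer, FinalState> stored = new HashMap<>();
        stored.put(3, FinalState.FINISHED);
        stored.put(1, FinalState.FAILED);
        Recovered r = recover(stored);
        System.out.println(r.attemptIds + " failed=" + r.failedCount
                + " next=" + r.nextAttemptId);
    }
}
```

Note that failedCount in this sketch only reflects attempts that were stored, which is the same inexactness pointed out above for getNumFailedAppAttempts.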