Date: Mon, 16 May 2016 12:42:13 +0000 (UTC)
From: "Junping Du (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases

    [ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284439#comment-15284439 ]

Junping Du commented on YARN-4325:
----------------------------------

The remaining checkstyle issue is not valid. [~jlowe], would you mind taking another look?

> Purge app state from NM state-store should cover more LOG_HANDLING cases
> ------------------------------------------------------------------------
>
>                 Key: YARN-4325
>                 URL: https://issues.apache.org/jira/browse/YARN-4325
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: ApplicationImpl.PNG, YARN-4325-v1.1.patch, YARN-4325-v1.patch, YARN-4325-v2.patch, YARN-4325-v3.1.patch, YARN-4325-v3.patch, YARN-4325-v4.1.patch, YARN-4325-v4.patch, YARN-4325.patch
>
>
> On a long-running cluster, we found tens of thousands of stale apps still being recovered during NM restart recovery.
> After investigation, three issues cause app state to leak in the NM state-store:
> 1. APPLICATION_LOG_HANDLING_FAILED is not handled by removing the app from the NMStateStore.
> 2. The APPLICATION_LOG_HANDLING_FAILED event is not sent when the aggregator's doAppLogAggregation() hits an exception.
> 3. Only an Application in FINISHED status that receives APPLICATION_LOG_FINISHED has a transition to remove the app from the NM state store. An Application in another status, such as APPLICATION_RESOURCES_CLEANUP, ignores the event and is never removed from the NM state store, even after the app finishes.
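
The three leak paths in the description can be illustrated with a toy model. This is only a sketch in Python, not NodeManager code and not necessarily what the attached patches do: `ToyApp`, `State`, `Event`, and the in-memory `set` standing in for the state store are all hypothetical names invented for illustration. It assumes one possible remedy, namely remembering that log handling has ended (finished or failed) in any state, and purging the app once it is also FINISHED.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical, simplified application states.
    RUNNING = auto()
    RESOURCES_CLEANUP = auto()
    FINISHED = auto()


class Event(Enum):
    # Hypothetical, simplified events.
    RESOURCES_CLEANED_UP = auto()
    LOG_HANDLING_FINISHED = auto()
    LOG_HANDLING_FAILED = auto()


# Both outcomes mean log handling is over, so both must allow a purge.
LOG_DONE = {Event.LOG_HANDLING_FINISHED, Event.LOG_HANDLING_FAILED}


class ToyApp:
    def __init__(self, app_id, store, state=State.RUNNING):
        self.app_id = app_id
        self.store = store            # a plain set standing in for the state store
        self.state = state
        self.log_handling_done = False
        store.add(app_id)

    def handle(self, event):
        if event in LOG_DONE:
            # Record the event in every state instead of ignoring it.
            self.log_handling_done = True
        elif event is Event.RESOURCES_CLEANED_UP:
            self.state = State.FINISHED
        # Purge only when the app is finished and log handling has ended.
        if self.state is State.FINISHED and self.log_handling_done:
            self.store.discard(self.app_id)


store = set()
app = ToyApp("app_1", store, State.RESOURCES_CLEANUP)
app.handle(Event.LOG_HANDLING_FAILED)    # arrives before the app is FINISHED
still_stored_early = "app_1" in store
app.handle(Event.RESOURCES_CLEANED_UP)   # app reaches FINISHED afterwards
still_stored_late = "app_1" in store
```

In this model the failed-log event arriving during cleanup is remembered rather than dropped, so the entry is purged as soon as the app reaches FINISHED. Issue 2 (the event never being sent on an exception) is outside the model; it would have to be fixed at the sender.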