Date: Thu, 29 Aug 2013 00:25:52 +0000 (UTC)
From: "Karthik Kambatla (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Assigned] (MAPREDUCE-4955) NM container diagnostics for excess resource usage can be lost if task fails while being killed

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla reassigned MAPREDUCE-4955:
-------------------------------------------

    Assignee:     (was: Karthik Kambatla)

> NM container diagnostics for excess resource usage can be lost if task fails while being killed
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4955
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4955
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>
> When a nodemanager kills a container for being over resource budgets, it provides a diagnostics message for the container status explaining why it was killed. However this message can be lost if the task fails during the shutdown from the SIGTERM (e.g.: lost DFS leases because filesystem closed) and notifies the AM via the task umbilical *before* the AM receives the NM's container status message via the RM heartbeat.
> In that case the task attempt fails with the task's failure diagnostic, and the user is left wondering exactly why the task failed because the NM's diagnostics arrive too late, are not written to the history file, and are lost. If the AM receives the container status via the RM heartbeat before the task fails during shutdown then the diagnostics are written properly to the history file, and the user can see why the task failed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
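
The ordering problem described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the MapReduce AM's actual code: the `TaskAttempt` class, its method names, the `replay` helper, and both diagnostic strings are hypothetical and chosen for illustration. The model assumes that whichever failure notification reaches the AM first finalizes the attempt and is written to the history file, and that anything arriving later is dropped.

```python
from typing import List, Optional, Tuple


class TaskAttempt:
    """Hypothetical stand-in for the AM's view of one task attempt."""

    def __init__(self) -> None:
        # Diagnostic written to the (simulated) history file, if any.
        self.history_record: Optional[str] = None

    def _fail(self, diagnostic: str) -> bool:
        # Assumption: the first failure event is terminal; later ones are lost.
        if self.history_record is not None:
            return False
        self.history_record = diagnostic
        return True

    def on_umbilical_failure(self, diagnostic: str) -> bool:
        # Failure reported by the task itself over the task umbilical.
        return self._fail(diagnostic)

    def on_container_status(self, diagnostic: str) -> bool:
        # Container status from the NM, relayed through the RM heartbeat.
        return self._fail(diagnostic)


# Illustrative messages only; these are not real Hadoop log lines.
NM_DIAG = "container killed: over resource budget (illustrative)"
TASK_DIAG = "filesystem closed during shutdown (illustrative)"


def replay(events: List[Tuple[str, str]]) -> Optional[str]:
    """Deliver events to a fresh attempt in order; return what history holds."""
    attempt = TaskAttempt()
    for source, diagnostic in events:
        if source == "umbilical":
            attempt.on_umbilical_failure(diagnostic)
        elif source == "heartbeat":
            attempt.on_container_status(diagnostic)
        else:
            raise ValueError(f"unknown event source: {source}")
    return attempt.history_record


# Umbilical wins the race: the NM's explanation never reaches history.
lost = replay([("umbilical", TASK_DIAG), ("heartbeat", NM_DIAG)])
# Heartbeat wins the race: the NM's explanation is recorded.
kept = replay([("heartbeat", NM_DIAG), ("umbilical", TASK_DIAG)])
```

In this model the only difference between the two runs is delivery order, which matches the report: the user sees the NM's reason only when the container status arrives before the task's own failure notification.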