Date: Fri, 21 Dec 2012 22:19:13 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-4833) Task can get stuck in FAIL_CONTAINER_CLEANUP

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538472#comment-13538472 ]

Jason Lowe commented on MAPREDUCE-4833:
---------------------------------------

+1, thanks for writing a test.

> Task can get stuck in FAIL_CONTAINER_CLEANUP
> --------------------------------------------
>
>                 Key: MAPREDUCE-4833
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4833
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.5
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Parker
>            Priority: Critical
>         Attachments: MAPREDUCE4833-1.patch, MAPREDUCE4833-2.patch, MAPREDUCE4833.patch
>
>
> If an NM goes down and the AM still tries to launch a container on it, the ContainerLauncherImpl can get stuck in an RPC timeout. At the same time, the RM may notice that the NM has gone away and inform the AM, which triggers a TA_FAILMSG. If the TA_FAILMSG arrives at the TaskAttemptImpl before the TA_CONTAINER_LAUNCH_FAILED message, the task attempt will try to kill the container, but the ContainerLauncherImpl will not send back a TA_CONTAINER_CLEANED event, leaving the attempt stuck.
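
For readers following the thread, the event ordering described in the issue can be reproduced with a toy state machine. This is only a rough sketch under stated assumptions: the class StuckAttemptSketch, the three-state table, and the replay helper are hypothetical and much simpler than the real TaskAttemptImpl transition table. Only the state and event names come from the issue text. It models the pre-patch behavior in which nothing ever delivers TA_CONTAINER_CLEANED, and it does not show what the attached patches change.

```java
import java.util.Arrays;
import java.util.List;

public class StuckAttemptSketch {
    // State and event names are taken from the issue; the table below is an assumption.
    enum State { ASSIGNED, FAIL_CONTAINER_CLEANUP, FAILED }
    enum Event { TA_FAILMSG, TA_CONTAINER_LAUNCH_FAILED, TA_CONTAINER_CLEANED }

    // Hypothetical, simplified transition function (not the real TaskAttemptImpl table).
    static State next(State state, Event event) {
        switch (state) {
            case ASSIGNED:
                // A fail message asks for container cleanup and waits for confirmation.
                if (event == Event.TA_FAILMSG) return State.FAIL_CONTAINER_CLEANUP;
                // A launch failure ends the attempt directly in this simplified model.
                if (event == Event.TA_CONTAINER_LAUNCH_FAILED) return State.FAILED;
                return state;
            case FAIL_CONTAINER_CLEANUP:
                // Only a cleanup confirmation moves the attempt forward;
                // every other event, including a late launch failure, is ignored.
                if (event == Event.TA_CONTAINER_CLEANED) return State.FAILED;
                return state;
            default:
                return state;
        }
    }

    // Feed a sequence of events to an attempt that starts in ASSIGNED.
    static State replay(List<Event> events) {
        State state = State.ASSIGNED;
        for (Event event : events) {
            state = next(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        // Launch failure arrives first: the attempt terminates.
        System.out.println(replay(Arrays.asList(
                Event.TA_CONTAINER_LAUNCH_FAILED, Event.TA_FAILMSG)));
        // Fail message arrives first and no TA_CONTAINER_CLEANED follows: the attempt is stuck.
        System.out.println(replay(Arrays.asList(
                Event.TA_FAILMSG, Event.TA_CONTAINER_LAUNCH_FAILED)));
    }
}
```

Running main prints FAILED for the first ordering and FAIL_CONTAINER_CLEANUP for the second, which is the stuck case the issue describes.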