Date: Thu, 27 Oct 2011 13:10:33 +0000 (UTC)
From: "Vinod Kumar Vavilapalli (Updated) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1911709346.24629.1319721033005.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1649089481.15024.1319112790745.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (MAPREDUCE-3228) MR AM hangs when one node goes bad

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3228:
-----------------------------------------------
    Status: Patch Available  (was: Open)

> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went really bad while the job had three containers running on it. Eventually, the AM marked the tasks as timed out and initiated cleanup of the failed containers via {{stopContainer()}}. The latter got stuck at the faulty node, so the tasks are stuck in the FAIL_CONTAINER_CLEANUP stage and the job sits there waiting forever.
> Thanks to [~Karams] for helping with this.
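The hang described above comes from a blocking cleanup call that never returns when the target node is unresponsive. A minimal sketch of one general way to bound such a call follows. It is not the attached patch and does not use any Hadoop or YARN API: the names BoundedCleanup, CleanupCall and runWithTimeout are hypothetical, and the lambda passed in only stands in for a call such as {{stopContainer()}}.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Hadoop: bounds a blocking cleanup call.
class BoundedCleanup {

    /** Hypothetical stand-in for a blocking call such as stopContainer(). */
    public interface CleanupCall {
        void run() throws Exception;
    }

    /**
     * Runs the call on a separate daemon thread and waits at most timeoutMs.
     * Returns true if the call completed normally, false if it timed out,
     * threw an exception, or the waiting thread was interrupted.
     */
    public static boolean runWithTimeout(CleanupCall call, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-cleanup");
            t.setDaemon(true); // a stuck call must not keep the JVM alive
            return t;
        });
        Future<?> pending = pool.submit(() -> {
            call.run();
            return null;
        });
        try {
            pending.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupt the stuck call; the caller moves on
            return false;
        } catch (ExecutionException e) {
            return false; // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pending.cancel(true);
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A healthy node: the cleanup call returns immediately.
        boolean healthy = runWithTimeout(() -> { }, 1000);
        // A faulty node, simulated by a call that blocks far past the timeout.
        boolean faulty = runWithTimeout(() -> Thread.sleep(60_000), 200);
        System.out.println("healthy node cleaned up: " + healthy);
        System.out.println("faulty node cleaned up: " + faulty);
    }
}
```

With a bound like this, the caller gets a definite answer after the timeout and can move the task out of its cleanup state instead of waiting indefinitely. Whether the actual fix takes this approach is not stated in this message.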