From: "Thomas Graves (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Date: Tue, 1 May 2012 13:39:50 +0000 (UTC)
Message-ID: <38966059.13159.1335879590377.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Created] (MAPREDUCE-4214) nodemanager should cleanup running containers when it starts

Thomas Graves created MAPREDUCE-4214:
----------------------------------------

             Summary: nodemanager should cleanup running containers when it starts
                 Key: MAPREDUCE-4214
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4214
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2, nodemanager
    Affects Versions: 0.23.3
            Reporter: Thomas Graves


Currently the nodemanager (NM) doesn't clean up running containers when it gets restarted. This can cause containers to get lost and stick around forever. We've seen this happen multiple times when the RM is restarted. When the RM is brought back up, it doesn't know what was running on the cluster, so it tells the NMs to reboot, and when an NM reboots it loses track of what it had running. If any containers are behaving badly, there is no one left that knows about them to kill them.

We should kill any running containers when the nodemanager is being started. Note that when the NM is being brought up it needs to somehow figure out what containers were running and be sure it doesn't kill anything it shouldn't.

Note, we should also try to kill any running containers when the nodemanager is shutting down (JIRA 4213 was filed for this).

This might change a bit when RM restart is implemented, if tasks can actually survive across the RM/NM being rebooted, but that can be addressed at that point.