From: "Robert Joseph Evans (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Mon, 28 Nov 2011 22:07:40 +0000 (UTC)
Message-ID: <87734329.19863.1322518060347.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <38424299.5025.1322021920158.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158848#comment-13158848 ]

Robert Joseph Evans commented on MAPREDUCE-3460:
------------------------------------------------

+1 to Hitesh's patch, at least as a quick fix. I can try to reproduce the issue here and verify that the patch does indeed fix it. I can also add a few unit tests and turn it into a real patch if you like.

I would also like some feedback on a potential long-term refactor of the code, which would be done on a separate JIRA after 0.23 stabilizes. It seems to me that the root cause of this issue is that a special condition for a FAST_FAIL_MAP was missed. The code right now is written with lots of if/else statements separating map tasks from reduce tasks, and both from failed map tasks, etc. I think it would be cleaner to replace the if statements with classes that use polymorphism to change the methods called. This would make the different handling of a failed map, a normal map, and a reduce more evident. It would also force the internal data structures that keep track of the different types of tasks to be combined.

This is just something that popped into my head while evaluating Hitesh's fix. I have not really evaluated what it would take to make it work; I would just like some feedback on the idea before filing a JIRA about it.

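To make the refactoring idea above concrete, here is a minimal sketch of what "classes that use polymorphism" could look like. Everything in it is hypothetical: the class names, the method names, and the returned action strings do not come from the Hadoop code base, and the priorities for normal maps and reduces are illustrative only (only priority 5 for a failed map is taken from the issue description). It is meant to show the shape of the design, not a proposed patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base type: one subclass per kind of task the allocator tracks. */
abstract class PendingTask {
    final String id;

    PendingTask(String id) { this.id = id; }

    /** Request priority for this kind of task. */
    abstract int priority();

    /** What to do when an offered container sits on a blacklisted host. */
    abstract String onBlacklistedHost(String host);
}

class MapTask extends PendingTask {
    MapTask(String id) { super(id); }
    int priority() { return 20; }  // illustrative value
    String onBlacklistedHost(String host) {
        return "re-request " + id + " without host " + host;
    }
}

class FailedMapTask extends PendingTask {
    FailedMapTask(String id) { super(id); }
    int priority() { return 5; }   // priority quoted in the issue description
    String onBlacklistedHost(String host) {
        return "re-request " + id + " on any host at priority " + priority();
    }
}

class ReduceTask extends PendingTask {
    ReduceTask(String id) { super(id); }
    int priority() { return 10; }  // illustrative value
    String onBlacklistedHost(String host) {
        return "re-request " + id + " on any host";
    }
}

class PolymorphicAllocatorSketch {
    /** One collection for every task kind instead of one structure per kind. */
    private final List<PendingTask> pending = new ArrayList<>();

    void add(PendingTask task) { pending.add(task); }

    /** Matches on priority, then lets the task's own class decide the handling. */
    List<String> handleBlacklistedContainer(int containerPriority, String host) {
        List<String> actions = new ArrayList<>();
        for (PendingTask task : pending) {
            if (task.priority() == containerPriority) {
                actions.add(task.onBlacklistedHost(host));
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        PolymorphicAllocatorSketch allocator = new PolymorphicAllocatorSketch();
        allocator.add(new MapTask("m1"));
        allocator.add(new FailedMapTask("m2"));
        allocator.add(new ReduceTask("r1"));
        for (String action : allocator.handleBlacklistedContainer(5, "bad-host")) {
            System.out.println(action);
        }
    }
}
```

The point of the design is that a missed special case becomes a missing method override, which the compiler flags, rather than a missing branch in an if/else chain.
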
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Priority: Blocker
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
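
The failure sequence in the quoted description can be modeled with a toy example. This is only one reading of that description, not the actual allocator code: the class `AskListSketch` and its methods are hypothetical, requests are reduced to bare priorities, and the priority 20 used for a normal map request is an assumption made for illustration. The field names `remoteRequestsTable` and `ask` are borrowed from the description as labels only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the bookkeeping described in the issue; all names are hypothetical. */
class AskListSketch {
    // priority -> number of containers still wanted (stand-in for 'remoteRequestsTable')
    final Map<Integer, Integer> remoteRequestsTable = new HashMap<>();
    // priorities to be sent on the next heartbeat (stand-in for the 'ask' set)
    final Set<Integer> ask = new HashSet<>();

    /** Records a new request and queues it for the next heartbeat. */
    void request(int priority) {
        remoteRequestsTable.merge(priority, 1, Integer::sum);
        ask.add(priority);
    }

    /** Returns what would be sent to the RM, then clears the ask set. */
    Set<Integer> heartbeat() {
        Set<Integer> sent = new HashSet<>(ask);
        ask.clear();
        return sent;
    }

    /** Re-queues whichever request the hostname lookup happened to match. */
    void reAsk(int matchedPriority) {
        ask.add(matchedPriority);
    }

    public static void main(String[] args) {
        AskListSketch am = new AskListSketch();
        am.request(5);    // failed map
        am.request(20);   // normal map (assumed priority)
        am.heartbeat();   // both requests sent, ask set cleared

        // A priority-5 container arrives on a blacklisted host and is rejected.
        // The hostname-based lookup matches the priority-20 request instead.
        am.reAsk(20);

        System.out.println("still tracked: " + am.remoteRequestsTable.keySet());
        System.out.println("sent next:     " + am.heartbeat());
    }
}
```

In this model the priority-5 entry stays in the table but is absent from every later heartbeat, which matches the hang the description reports.
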