Date: Wed, 19 Oct 2011 19:39:10 +0000 (UTC)
From: "Hitesh Shah (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <494564629.11887.1319053150676.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1836354880.19050.1310769242028.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

[
https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130909#comment-13130909 ]

Hitesh Shah commented on MAPREDUCE-2693:
----------------------------------------

bq. Do we need to remove the rack entries from ask and remoteRequestTable also? (The TODO at the end)

I don't believe we should blacklist a rack based on a single node's failure. How we decide to blacklist racks needs more thought, since node failures could be correlated with rack/switch failures. I updated the comment with more information on what we need to account for when blacklisting a rack, and I will probably open a jira that we can use as a discussion board for deciding which approach to apply when blacklisting a rack.

bq. getFilteredContainerRequest(): Why look for both IP addresses and host-names to check if they are/aren't blacklisted?

I had added that because there was some confusion in the code about handling hostnames and IPs. Now that containers also use hostnames, all code in the allocator/requestor has been changed to use hostnames only.

bq. Test: It is not clear to me why we need five iterations in that loop, is it possible to make it deterministic or more explicit?

It was required because nodes blacklisted by the AM could still be assigned back to it by the RM. I changed the code to mark the blacklisted nodes as not healthy, which makes the test cleaner and deterministic.
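As a rough illustration of the hostname-only filtering mentioned in the comment above, here is a minimal sketch. It is not the code from the attached patches: the class BlacklistFilterSketch, the method filterHosts, and the host names are all hypothetical, and the real getFilteredContainerRequest() works on container request objects rather than plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the filtering step; not the actual Hadoop class.
class BlacklistFilterSketch {

    // Returns the requested hosts with blacklisted hostnames removed.
    // Only hostnames are compared; no IP address lookup is attempted.
    static List<String> filterHosts(List<String> requestedHosts,
                                    Set<String> blacklistedNodes) {
        List<String> allowed = new ArrayList<>();
        for (String host : requestedHosts) {
            if (!blacklistedNodes.contains(host)) {
                allowed.add(host);
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        Set<String> blacklisted = new HashSet<>(Arrays.asList("host2"));
        List<String> requested = Arrays.asList("host1", "host2", "host3");
        // Prints [host1, host3]
        System.out.println(filterHosts(requested, blacklisted));
    }
}
```

Keeping the comparison on a single identifier type avoids the case where a node is blacklisted under its hostname but still requested under its IP address.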
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch
>
>
> The following exception in AM of an application at the top of queue causes this. Once this happens, AM keeps obtaining
> containers from RM and simply loses them. Eventually on a cluster with multiple jobs, no more scheduling happens
> because of these lost containers.
> It happens when there are blacklisted nodes at the app level in AM. A bug in AM
> (RMContainerRequestor.containerFailedOnHost(hostName)) is causing this - nodes are simply getting removed from the
> request-table. We should make sure RM also knows about this update.
> ========================================================================
> 11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4978 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4977 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1540 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1539 #asks=6
> 11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
>         at java.lang.Thread.run(Thread.java:619)
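
The failure mode in the description (a host entry is dropped from the request table, and a later decrement for that host dereferences a missing entry) can be sketched as follows. This is a simplified model under stated assumptions, not the actual RMContainerRequestor or the attached patch: the class RequestTableSketch, its methods, and the idea of queuing a zero-count ask so that the RM learns about the removal are all hypothetical choices made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of a per-host request table plus the
// pending "ask" that would be sent to the RM on the next heartbeat.
class RequestTableSketch {
    // hostname -> number of containers still wanted on that host
    private final Map<String, Integer> requestTable = new HashMap<>();
    // hostname -> updated count to report to the RM
    private final Map<String, Integer> ask = new LinkedHashMap<>();

    void addRequest(String host) {
        int count = requestTable.merge(host, 1, Integer::sum);
        ask.put(host, count);
    }

    // Assumption: on blacklisting, also queue a zero-count ask so the RM
    // stops allocating on this host instead of silently dropping the entry.
    void containerFailedOnHost(String host) {
        if (requestTable.remove(host) != null) {
            ask.put(host, 0);
        }
    }

    // Guarded decrement: returns false when the entry was already removed,
    // where an unguarded lookup would throw a NullPointerException.
    boolean decRequest(String host) {
        Integer current = requestTable.get(host);
        if (current == null) {
            return false;
        }
        int remaining = current - 1;
        if (remaining == 0) {
            requestTable.remove(host);
        } else {
            requestTable.put(host, remaining);
        }
        ask.put(host, remaining);
        return true;
    }

    Map<String, Integer> pendingAsk() {
        return ask;
    }

    public static void main(String[] args) {
        RequestTableSketch table = new RequestTableSketch();
        table.addRequest("host1");
        table.addRequest("host1");
        table.containerFailedOnHost("host1");
        // The decrement is skipped rather than failing on a missing entry.
        System.out.println(table.decRequest("host1"));
        System.out.println(table.pendingAsk());
    }
}
```

The two points the sketch tries to capture are the null check before decrementing and the explicit update queued for the RM when a host is removed.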