hadoop-mapreduce-issues mailing list archives

From "Hitesh Shah (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM
Date Wed, 19 Oct 2011 19:39:10 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130909#comment-13130909 ]

Hitesh Shah commented on MAPREDUCE-2693:
----------------------------------------

bq. Do we need to remove the rack entries from ask and remoteRequestTable also? (The TODO
at the end) 

I don't believe we should blacklist a rack based on a single node's failure. How we decide
to blacklist racks needs more thought, since node failures could be correlated with
rack/switch failures. I updated the comment with more information on what we need to account
for when blacklisting a rack, and I will probably open a jira that we can use as a discussion
board for which approach we should apply when blacklisting a rack.
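
As a rough illustration of one possible rule (not something the patch implements), a rack
could be blacklisted only after some minimum number of distinct nodes on it have been
blacklisted. The class name, method names, and threshold below are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical policy: blacklist a rack only after enough distinct nodes on it fail. */
final class RackBlacklistPolicy {
    // Assumed threshold; choosing a real value is the open question in the comment above.
    private final int minFailedNodes;
    private final Map<String, Set<String>> failedNodesByRack = new HashMap<>();

    RackBlacklistPolicy(int minFailedNodes) {
        if (minFailedNodes < 1) {
            throw new IllegalArgumentException("minFailedNodes must be at least 1");
        }
        this.minFailedNodes = minFailedNodes;
    }

    /** Records a blacklisted node and reports whether its rack now counts as blacklisted. */
    boolean nodeBlacklisted(String rack, String host) {
        // A set is used so that repeated failures of one node are counted once.
        failedNodesByRack.computeIfAbsent(rack, r -> new HashSet<>()).add(host);
        return isRackBlacklisted(rack);
    }

    boolean isRackBlacklisted(String rack) {
        Set<String> failed = failedNodesByRack.get(rack);
        return failed != null && failed.size() >= minFailedNodes;
    }
}
```

With a threshold of 2, a single bad node (even one that fails repeatedly) leaves the rack
usable, while a second distinct node tips it over.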

bq. getFilteredContainerRequest(): Why look for both IP addresses and host-names to check
if they are/aren't blacklisted? 

I had added that because there was some confusion in the code about handling hostnames and
IPs. Now that containers also use hostnames, all code in the allocator/requestor has been
changed to use hostnames only.
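
A minimal sketch of what a hostname-only check looks like, assuming the blacklist is kept as
a set of hostnames. This is not the body of getFilteredContainerRequest(); the class and
method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: drops blacklisted hosts from a request, comparing hostnames only. */
final class HostFilterSketch {
    private HostFilterSketch() {}

    static List<String> filterHosts(List<String> requestedHosts, Set<String> blacklistedHosts) {
        List<String> allowed = new ArrayList<>();
        for (String host : requestedHosts) {
            // No IP lookup here: both sides are assumed to hold hostnames already.
            if (!blacklistedHosts.contains(host)) {
                allowed.add(host);
            }
        }
        return allowed;
    }
}
```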

bq. Test: It is not clear to me why we need five iterations in that loop, is it possible to
make it deterministic or more explicit?

That was required because nodes blacklisted by the AM could still be assigned back to it by
the RM. I changed the code a bit to mark the blacklisted nodes as unhealthy, which makes the
test cleaner and deterministic.

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch
>
>
> The following exception in AM of an application at the top of queue causes this. Once this happens, AM keeps obtaining
> containers from RM and simply loses them. Eventually on a cluster with multiple jobs, no more scheduling happens
> because of these lost containers.
> It happens when there are blacklisted nodes at the app level in AM. A bug in AM
> (RMContainerRequestor.containerFailedOnHost(hostName)) is causing this - nodes are simply getting removed from the
> request-table. We should make sure RM also knows about this update.
> ========================================================================
> 11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20
> resourceName=... numContainers=4978 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20
> resourceName=... numContainers=4977 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20
> resourceName=... numContainers=1540 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20
> resourceName=... numContainers=1539 #asks=6
> 11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM. 
> java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
>         at java.lang.Thread.run(Thread.java:619)
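
The failure mode in the quoted description can be modeled roughly as follows. This is a
simplified sketch, not the actual RMContainerRequestor code: only the names
containerFailedOnHost and decResourceRequest come from the report, while the map-based
request table, the ask set, and the other methods are assumptions made for illustration.
The idea is that blacklisting a host zeroes its outstanding count and queues that change for
the RM instead of silently removing the entry, and that a later decrement tolerates a missing
entry rather than dereferencing null.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of an AM-side request table keyed by hostname. */
final class RequestTableSketch {
    private final Map<String, Integer> requestTable = new HashMap<>();
    // Hosts whose counts changed since the last heartbeat and must be sent to the RM.
    private final Set<String> ask = new HashSet<>();

    void addRequest(String host) {
        requestTable.merge(host, 1, Integer::sum);
        ask.add(host);
    }

    /** Zeroes the host's count and queues the update so the RM learns about it. */
    void containerFailedOnHost(String host) {
        if (requestTable.containsKey(host)) {
            requestTable.put(host, 0);
            ask.add(host);
        }
    }

    /** Decrements the count; a missing or already zeroed entry is ignored. */
    void decResourceRequest(String host) {
        Integer remaining = requestTable.get(host);
        if (remaining == null || remaining == 0) {
            return;
        }
        requestTable.put(host, remaining - 1);
        ask.add(host);
    }

    int outstanding(String host) {
        return requestTable.getOrDefault(host, 0);
    }

    /** Returns the pending updates and clears them, as a heartbeat would. */
    Set<String> drainAsk() {
        Set<String> pending = new HashSet<>(ask);
        ask.clear();
        return pending;
    }
}
```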
