hadoop-yarn-issues mailing list archives

From "Lohit Vijayarenu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1032) NPE in RackResolve
Date Mon, 05 Aug 2013 21:30:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729977#comment-13729977 ]

Lohit Vijayarenu commented on YARN-1032:
----------------------------------------

Once we hit an exception in RackResolver, because it is not caught and no default rack is returned,
we end up not releasing the containers that could not be assigned in RMContainerAllocator.java:

{noformat}

      assignContainers(allocatedContainers);
       
      // release container if we could not assign it 
      it = allocatedContainers.iterator();
      while (it.hasNext()) {
        Container allocated = it.next();
        LOG.info("Releasing unassigned and invalid container " 
            + allocated + ". RM may have assignment issues");
        containerNotAssigned(allocated);
      }
{noformat}

The AM no longer asks for new containers because it thinks containers are assigned, and the RM assumes
the containers are allocated to the AM. The job ends up hanging forever without making any progress.
Fixing the container release might be part of another JIRA; at a minimum we need to catch the exception
and return the default rack in case of failure.
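
To make the proposed minimum fix concrete, here is a rough sketch of "catch the exception and return the default rack". It is not the actual RackResolver code or a patch for it: RackMapping is a hypothetical stand-in for the configured rack mapping (for example a topology script), SafeRackResolver and resolveOrDefault are made-up names, and the fallback value is assumed to match Hadoop's NetworkTopology.DEFAULT_RACK ("/default-rack").

```java
import java.util.Collections;
import java.util.List;

public class SafeRackResolver {

    /** Hypothetical stand-in for the configured host-to-rack mapping. */
    interface RackMapping {
        List<String> resolve(List<String> hostNames);
    }

    // Assumed fallback; intended to match NetworkTopology.DEFAULT_RACK in Hadoop.
    static final String DEFAULT_RACK = "/default-rack";

    /**
     * Resolves the rack for a host. Falls back to DEFAULT_RACK when the
     * mapping throws, returns null, returns an empty list, or returns a
     * null entry, so the caller never sees an NPE from rack resolution.
     */
    static String resolveOrDefault(RackMapping mapping, String hostName) {
        try {
            List<String> racks = mapping.resolve(Collections.singletonList(hostName));
            if (racks == null || racks.isEmpty() || racks.get(0) == null) {
                return DEFAULT_RACK;
            }
            return racks.get(0);
        } catch (RuntimeException e) {
            // A failing resolve script should not abort the allocator heartbeat.
            return DEFAULT_RACK;
        }
    }

    public static void main(String[] args) {
        // Mapping that fails to return a rack, as in the reported case.
        RackMapping broken = names -> null;
        // Mapping that resolves normally.
        RackMapping healthy = names -> Collections.singletonList("/rack1");

        System.out.println(resolveOrDefault(broken, "host1"));
        System.out.println(resolveOrDefault(healthy, "host1"));
    }
}
```

With a fallback like this, a failed lookup degrades to the default rack instead of throwing out of the assignment path, so the release loop quoted above would still get a chance to run.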
                
> NPE in RackResolve
> ------------------
>
>                 Key: YARN-1032
>                 URL: https://issues.apache.org/jira/browse/YARN-1032
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.0.5-alpha
>         Environment: linux
>            Reporter: Lohit Vijayarenu
>            Priority: Minor
>
> We found a case where our rack resolve script was not returning a rack due to a problem
> resolving the host address. This exception was seen in RackResolver.java as an NPE, ultimately
> caught in RMContainerAllocator.
> {noformat}
> 2013-08-01 07:11:37,708 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:99)
> 	at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:92)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignMapsWithLocality(RMContainerAllocator.java:1039)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignContainers(RMContainerAllocator.java:925)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:861)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$400(RMContainerAllocator.java:681)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:243)
> 	at java.lang.Thread.run(Thread.java:722)
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
