hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1694) RM is shutting down when an NM is added to cluster without updating the hostname in /etc/hosts
Date Fri, 07 Feb 2014 09:38:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894345#comment-13894345 ]

Sunil G commented on YARN-1694:
-------------------------------

One possible quick fix is to suppress this exception and not allocate any resources
on this node.
The RM can then continue functioning normally.
If we just suppress the exception thrown while resolving the IP to get the container token,
the NODE_UPDATE call will not be affected.

      Token containerToken = null;
      try {
        containerToken = createContainerToken(application, container);
      } catch (Throwable t) {
        // Token creation failed (e.g. the node's hostname could not be resolved).
        // Allocate nothing on this node and let the scheduler continue normally.
        return Resources.none();
      }


On second thought, would a node validity check (i.e. whether the node's IP can be resolved)
in ResourceTrackerService be a good idea?
We could then try to shut down this NM.
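
To make the second idea concrete, here is a minimal, standalone sketch of such a resolvability check. It is not YARN code: the class NodeHostCheck, the HostResolver interface and the method isResolvable are hypothetical names used only for illustration, and where exactly such a check would be called from ResourceTrackerService is left open. The lookup is pluggable so the check can be tried out without real DNS or /etc/hosts entries.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration only; not part of YARN. */
public class NodeHostCheck {

  /** Pluggable lookup so the check can be exercised without real name resolution. */
  public interface HostResolver {
    InetAddress resolve(String host) throws UnknownHostException;
  }

  /** Default resolver backed by the JVM's name service (/etc/hosts, DNS). */
  public static final HostResolver SYSTEM = InetAddress::getByName;

  /**
   * Returns true only if the given hostname is non-empty and the resolver
   * can map it to an address. An UnknownHostException is reported as false
   * instead of being propagated to the caller.
   */
  public static boolean isResolvable(String host, HostResolver resolver) {
    if (host == null || host.isEmpty()) {
      // InetAddress.getByName would map these to loopback, so reject them here.
      return false;
    }
    try {
      return resolver.resolve(host) != null;
    } catch (UnknownHostException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String host = args.length > 0 ? args[0] : "localhost";
    System.out.println(host + " resolvable: " + isResolvable(host, SYSTEM));
  }
}
```

With a check of this kind, an NM whose hostname the RM cannot resolve could be rejected at registration time, rather than being discovered later in the scheduler's NODE_UPDATE path.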

Please share your thoughts.

> RM is shutting down when an NM is added to cluster without updating the hostname in /etc/hosts
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-1694
>                 URL: https://issues.apache.org/jira/browse/YARN-1694
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Sunil G
>            Priority: Critical
>
> A new NM is added to the cluster, but the hostname mapping of this NM is not updated in /etc/hosts on the RM.
> NM registration is successful without any problems.
> When a job is submitted, the RM shuts down with the exception below.
> 2013-10-04 04:37:37,611 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: java.net.UnknownHostException: host-10-18-40-120
>         at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
>         at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
>         at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:195)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.createContainerToken(LeafQueue.java:1296)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1344)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1210)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1169)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:870)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:645)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:707)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:751)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:93)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:449)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.UnknownHostException: host-10-18-40-120
>         ... 15 more
> 2013-10-04 04:37:37,614 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
