hadoop-yarn-issues mailing list archives

From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5918) Opportunistic scheduling allocate request failure when NM lost
Date Sun, 20 Nov 2016 20:24:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15681783#comment-15681783 ]

Varun Saxena commented on YARN-5918:
------------------------------------

While adding null checks fixes the NPE, can something else be done, or does something else need to be done?
If we fix the code as above, we will return fewer nodes for scheduling opportunistic containers
than the yarn.opportunistic-container-allocation.nodes-used configuration, even though enough nodes
are available. But this should be corrected the very next second (as per the default config), which
may be fine.
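
For illustration, the null-check approach could look roughly like this (just a sketch; the scheduler lookup and the RemoteNode construction are assumptions based on the stack trace, not the actual patch):

{code:java}
// Sketch only: method names come from the stack trace; the lookup and
// RemoteNode construction below are assumptions, not the exact code.
private RemoteNode convertToRemoteNode(NodeId nodeId) {
  SchedulerNode node =
      ((AbstractYarnScheduler) rmContext.getScheduler()).getNode(nodeId);
  // A lost NM can still be present in the sorted-node snapshot, so the
  // scheduler lookup may return null; return null instead of throwing NPE.
  return node == null ? null
      : RemoteNode.newInstance(nodeId, node.getHttpAddress());
}

private List<RemoteNode> convertToRemoteNodes(List<NodeId> nodeIds) {
  List<RemoteNode> remoteNodes = new ArrayList<>();
  for (NodeId nodeId : nodeIds) {
    RemoteNode remoteNode = convertToRemoteNode(nodeId);
    // Skip nodes that are no longer known to the scheduler.
    if (remoteNode != null) {
      remoteNodes.add(remoteNode);
    }
  }
  return remoteNodes;
}
{code}

This is also why fewer nodes than configured can be returned: every skipped (lost) node shrinks the list handed back to the opportunistic allocator.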

Cluster nodes are sorted in NodeQueueLoadMonitor every 1 second by default and stored in a list.
Although we remove a node from the cluster nodes when it is lost, we do not remove it from the
sorted nodes, because doing so would require iterating over the list. Can we keep a set instead?
Also, when an allocate request comes in and we get the least loaded nodes, we simply create a
sublist from the sorted nodes. We could instead iterate over the list and check whether each node
is still running to avoid the NPE, but that would be slower than creating a sublist, especially
if the number of nodes configured for scheduling opportunistic containers is much larger than
the default of 10.
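
For illustration, the iterate-and-check alternative could look roughly like this (field names such as sortedNodes and clusterNodes, and the locking, are assumptions about NodeQueueLoadMonitor's internals):

{code:java}
// Sketch of filtering stale entries while picking the k least loaded nodes.
// Names and synchronization are illustrative assumptions only.
public List<NodeId> selectLeastLoadedNodes(int k) {
  List<NodeId> selected = new ArrayList<>(k);
  synchronized (sortedNodes) {
    for (NodeId nodeId : sortedNodes) {
      // clusterNodes is the authoritative view; a lost NM has already been
      // removed from it, so stale entries in the sorted list get skipped.
      if (clusterNodes.containsKey(nodeId)) {
        selected.add(nodeId);
        if (selected.size() == k) {
          break;
        }
      }
    }
  }
  return selected;
}
{code}

The result stays consistent with current cluster membership, but at the cost of scanning past stale entries, which is the slowdown mentioned above when the configured node count is much larger than 10.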

I guess we can check with the folks working on distributed scheduling before deciding on a fix.
cc [~asuresh] 

> Opportunistic scheduling allocate request failure when NM lost
> --------------------------------------------------------------
>
>                 Key: YARN-5918
>                 URL: https://issues.apache.org/jira/browse/YARN-5918
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-5918.0001.patch
>
>
> Allocate request failure during Opportunistic container allocation when nodemanager is lost
> {noformat}
> 2016-11-20 10:38:49,011 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root     OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1479637990302_0002    CONTAINERID=container_e12_1479637990302_0002_01_000006  RESOURCE=<memory:1024, vCores:1>
> 2016-11-20 10:38:49,011 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node docker2:38297 clusterResource: <memory:4096, vCores:8>
> 2016-11-20 10:38:49,434 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8030, call Call#35 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 172.17.0.2:51584
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNode(OpportunisticContainerAllocatorAMService.java:420)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNodes(OpportunisticContainerAllocatorAMService.java:412)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.getLeastLoadedNodes(OpportunisticContainerAllocatorAMService.java:402)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.allocate(OpportunisticContainerAllocatorAMService.java:236)
>         at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>         at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2539)
> 2016-11-20 10:38:50,824 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e12_1479637990302_0002_01_000002 Container Transitioned from RUNNING to COMPLETED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
