hadoop-yarn-issues mailing list archives

From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5918) Opportunistic scheduling allocate request failure when NM lost
Date Wed, 23 Nov 2016 13:08:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690057#comment-15690057 ]

Varun Saxena commented on YARN-5918:

Thanks [~bibinchundatt] for the latest patch.

We did not actually need such an elaborate test. We could probably have achieved the same
thing by mocking the scheduler to simulate the NPE. But it's not necessarily bad to have some
sort of E2E test case simulating the complete scenario.

A few comments.
# This test doesn't seem to belong in TestNodeQueueLoadMonitor. We can move it to a more
suitable test class, or create a new one if it doesn't fit anywhere.
# We are iterating 30 times and sleeping for 100 ms on each iteration, which makes the test
spend 3 seconds in the loop unnecessarily. We can get rid of this loop and speed the test
up as follows.
** Set YarnConfiguration.NM_CONTAINER_QUEUING_SORTING_NODES_INTERVAL_MS in the configuration
to 100 ms instead of the 1000 ms default.
** Move the node-added and node-update events sent to the AM Service outside the loop.
** We can get the OpportunisticContainerContext from FiCaSchedulerApp and check its
getNodeMap, which makes the test deterministic and reduces its run time.
** After sending the node add and update events, loop (say) 10-20 times sending allocate,
with a sleep of around 50 ms between attempts, and break out of the loop as soon as getNodeMap
has 2 nodes. Then send the node-removed event to the scheduler and loop over allocate again,
waiting until getNodeMap shrinks to 1.
# Not related to your patch: NodeQueueLoadMonitor has some LOG.debug statements without an
isDebugEnabled guard. Maybe we can fix those here as well.
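To illustrate the deterministic wait suggested in comment #2, here is a minimal standalone sketch (not the actual patch code): waitFor and the nodeMap stand-in are hypothetical placeholders for the real getNodeMap on OpportunisticContainerContext, and the wait returns as soon as the expected node count is observed instead of always sleeping for the full 3 seconds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class PollingWaitSketch {
    // Poll until the condition holds or the timeout expires, sleeping briefly
    // between checks. Returns early the moment the condition becomes true.
    static boolean waitFor(Supplier<Boolean> condition, long intervalMs,
            long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.get()) {
                return true;  // break out as soon as the expected state is reached
            }
            Thread.sleep(intervalMs);
        }
        return condition.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for OpportunisticContainerContext#getNodeMap.
        Map<String, String> nodeMap = new ConcurrentHashMap<>();
        nodeMap.put("node1:1234", "up");
        nodeMap.put("node2:1234", "up");

        // After the node add/update events, wait until both nodes are visible.
        System.out.println("two nodes: "
            + waitFor(() -> nodeMap.size() == 2, 50, 2000));

        // Simulate the node-removed event, then wait for the map to shrink to 1.
        nodeMap.remove("node2:1234");
        System.out.println("one node: "
            + waitFor(() -> nodeMap.size() == 1, 50, 2000));
    }
}
```

In the real test the condition would call getNodeMap().size() between allocate calls, so the test finishes as soon as the scheduler's node-sorting thread has run, rather than after a fixed sleep.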
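For comment #3, the guard pattern looks like the sketch below. This standalone example uses java.util.logging purely so it runs on its own; NodeQueueLoadMonitor itself uses commons-logging, where isDebugEnabled() plays the role of isLoggable(Level.FINE) here.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugGuardSketch {
    private static final Logger LOG =
        Logger.getLogger(DebugGuardSketch.class.getName());

    public static void main(String[] args) {
        String nodeId = "docker2:38297";
        // Without the guard, the string concatenation below would run on every
        // call even when debug-level logging is disabled. The guard skips the
        // message construction entirely in that case.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Removed node " + nodeId + " from the load monitor");
        }
        // Default java.util.logging level is INFO, so FINE is not loggable here.
        System.out.println("guard evaluated: " + LOG.isLoggable(Level.FINE));
    }
}
```

The point of the guard is to avoid paying the cost of building the log message on hot paths when debug logging is off.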

> Opportunistic scheduling allocate request failure when NM lost
> --------------------------------------------------------------
>                 Key: YARN-5918
>                 URL: https://issues.apache.org/jira/browse/YARN-5918
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-5918.0001.patch, YARN-5918.0002.patch
> Allocate request failure during Opportunistic container allocation when the nodemanager is lost.
> {noformat}
> 2016-11-20 10:38:49,011 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=root     OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1479637990302_0002
   CONTAINERID=container_e12_1479637990302_0002_01_000006  RESOURCE=<memory:1024, vCores:1>
> 2016-11-20 10:38:49,011 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Removed node docker2:38297 clusterResource: <memory:4096, vCores:8>
> 2016-11-20 10:38:49,434 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8030,
call Call#35 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNode(OpportunisticContainerAllocatorAMService.java:420)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNodes(OpportunisticContainerAllocatorAMService.java:412)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.getLeastLoadedNodes(OpportunisticContainerAllocatorAMService.java:402)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.allocate(OpportunisticContainerAllocatorAMService.java:236)
>         at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>         at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2539)
> 2016-11-20 10:38:50,824 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e12_1479637990302_0002_01_000002 Container Transitioned from RUNNING to COMPLETED
> {noformat}

This message was sent by Atlassian JIRA

