hadoop-yarn-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
Date Wed, 10 Jun 2015 19:18:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580965#comment-14580965
] 

zhihai xu commented on YARN-3790:
---------------------------------

[~rohithsharma], thanks for updating the title.
The containers are recovered and {{rootMetrics}}'s used resource is updated, but {{rootMetrics}}'s available resource is not.
The following logs from the failed test prove it:
{code}
2015-06-09 22:55:42,964 INFO  [ResourceManager Event Processor] fair.FairScheduler (FairScheduler.java:addNode(855))
- Added node 127.0.0.1:1234 cluster capacity: <memory:8192, vCores:8>
2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756))
- Processing event for application_1433915736884_0001 of type NODE_UPDATE
2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:processNodeUpdate(820))
- Received node update event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287))
- The updated demand for root.default is <memory:0, vCores:0>; the max is <memory:2147483647,
vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289))
- The updated fairshare for root.default is <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163))
- Counting resource from root.default <memory:0, vCores:0>; Total resource consumption
for root now <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemandForApp(298))
- Counting resource from application_1433915736884_0001 <memory:0, vCores:0>; Total
resource consumption for root.zxu now <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287))
- The updated demand for root.zxu is <memory:0, vCores:0>; the max is <memory:2147483647,
vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289))
- The updated fairshare for root.zxu is <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163))
- Counting resource from root.zxu <memory:0, vCores:0>; Total resource consumption for
root now <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(177))
- The updated demand for root is <memory:0, vCores:0>; the max is <memory:2147483647,
vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSQueue (FSQueue.java:setFairShare(196))
- The updated fairShare for root is <memory:8192, vCores:8>
2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id
{ app_attempt_id { application_id { id: 1 cluster_timestamp: 1433915736884 } attemptId: 1
} id: 1 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority {
priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression:
""
2015-06-09 22:55:42,965 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(382)) - Processing container_1433915736884_0001_01_000001 of
type RECOVER
2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167))
- Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
APP_RUNNING_ON_NODE
2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(394)) - container_1433915736884_0001_01_000001 Container Transitioned
from NEW to RUNNING
2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756))
- Processing event for application_1433915736884_0001 of type APP_RUNNING_ON_NODE
2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154))
- Assigned container container_1433915736884_0001_01_000001 of capacity <memory:1024, vCores:1>
on host 127.0.0.1:1234, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168,
vCores:7> available after allocation
2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433915736884_0001_000001
is recovering container container_1433915736884_0001_01_000001
2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id
{ app_attempt_id { application_id { id: 1 cluster_timestamp: 1433915736884 } attemptId: 1
} id: 2 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority {
priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression:
""
2015-06-09 22:55:42,966 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(382)) - Processing container_1433915736884_0001_01_000002 of
type RECOVER
2015-06-09 22:55:42,966 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167))
- Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
APP_RUNNING_ON_NODE
2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(394)) - container_1433915736884_0001_01_000002 Container Transitioned
from NEW to RUNNING
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756))
- Processing event for application_1433915736884_0001 of type APP_RUNNING_ON_NODE
2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154))
- Assigned container container_1433915736884_0001_01_000002 of capacity <memory:1024, vCores:1>
on host 127.0.0.1:1234, which has 2 containers, <memory:2048, vCores:2> used and <memory:6144,
vCores:6> available after allocation
2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433915736884_0001_000001
is recovering container container_1433915736884_0001_01_000002
2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id
{ app_attempt_id { application_id { id: 1 cluster_timestamp: 1433915736884 } attemptId: 1
} id: 3 } container_state: C_COMPLETE resource { memory: 1024 virtual_cores: 1 } priority
{ priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0
nodeLabelExpression: ""
2015-06-09 22:55:42,967 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(382)) - Processing container_1433915736884_0001_01_000003 of
type RECOVER
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167))
- Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.event.RMAppAttemptContainerFinishedEvent.EventType:
CONTAINER_FINISHED
2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(394)) - container_1433915736884_0001_01_000003 Container Transitioned
from NEW to COMPLETED
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781))
- Processing event for appattempt_1433915736884_0001_000001 of type CONTAINER_FINISHED
Container State is : RUNNING
Container State is : RUNNING
Received completed containers [ContainerStatus: [ContainerId: container_1433915736884_0001_01_000003,
State: COMPLETE, Diagnostics: recover container, ExitStatus: 0, ]]
2015-06-09 22:55:43,266 DEBUG [Thread-645] fair.FSAppAttempt (FSAppAttempt.java:getHeadroom(200))
- Headroom calculation for application_1433915736884_0001:Min((queueFairShare=<memory:8192,
vCores:8> - queueUsage=<memory:2048, vCores:2>), maxAvailableResource=<memory:6144,
vCores:6>Headroom=<memory:6144, vCores:6>
{code}
The Headroom calculation log shows queueUsage=<memory:2048, vCores:2>. The code that updates {{rootMetrics}}'s used resource is in {{AppSchedulingInfo#recoverContainer}}:
{code}
metrics.allocateResources(user, 1, rmContainer.getAllocatedResource(), false);
{code}
And we can see that {{update}}/{{updateDemand}} is called by the UpdateThread before {{recoverContainersOnNode}}.
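
To make the ordering concrete, here is a minimal, self-contained sketch of the race; it is not the actual YARN code, and {{RootMetricsModel}} and its methods are hypothetical stand-ins for the root queue metrics. In the failed run, the available gauge is recomputed from cluster capacity minus used before recovery bumps the used gauge, so it stays at 8192:
{code}
// Hypothetical stand-in for the root QueueMetrics (not the actual YARN classes):
// "used" and "available" are tracked separately, and "available" is only
// recomputed by an explicit refresh, mirroring the UpdateThread's update() pass.
class RootMetricsModel {
  int usedMB = 0;
  int availableMB = 0;

  void allocate(int mb) { usedMB += mb; }                           // ~ used-resource update during recovery
  void refresh(int clusterMB) { availableMB = clusterMB - usedMB; } // ~ UpdateThread recomputing "available"
}

public class FailedOrderingSketch {
  public static void main(String[] args) {
    // Failed-run ordering: refresh first, recovery second, so "available" never
    // reflects the two recovered 1024 MB containers.
    RootMetricsModel m = new RootMetricsModel();
    m.refresh(8192);   // UpdateThread runs before recoverContainersOnNode
    m.allocate(2048);  // recovery of the two RUNNING containers updates "used" only
    System.out.println("available=" + m.availableMB + " used=" + m.usedMB);
    // prints available=8192 used=2048 -> the assert sees 8192 instead of 6144
  }
}
{code}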

The following is the log from a successful run of the test.
{code}
2015-06-09 23:40:53,432 INFO  [ResourceManager Event Processor] fair.FairScheduler (FairScheduler.java:addNode(855))
- Added node 127.0.0.1:1234 cluster capacity: <memory:8192, vCores:8>
2015-06-09 23:40:53,432 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756))
- Processing event for application_1433918447351_0001 of type NODE_UPDATE
2015-06-09 23:40:53,432 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:processNodeUpdate(820))
- Received node update event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2015-06-09 23:40:53,432 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id
{ app_attempt_id { application_id { id: 1 cluster_timestamp: 1433918447351 } attemptId: 1
} id: 1 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority {
priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression:
""
2015-06-09 23:40:53,433 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(382)) - Processing container_1433918447351_0001_01_000001 of
type RECOVER
2015-06-09 23:40:53,433 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(394)) - container_1433918447351_0001_01_000001 Container Transitioned
from NEW to RUNNING
2015-06-09 23:40:53,433 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167))
- Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
APP_RUNNING_ON_NODE
2015-06-09 23:40:53,433 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154))
- Assigned container container_1433918447351_0001_01_000001 of capacity <memory:1024, vCores:1>
on host 127.0.0.1:1234, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168,
vCores:7> available after allocation
2015-06-09 23:40:53,433 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756))
- Processing event for application_1433918447351_0001 of type APP_RUNNING_ON_NODE
2015-06-09 23:40:53,434 INFO  [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433918447351_0001_000001
is recovering container container_1433918447351_0001_01_000001
2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id
{ app_attempt_id { application_id { id: 1 cluster_timestamp: 1433918447351 } attemptId: 1
} id: 2 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority {
priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression:
""
2015-06-09 23:40:53,435 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(382)) - Processing container_1433918447351_0001_01_000002 of
type RECOVER
2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(394)) - container_1433918447351_0001_01_000002 Container Transitioned
from NEW to RUNNING
2015-06-09 23:40:53,435 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167))
- Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
APP_RUNNING_ON_NODE
2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154))
- Assigned container container_1433918447351_0001_01_000002 of capacity <memory:1024, vCores:1>
on host 127.0.0.1:1234, which has 2 containers, <memory:2048, vCores:2> used and <memory:6144,
vCores:6> available after allocation
2015-06-09 23:40:53,435 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756))
- Processing event for application_1433918447351_0001 of type APP_RUNNING_ON_NODE
2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433918447351_0001_000001
is recovering container container_1433918447351_0001_01_000002
2015-06-09 23:40:53,436 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id
{ app_attempt_id { application_id { id: 1 cluster_timestamp: 1433918447351 } attemptId: 1
} id: 3 } container_state: C_COMPLETE resource { memory: 1024 virtual_cores: 1 } priority
{ priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0
nodeLabelExpression: ""
2015-06-09 23:40:53,436 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(382)) - Processing container_1433918447351_0001_01_000003 of
type RECOVER
2015-06-09 23:40:53,436 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl
(RMContainerImpl.java:handle(394)) - container_1433918447351_0001_01_000003 Container Transitioned
from NEW to COMPLETED
2015-06-09 23:40:53,436 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167))
- Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.event.RMAppAttemptContainerFinishedEvent.EventType:
CONTAINER_FINISHED
2015-06-09 23:40:53,436 DEBUG [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781))
- Processing event for appattempt_1433918447351_0001_000001 of type CONTAINER_FINISHED
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287))
- The updated demand for root.default is <memory:0, vCores:0>; the max is <memory:2147483647,
vCores:2147483647>
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289))
- The updated fairshare for root.default is <memory:0, vCores:0>
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163))
- Counting resource from root.default <memory:0, vCores:0>; Total resource consumption
for root now <memory:0, vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemandForApp(298))
- Counting resource from application_1433918447351_0001 <memory:2048, vCores:2>; Total
resource consumption for root.zxu now <memory:0, vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287))
- The updated demand for root.zxu is <memory:2048, vCores:2>; the max is <memory:2147483647,
vCores:2147483647>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289))
- The updated fairshare for root.zxu is <memory:0, vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163))
- Counting resource from root.zxu <memory:2048, vCores:2>; Total resource consumption
for root now <memory:0, vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(177))
- The updated demand for root is <memory:2048, vCores:2>; the max is <memory:2147483647,
vCores:2147483647>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSQueue (FSQueue.java:setFairShare(196))
- The updated fairShare for root is <memory:8192, vCores:8>
Container State is : RUNNING
Container State is : RUNNING
Received completed containers [ContainerStatus: [ContainerId: container_1433918447351_0001_01_000003,
State: COMPLETE, Diagnostics: recover container, ExitStatus: 0, ]]
2015-06-09 23:40:53,632 INFO  [Thread-643] resourcemanager.TestWorkPreservingRMRestart (TestWorkPreservingRMRestart.java:testSchedulerRecovery(240))
- testSchedulerRecovery end
2015-06-09 23:40:53,733 DEBUG [Thread-643] fair.FSAppAttempt (FSAppAttempt.java:getHeadroom(200))
- Headroom calculation for application_1433918447351_0001:Min((queueFairShare=<memory:8192,
vCores:8> - queueUsage=<memory:2048, vCores:2>), maxAvailableResource=<memory:6144,
vCores:6>Headroom=<memory:6144, vCores:6>
{code}
Here we can see that {{update}}/{{updateDemand}} is called by the UpdateThread after {{recoverContainersOnNode}}.
The test passed because {{update}} set {{rootMetrics}}'s available resource correctly after {{recoverContainersOnNode}} had updated {{rootMetrics}}'s used resource.
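
For contrast, the same hypothetical sketch in the ordering of the successful run: once recovery bumps the used gauge first, the later refresh lands on the expected 6144.
{code}
// Passing-run ordering with the same hypothetical RootMetricsModel as above:
// recovery updates "used" first, then the UpdateThread's refresh recomputes "available".
public class PassingOrderingSketch {
  public static void main(String[] args) {
    RootMetricsModel m = new RootMetricsModel();
    m.allocate(2048);  // recoverContainersOnNode recovers the two RUNNING containers
    m.refresh(8192);   // UpdateThread runs afterwards
    System.out.println("available=" + m.availableMB + " used=" + m.usedMB);
    // prints available=6144 used=2048, which is what the assertion expects
  }
}
{code}
Either way, the outcome of the assertion depends on whether the FairSchedulerUpdateThread happens to run before or after container recovery, which is why the failure is intermittent.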

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS
scheduler
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3790
>                 URL: https://issues.apache.org/jira/browse/YARN-3790
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith
>            Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec <<<
FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
 Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:743)
> 	at org.junit.Assert.assertEquals(Assert.java:118)
> 	at org.junit.Assert.assertEquals(Assert.java:555)
> 	at org.junit.Assert.assertEquals(Assert.java:542)
> 	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
> 	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
> 	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
