hadoop-yarn-issues mailing list archives

From "Wilfred Spiegelenburg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
Date Mon, 19 Oct 2015 07:01:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962935#comment-14962935
] 

Wilfred Spiegelenburg commented on YARN-4227:
---------------------------------------------

I dug through the logs and it looks similar to [YARN-3675], however the time lag is far larger
and multiple allocations happen between the removal of the node and the failing allocation.
These allocations happen on other nodes. I would have expected the node removal to have been
processed in that time. 

{code}
2015-09-11 23:47:52,589 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Released container container_e11_1438912039098_1027475_01_000012 of capacity <memory:6144,
vCores:1> on host XXXX.com:8041, which currently has 0 containers, <memory:0, vCores:0>
used and <memory:221184, vCores:36> available, release resources=true
2015-09-11 23:47:52,589 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Application attempt appattempt_1438912039098_1027475_000001 released container container_e11_1438912039098_1027475_01_000012
on node: host: XXXX.com:8041 #containers=0 available=221184 used=0 with event: KILL
2015-09-11 23:47:52,596 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Removed node XXXX.com:8041 cluster capacity: <memory:17190912, vCores:2798>
2015-09-11 23:47:52,596 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Null container completed...
2015-09-11 23:47:52,598 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e11_1438912039098_1027474_01_002591 Container Transitioned from RUNNING to COMPLETED
2015-09-11 23:47:52,598 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Completed container: container_e11_1438912039098_1027474_01_002591 in state: COMPLETED event:FINISHED
2015-09-11 23:47:52,598 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=my_user	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1438912039098_1027474
CONTAINERID=container_e11_1438912039098_1027474_01_002591
2015-09-11 23:47:52,598 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Released container container_e11_1438912039098_1027474_01_002591 of capacity <memory:6144,
vCores:1> on host YYYY.com:8041, which currently has 5 containers, <memory:24576, vCores:5>
used and <memory:159744, vCores:25> available, release resources=true
2015-09-11 23:47:52,598 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Application attempt appattempt_1438912039098_1027474_000001 released container container_e11_1438912039098_1027474_01_002591
on node: host: YYYY.com:8041 #containers=5 available=159744 used=24576 with event: FINISHED
2015-09-11 23:47:52,599 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e11_1438912039098_1027474_01_002648 Container Transitioned from NEW to ALLOCATED
2015-09-11 23:47:52,599 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=my_user	OPERATION=AM Allocated Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1438912039098_1027474
CONTAINERID=container_e11_1438912039098_1027474_01_002648
2015-09-11 23:47:52,600 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Assigned container container_e11_1438912039098_1027474_01_002648 of capacity <memory:6144,
vCores:1> on host YYYY.com:8041, which has 6 containers, <memory:30720, vCores:6>
used and <memory:153600, vCores:24> available after allocation
2015-09-11 23:47:52,600 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Null container completed...
2015-09-11 23:47:52,601 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e11_1438912039098_1027474_01_002649 Container Transitioned from NEW to ALLOCATED
2015-09-11 23:47:52,601 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=my_user	OPERATION=AM Allocated Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1438912039098_1027474
CONTAINERID=container_e11_1438912039098_1027474_01_002649
2015-09-11 23:47:52,601 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Assigned container container_e11_1438912039098_1027474_01_002649 of capacity <memory:6144,
vCores:1> on host YYYY.com:8041, which has 9 containers, <memory:51712, vCores:9>
used and <memory:169472, vCores:27> available after allocation
2015-09-11 23:47:52,604 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e11_1438912039098_1027474_01_002650 Container Transitioned from NEW to ALLOCATED
2015-09-11 23:47:52,604 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=my_user	OPERATION=AM Allocated Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1438912039098_1027474
CONTAINERID=container_e11_1438912039098_1027474_01_002650
2015-09-11 23:47:52,604 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Assigned container container_e11_1438912039098_1027474_01_002650 of capacity <memory:6144,
vCores:1> on host XXXX.com:8041, which has 1 containers, <memory:6144, vCores:1>
used and <memory:215040, vCores:35> available after allocation
{code}

Getting a list of nodes that was not updated 8ms after the removal seems to point to an issue
in node processing caused by far too coarse locking. I am thus not so sure that we can attribute
this to the same race condition (I know I am being really careful here). The log given above
from timestamp 47:52,589 to 47:52,604 is complete; no lines were removed. 

BTW: this example might have been taken from a different application, since I have seen this
a number of times over the last few weeks. 
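To make the race concrete: the NPE in the reported stack trace is consistent with the scheduler dereferencing a node that has already been removed from its node map when the CONTAINER_EXPIRED event is finally processed. The following is a minimal, simplified sketch of that pattern and the defensive null check that avoids crashing the event dispatcher; the class and method names here are illustrative only, not the actual FairScheduler code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for a scheduler node that tracks containers.
class Node {
    void releaseContainer(String containerId) {
        // release the container's resources on this node
    }
}

class Scheduler {
    private final Map<String, Node> nodes = new ConcurrentHashMap<>();

    void addNode(String host) {
        nodes.put(host, new Node());
    }

    void removeNode(String host) {
        nodes.remove(host);
    }

    // An expired-container event can arrive after its node was removed.
    // Looking up the node and dereferencing it unconditionally throws an
    // NPE; checking for null and ignoring the stale event keeps the
    // dispatcher alive. Returns true only if the container was released.
    boolean completedContainer(String host, String containerId) {
        Node node = nodes.get(host);
        if (node == null) {
            // Node removed before the event was processed: log and skip
            // rather than let the NPE take down the ResourceManager.
            return false;
        }
        node.releaseContainer(containerId);
        return true;
    }
}
```

Under this sketch, an event for a removed node simply becomes a no-op instead of a fatal error, which matches the direction the attached patches appear to take.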

> FairScheduler: RM quits processing expired container from a removed node
> ------------------------------------------------------------------------
>
>                 Key: YARN-4227
>                 URL: https://issues.apache.org/jira/browse/YARN-4227
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.3.0, 2.5.0, 2.7.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container event is processed,
causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1436927988321_1307950_01_000012
Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1436927988321_1307950_01_000012 Container Transitioned from ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp:
Completed container: container_1436927988321_1307950_01_000012 in state: EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=system_op	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1436927988321_1307950
CONTAINERID=container_1436927988321_1307950_01_000012
> 2015-10-04 21:14:01,063 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Error in handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 and 2.6.0
by different customers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
