hadoop-yarn-issues mailing list archives

From "Steven Rand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
Date Mon, 21 Aug 2017 22:26:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135938#comment-16135938 ]

Steven Rand commented on YARN-4227:
-----------------------------------

I'm seeing a similar issue on what's roughly branch-2 (CDH 5.11.0), with the error being:

{code}
2017-06-27 16:32:39,381 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Preemption Timer,5,main] threw an Exception.
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:687)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread$PreemptContainersTask.run(FSPreemptionThread.java:230)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
{code}

This error kills the FSPreemptionThread and thereby crashes the RM. It seems to be correlated
with NodeManagers being marked unhealthy due to lack of local disk space during large
shuffles. I haven't confirmed this, but presumably the unhealthy nodes are removed while
we're waiting for the lock, and no longer exist by the time we call {{releaseContainer}}.
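
To illustrate the suspected race, here is a minimal sketch of a null guard on the node lookup.
It assumes the NPE comes from dereferencing a node that was removed before the container
completion was processed. The names {{RemovedNodeGuardSketch}}, {{Node}}, and {{Scheduler}}
are hypothetical stand-ins; this is not the actual FairScheduler code or any of the attached
patches.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RemovedNodeGuardSketch {

    /** Hypothetical stand-in for a scheduler node; not a YARN class. */
    static final class Node {
        private int runningContainers;

        Node(int runningContainers) {
            this.runningContainers = runningContainers;
        }

        synchronized void releaseContainer() {
            runningContainers--;
        }

        synchronized int runningContainers() {
            return runningContainers;
        }
    }

    /** Hypothetical stand-in for the scheduler's node bookkeeping. */
    static final class Scheduler {
        private final Map<String, Node> nodes = new ConcurrentHashMap<>();

        void addNode(String nodeId, int runningContainers) {
            nodes.put(nodeId, new Node(runningContainers));
        }

        // Models a node being dropped, e.g. after it is marked unhealthy.
        void removeNode(String nodeId) {
            nodes.remove(nodeId);
        }

        /**
         * Releases a container on the given node. Returns false instead of
         * throwing a NullPointerException when the node was removed in the
         * meantime, so the calling thread survives.
         */
        synchronized boolean completedContainer(String nodeId) {
            Node node = nodes.get(nodeId);
            if (node == null) {
                // The node disappeared while we were waiting for the lock.
                return false;
            }
            node.releaseContainer();
            return true;
        }
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        scheduler.addNode("nm-1", 1);

        // Node still present: the container is released normally.
        System.out.println(scheduler.completedContainer("nm-1"));

        // Node removed before the next completion is processed: no NPE.
        scheduler.removeNode("nm-1");
        System.out.println(scheduler.completedContainer("nm-1"));
    }
}
```

In this sketch the caller gets a boolean back rather than an exception, so a timer or event
thread calling {{completedContainer}} would keep running when the node is already gone.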

I'm curious whether others are also seeing this on recent versions. If so, maybe this is
worth reopening?

> FairScheduler: RM quits processing expired container from a removed node
> ------------------------------------------------------------------------
>
>                 Key: YARN-4227
>                 URL: https://issues.apache.org/jira/browse/YARN-4227
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.3.0, 2.5.0, 2.7.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container event is processed, causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1436927988321_1307950_01_000012 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1436927988321_1307950_01_000012 Container Transitioned from ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: Completed container: container_1436927988321_1307950_01_000012 in state: EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1436927988321_1307950	CONTAINERID=container_1436927988321_1307950_01_000012
> 2015-10-04 21:14:01,063 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 and 2.6.0 by different customers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

