hadoop-yarn-issues mailing list archives

From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
Date Tue, 22 Sep 2015 20:43:04 GMT

    [ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903400#comment-14903400 ]

Varun Saxena commented on YARN-4000:

bq. Is this the case? I think in current code, RM is still ignoring these orphan containers?
In recoverContainersOnNode, if the application is not found in the scheduler, the flow in the RM,
as far as I can see from the trunk code, is as follows:
# AbstractYarnScheduler#killOrphanContainerOnNode is called if the application is not found
in the scheduler. It in turn posts a CLEANUP_CONTAINER event for each container that has not
finished, and this event is handled by RMNodeImpl. Note that we send one CLEANUP_CONTAINER
event per container, even though all containers of a running app have to be cleaned up.
Maybe this can be refactored to send a single event carrying all the containers for an app
and node, but cleaning up a lot of containers like this may be a rare scenario.
# Going further, RMNodeImpl processes this event in CleanUpContainerTransition, where the
container is added to the set containersToClean.
# When a heartbeat from the NM arrives, ResourceTrackerService#nodeHeartbeat calls
RMNodeImpl#updateNodeHeartbeatResponseForCleanup. This method populates the response with the
containers to clean up, taken from the set containersToClean. Hence these containers are
reported back to the NM in the heartbeat response.
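
The RM-side steps above can be modelled roughly as follows. This is only a toy sketch to make the flow concrete, not the actual Hadoop code: the class RmOrphanCleanupSketch, its nested types and all method names are hypothetical stand-ins, container and application IDs are plain strings, and it assumes the pending set is cleared once it has been copied into a heartbeat response.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the RM-side flow; every name here is hypothetical, not a Hadoop class. */
public class RmOrphanCleanupSketch {

    /** A container reported by an NM when it registers after an RM restart. */
    public static class ContainerReport {
        final String containerId;
        final String appId;
        final boolean finished;

        public ContainerReport(String containerId, String appId, boolean finished) {
            this.containerId = containerId;
            this.appId = appId;
            this.finished = finished;
        }
    }

    /** Stand-in for RMNodeImpl: remembers containers to clean until the next heartbeat. */
    public static class NodeModel {
        private final Set<String> containersToClean = new LinkedHashSet<>();

        // Models CleanUpContainerTransition: one CLEANUP_CONTAINER event adds one container.
        public void handleCleanupContainer(String containerId) {
            containersToClean.add(containerId);
        }

        // Models updateNodeHeartbeatResponseForCleanup: copy the pending IDs into the
        // response. Assumption of this sketch: the set is cleared once reported.
        public List<String> buildHeartbeatCleanupList() {
            List<String> toCleanup = new ArrayList<>(containersToClean);
            containersToClean.clear();
            return toCleanup;
        }
    }

    // Models recovery plus killOrphanContainerOnNode: an unfinished container whose
    // application is unknown to the scheduler triggers one cleanup event.
    public static void recoverContainersOnNode(List<ContainerReport> reports,
                                               Set<String> appsInScheduler,
                                               NodeModel node) {
        for (ContainerReport report : reports) {
            if (!appsInScheduler.contains(report.appId) && !report.finished) {
                node.handleCleanupContainer(report.containerId);
            }
        }
    }

    public static void main(String[] args) {
        NodeModel node = new NodeModel();
        Set<String> appsInScheduler = new HashSet<>(Arrays.asList("app_1"));
        List<ContainerReport> reports = Arrays.asList(
            new ContainerReport("container_1", "app_1", false),  // app known, kept
            new ContainerReport("container_2", "app_2", false),  // orphan, still running
            new ContainerReport("container_3", "app_2", true));  // orphan, already finished
        recoverContainersOnNode(reports, appsInScheduler, node);
        System.out.println("Sent to NM in heartbeat response: "
            + node.buildHeartbeatCleanupList());
    }
}
```

In this model only container_2 ends up in the heartbeat response, since container_1 belongs to a known application and container_3 has already finished.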

On the NM side, the flow is as follows:
# In NodeStatusUpdaterImpl, the containers to clean up are retrieved from the heartbeat
response and a CMgrCompletedContainersEvent is dispatched.
# In ContainerManagerImpl, this event is processed and a ContainerKillEvent is created for
each container.
# Depending on the state of the container, ContainerImpl then sends a CLEANUP_CONTAINER
event to ContainersLauncher, which in turn sends a TERM/KILL signal to the container.
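
The NM-side steps above can be modelled in the same spirit. Again this is a toy sketch under stated assumptions, not the real NM code: NmCleanupSketch and its methods are hypothetical, the three-value State enum is a drastic simplification of the real container state machine, and the rule that only RUNNING containers get a signal is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the NM-side flow; every name here is hypothetical, not a Hadoop class. */
public class NmCleanupSketch {

    /** Simplified container states; the real state machine has many more. */
    public enum State { NEW, RUNNING, DONE }

    private final Map<String, State> containers = new LinkedHashMap<>();
    private final List<String> signalled = new ArrayList<>();

    public void addContainer(String containerId, State state) {
        containers.put(containerId, state);
    }

    // Models NodeStatusUpdaterImpl: read the cleanup list from the heartbeat response
    // and dispatch a single event that carries all of the containers.
    public void onHeartbeatResponse(List<String> containersToCleanup) {
        if (!containersToCleanup.isEmpty()) {
            handleCompletedContainersEvent(containersToCleanup);
        }
    }

    // Models ContainerManagerImpl: one kill event per container in the batch.
    private void handleCompletedContainersEvent(List<String> containerIds) {
        for (String containerId : containerIds) {
            handleKill(containerId);
        }
    }

    // Models ContainerImpl: what happens depends on the container's state.
    // Assumption of this sketch: only a RUNNING container has a process to signal.
    private void handleKill(String containerId) {
        State state = containers.get(containerId);
        if (state == null || state == State.DONE) {
            return; // unknown or already finished, nothing to do
        }
        if (state == State.RUNNING) {
            signalled.add(containerId); // stands in for the launcher's TERM/KILL signal
        }
        containers.put(containerId, State.DONE);
    }

    public List<String> signalledContainers() {
        return signalled;
    }

    public State stateOf(String containerId) {
        return containers.get(containerId);
    }

    public static void main(String[] args) {
        NmCleanupSketch nm = new NmCleanupSketch();
        nm.addContainer("container_2", State.RUNNING);
        nm.addContainer("container_4", State.NEW);
        nm.onHeartbeatResponse(Arrays.asList("container_2", "container_4"));
        System.out.println("Signalled: " + nm.signalledContainers());
    }
}
```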

> RM crashes with NPE if leaf queue becomes parent queue during restart
> ---------------------------------------------------------------------
>                 Key: YARN-4000
>                 URL: https://issues.apache.org/jira/browse/YARN-4000
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-4000.01.patch, YARN-4000.02.patch, YARN-4000.03.patch, YARN-4000.04.patch,
> This is a similar situation to YARN-2308.  If an application is active in queue A and
then the RM restarts with a changed capacity scheduler configuration where queue A becomes
a parent queue to other subqueues then the RM will crash with a NullPointerException.

This message was sent by Atlassian JIRA
