hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
Date Wed, 14 Oct 2015 22:07:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957904#comment-14957904 ]

Jian He commented on YARN-4000:

Forgot about my previous comment:
bq. actually, I think this will be a problem in regular case.
Consider this scenario:
1) The application is recovered and added to the scheduler, but some slow NM has not yet
re-registered, so its containers are not yet recovered.
2) The user kills this app.
3) CapacityScheduler#doneApplicationAttempt is called, and the containers tracked by the RM so
far are killed. Note that CapacityScheduler#doneApplication is not called, so the scheduler
still has the SchedulerApplication in memory.
4) The slow NM now re-registers and tries to recover its containers. These containers are
recovered even though the application is in the process of being killed, and they will not be
killed later on. Hence, these containers are leaked.
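
The ordering in steps 1) to 4) can be sketched with a toy model. This is only an illustration, not the CapacityScheduler code: ToyScheduler, LeakSketch, and the method bodies below are hypothetical stand-ins that borrow the method names from the comment, and the assumption that recovery only checks whether the application is still known to the scheduler is taken from the scenario as described, not verified against the source.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical classes, not the real YARN types.
class ToyScheduler {
    // appId -> container ids the scheduler currently tracks for that app
    private final Map<String, Set<String>> apps = new HashMap<>();
    // apps whose attempt has already been finished (for example, killed)
    private final Set<String> finishedAttempts = new HashSet<>();

    void addApplication(String appId) {
        apps.put(appId, new HashSet<>());
    }

    // Stand-in for container recovery when an NM re-registers.
    // Returns true if the container was accepted.
    boolean recoverContainer(String appId, String containerId) {
        Set<String> containers = apps.get(appId);
        if (containers == null) {
            return false; // app is unknown to the scheduler
        }
        // Assumed gap: nothing here checks finishedAttempts.
        containers.add(containerId);
        return true;
    }

    // Stand-in for doneApplicationAttempt: kills only the containers known
    // so far and keeps the application entry in memory.
    int doneApplicationAttempt(String appId) {
        Set<String> containers = apps.get(appId);
        int killed = containers.size();
        containers.clear();
        finishedAttempts.add(appId);
        return killed;
    }

    boolean isAttemptFinished(String appId) {
        return finishedAttempts.contains(appId);
    }

    int trackedContainers(String appId) {
        return apps.get(appId).size();
    }
}

class LeakSketch {
    // Replays steps 1) to 4) and returns how many containers are still
    // tracked for an app whose attempt is already finished.
    static int simulate() {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.addApplication("app_1");           // 1) app recovered
        scheduler.recoverContainer("app_1", "c1");   //    fast NM reports c1
        scheduler.doneApplicationAttempt("app_1");   // 2) + 3) kill, c1 killed
        scheduler.recoverContainer("app_1", "c2");   // 4) slow NM re-registers
        scheduler.recoverContainer("app_1", "c3");
        return scheduler.isAttemptFinished("app_1")
                ? scheduler.trackedContainers("app_1")
                : 0;
    }

    public static void main(String[] args) {
        System.out.println("leaked containers: " + simulate());
    }
}
```

In this toy model, c2 and c3 are accepted after the attempt is done and nothing kills them afterwards, which mirrors the leak described in step 4).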

> RM crashes with NPE if leaf queue becomes parent queue during restart
> ---------------------------------------------------------------------
>                 Key: YARN-4000
>                 URL: https://issues.apache.org/jira/browse/YARN-4000
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-4000.01.patch, YARN-4000.02.patch, YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch, YARN-4000.06.patch
>
> This is a similar situation to YARN-2308.  If an application is active in queue A and
> then the RM restarts with a changed capacity scheduler configuration where queue A becomes
> a parent queue to other subqueues then the RM will crash with a NullPointerException.

This message was sent by Atlassian JIRA
