hadoop-yarn-issues mailing list archives

From "sandflee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2567) Add a percentage-node threshold for RM to wait for new allocations after restart/failover
Date Tue, 12 Apr 2016 07:10:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236724#comment-15236724
] 

sandflee commented on YARN-2567:
--------------------------------

There may be a problem: if the RM recovers an NM in a finished state (such as LOST) and the
NM then registers with running containers, we would normally kill those containers. That can
go wrong in a sequence like this:
1. The NM is LOST and the RM stores the LOST status successfully.
2. The RM fails over and recovers the NM as LOST.
3. The NM registers and becomes RUNNING, {color:red}but the RM's store of the RUNNING state fails or is delayed{color}.
4. The RM allocates a container on the NM, and the container runs on it.
5. The RM fails over again and recovers the NM as LOST.
6. The NM registers with the RM, and the RM kills the container on it, which is not expected.
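
To make the six steps concrete, here is a toy Python replay of them. It is only a sketch under my own assumptions: `StateStore`, `ResourceManager`, `replay_scenario`, and the `store_fails` flag are made-up names for illustration, not YARN classes or APIs, and the real RM recovery path is far more involved.

```python
from enum import Enum


class NodeState(Enum):
    RUNNING = "RUNNING"
    LOST = "LOST"


class StateStore:
    """Hypothetical persistent store that survives RM failover."""

    def __init__(self):
        self.saved = {}

    def save(self, node_id, state, fail=False):
        # fail=True models a write that is lost or still pending at failover.
        if not fail:
            self.saved[node_id] = state


class ResourceManager:
    """Toy RM: in-memory node states are rebuilt from the store on start."""

    def __init__(self, store):
        self.store = store
        self.nodes = dict(store.saved)  # recovery after restart/failover
        self.killed = []

    def mark_lost(self, node_id):
        self.nodes[node_id] = NodeState.LOST
        self.store.save(node_id, NodeState.LOST)

    def register(self, node_id, running_containers=(), store_fails=False):
        # A node recovered in a finished state should have no live
        # containers, so anything it reports is killed.
        if self.nodes.get(node_id) == NodeState.LOST:
            self.killed.extend(running_containers)
        # The node becomes RUNNING in memory before the write is confirmed.
        self.nodes[node_id] = NodeState.RUNNING
        self.store.save(node_id, NodeState.RUNNING, fail=store_fails)


def replay_scenario():
    store = StateStore()
    rm = ResourceManager(store)
    rm.mark_lost("nm1")                    # step 1: LOST is stored
    rm = ResourceManager(store)            # step 2: failover, recovered as LOST
    rm.register("nm1", store_fails=True)   # step 3: RUNNING is never stored
    container = "container_1"              # step 4: container allocated on nm1
    rm = ResourceManager(store)            # step 5: failover, recovered as LOST
    rm.register("nm1", [container])        # step 6: healthy container is killed
    return rm.killed
```

In this model `replay_scenario()` returns the container that was legitimately allocated in step 4, which is the unexpected kill described above.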

To fix this, one solution is to store the NM status first and only then let the NM become
RUNNING, but this may delay NM registration on a big cluster.
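
A rough Python sketch of that store-first ordering, again under my own assumptions: `NodeStateStore`, `StoreFirstRM`, `can_allocate`, and the `available` flag are hypothetical names for illustration, not anything in the YARN code base.

```python
class NodeStateStore:
    """Hypothetical persistent store; `available` simulates write failures."""

    def __init__(self):
        self.saved = {}
        self.available = True

    def save(self, node_id, state):
        # Returns True only when the state was durably written.
        if not self.available:
            return False
        self.saved[node_id] = state
        return True


class StoreFirstRM:
    """Toy RM that persists RUNNING before exposing the node to the scheduler."""

    def __init__(self, store):
        self.store = store
        self.nodes = dict(store.saved)  # state recovered after failover

    def register(self, node_id):
        # Synchronous write: this is the step that may slow registration
        # on a big cluster.
        if not self.store.save(node_id, "RUNNING"):
            return False  # node keeps its old state and must retry
        self.nodes[node_id] = "RUNNING"
        return True

    def can_allocate(self, node_id):
        # The scheduler only places containers on nodes that are RUNNING.
        return self.nodes.get(node_id) == "RUNNING"
```

With this ordering a container can only land on a node whose RUNNING state is already in the store, so a later failover recovers the node as RUNNING rather than LOST. The price is one blocking write per registration.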

> Add a percentage-node threshold for RM to wait for new allocations after restart/failover
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-2567
>                 URL: https://issues.apache.org/jira/browse/YARN-2567
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> This is the remaining part of YARN-2001 - to halt allocations after restart till x% of
> nodes sync back with the RM. This is useful for avoiding bad scheduling during the time the
> nodes are still joining back after a restart/failover.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
