hadoop-yarn-issues mailing list archives

From "rangjiaheng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-7377) Duplicate Containers allocated for Long-Running Application after NM lost and restart and RM restart
Date Sat, 21 Oct 2017 05:14:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

rangjiaheng updated YARN-7377:
------------------------------
    Description: 
Case:
A Spark streaming application named app1 has been running on YARN for a long time; app1 has a container named c1 on an NM named nm1.
1. The NM named nm1 was lost for some reason, but the containers on it kept running well;
2. 10 minutes later, the RM marked this NM as lost because no heartbeats had been received; the RM then told app1's AM that a container of app1 had failed because the NM was lost, so app1's AM killed that container through RPC and requested a new container named c2 from the RM;
3. The administrator found that nm1 was lost, so he restarted it; since NM recovery was enabled, the NM restored all of its containers, including container c1, but c1's status is now 'DONE'; a bug here: this NM will list this container in the web UI forever;
4. The RM restarted for some reason; since RM recovery was enabled, 





  was:
Case:



> Duplicate Containers allocated for Long-Running Application after NM lost and restart and RM restart
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7377
>                 URL: https://issues.apache.org/jira/browse/YARN-7377
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications, nodemanager, RM, yarn
>    Affects Versions: 3.0.0-alpha3
>         Environment: Hadoop 2.7.1, RM recovery and NM recovery enabled;
> Spark streaming application, a long-running application on YARN
>            Reporter: rangjiaheng
>              Labels: patch
>
> Case:
> A Spark streaming application named app1 has been running on YARN for a long time; app1 has a container named c1 on an NM named nm1.
> 1. The NM named nm1 was lost for some reason, but the containers on it kept running well;
> 2. 10 minutes later, the RM marked this NM as lost because no heartbeats had been received; the RM then told app1's AM that a container of app1 had failed because the NM was lost, so app1's AM killed that container through RPC and requested a new container named c2 from the RM (see the sketch below);
> 3. The administrator found that nm1 was lost, so he restarted it; since NM recovery was enabled, the NM restored all of its containers, including container c1, but c1's status is now 'DONE'; a bug here: this NM will list this container in the web UI forever;
> 4. The RM restarted for some reason; since RM recovery was enabled, 
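A minimal sketch of the AM-side reaction described in step 2, assuming a generic AMRMClientAsync-based application master running with yarn.resourcemanager.recovery.enabled and yarn.nodemanager.recovery.enabled set to true as in the Environment field above (the real AM here is Spark's allocator, whose code is not part of this report): once the RM expires nm1, the AM receives c1 as completed with exit status ABORTED and asks the RM for a replacement container c2. The class name, resource sizing, and priority below are illustrative, not taken from the report.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
    import org.apache.hadoop.yarn.api.records.ContainerStatus;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

    // Hypothetical handler, not taken from Spark or from this JIRA.
    public class LostNodeReplacementHandler implements AMRMClientAsync.CallbackHandler {

      private final AMRMClientAsync<ContainerRequest> amRMClient; // assumed started elsewhere

      public LostNodeReplacementHandler(AMRMClientAsync<ContainerRequest> amRMClient) {
        this.amRMClient = amRMClient;
      }

      @Override
      public void onContainersCompleted(List<ContainerStatus> statuses) {
        for (ContainerStatus status : statuses) {
          // ABORTED is what the AM sees for c1 once the RM expires nm1
          // after ~10 minutes without heartbeats.
          if (status.getExitStatus() == ContainerExitStatus.ABORTED) {
            // Ask the RM for a replacement container (c2 in the report);
            // sizing and priority here are placeholders.
            Resource capability = Resource.newInstance(2048, 1);
            amRMClient.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(1)));
          }
        }
      }

      @Override
      public void onContainersAllocated(List<Container> containers) {
        // Launch the replacement container (c2) on its newly assigned node here.
      }

      @Override public void onShutdownRequest() { }
      @Override public void onNodesUpdated(List<NodeReport> updatedNodes) { }
      @Override public void onError(Throwable e) { }
      @Override public float getProgress() { return 0.0f; }
    }

If nm1 is later restarted with NM recovery enabled and the RM is also restarted with RM recovery enabled, the issue title suggests that both the recovered c1 and the replacement c2 can end up attributed to app1, which appears to be the duplication this JIRA describes.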



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


