hadoop-yarn-issues mailing list archives

From "zhangshilong (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-7214) duplicated container completed To AM
Date Tue, 19 Sep 2017 08:03:00 GMT
zhangshilong created YARN-7214:

             Summary: duplicated container completed To AM
                 Key: YARN-7214
                 URL: https://issues.apache.org/jira/browse/YARN-7214
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.0.0-alpha3, 2.7.1
         Environment: Hadoop 2.7.1, RM recovery and NM recovery enabled
            Reporter: zhangshilong

Environment: Hadoop 2.7.1 with RM recovery and NM recovery enabled.
A Spark application (app1) is running at least one container (c1) on NM1.
1. NM1 crashes, and the RM marks NM1 as expired after 10 minutes.
2. The RM removes all containers on NM1 (in RMNodeImpl), and app1 receives a completed message for c1. However, the RM cannot send c1 to NM1 in its list of containers to be removed, because NM1 is lost.
3. NM1 restarts and registers with the RM (c1 is included in the registration request), but the RM finds that NM1 is lost and does not handle the containers reported by NM1.
4. NM1 does not report c1 in its heartbeats (c1 is not in the heartbeat request), so c1 is never removed from NM1's context.
5. The RM restarts and NM1 re-registers with it. This time c1 is handled and recovered, and the RM sends a completed message for c1 to the AM of app1. As a result, app1 receives a duplicate completion for c1.
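The sequence above can be replayed with a small Python model. This is a hypothetical sketch, not YARN code: the classes `NodeManager` and `ResourceManager` and their methods are invented for illustration, and the model assumes (as step 5 implies) that the RM's record of lost nodes does not survive an RM restart. The 10-minute expiry in step 1 matches the default of `yarn.nm.liveness-monitor.expiry-interval-ms` (600000 ms); the model skips the timer and expires NM1 directly. It shows the same container ID reaching the AM twice.

```python
from collections import Counter


class NodeManager:
    """Hypothetical NM model: with NM recovery, c1 stays in the NM context across restarts."""

    def __init__(self, node_id, containers):
        self.node_id = node_id
        self.context = set(containers)  # finished containers not yet acknowledged by the RM


class ResourceManager:
    """Hypothetical RM model of the code paths described in the report."""

    def __init__(self, am_inbox):
        self.am_inbox = am_inbox  # completions delivered to the AM; outlives the RM
        self.lost_nodes = set()   # assumed in-memory only, so empty after an RM restart
        self.containers = {}      # container id -> node id

    def launch(self, cid, node_id):
        self.containers[cid] = node_id

    def expire_node(self, node_id):
        # Step 2: the node is marked lost and its containers are reported completed.
        self.lost_nodes.add(node_id)
        for cid, nid in list(self.containers.items()):
            if nid == node_id:
                del self.containers[cid]
                self.am_inbox.append(cid)
        # The "remove c1" instruction for the NM is never delivered: the NM is down.

    def register_node(self, nm):
        if nm.node_id in self.lost_nodes:
            return  # Step 3: containers in the registration request are ignored.
        for cid in sorted(nm.context):
            # Step 5: the recovered, already-finished container is reported again.
            self.am_inbox.append(cid)


def reproduce():
    am_inbox = []
    rm = ResourceManager(am_inbox)
    rm.launch("c1", "NM1")
    rm.expire_node("NM1")             # steps 1-2
    nm1 = NodeManager("NM1", ["c1"])  # NM1 restarts; NM recovery keeps c1
    rm.register_node(nm1)             # step 3: ignored because NM1 is lost
    # Step 4: heartbeats do not carry c1, so nm1.context still holds it.
    rm = ResourceManager(am_inbox)    # step 5: RM restart with empty lost_nodes
    rm.register_node(nm1)
    return Counter(am_inbox)


print(reproduce())  # Counter({'c1': 2})
```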
Once the Spark AM receives a container-completed message from the RM, it allocates a new container, so the duplicate completion for c1 causes an extra container to be allocated.
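An AM can limit the impact of such duplicates by handling completions idempotently. The sketch below is a hypothetical Python illustration, not Spark's allocator and not the fix for this issue: `ReplacementTracker` and `on_completed` are invented names, and incrementing a counter stands in for asking the RM for a new container.

```python
class ReplacementTracker:
    """Hypothetical AM-side guard: request at most one replacement per completed container."""

    def __init__(self):
        self.seen_completed = set()      # container IDs already handled
        self.replacements_requested = 0  # stand-in for real container requests

    def on_completed(self, container_ids):
        for cid in container_ids:
            if cid in self.seen_completed:
                # Duplicate completion (e.g. re-sent after an RM restart): ignore it.
                continue
            self.seen_completed.add(cid)
            self.replacements_requested += 1
        return self.replacements_requested


tracker = ReplacementTracker()
tracker.on_completed(["c1"])  # step 2: first completion for c1
tracker.on_completed(["c1"])  # step 5: duplicate completion for c1
print(tracker.replacements_requested)  # 1
```

Keying on the container ID works here because the duplicate in step 5 refers to the same container c1. The `seen_completed` set grows for the lifetime of the AM, which a real implementation might need to bound.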

This message was sent by Atlassian JIRA

