hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
Date Sat, 11 Jan 2014 22:15:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868891#comment-13868891 ]

Bikas Saha commented on YARN-1489:
----------------------------------

We need to come to a conclusion on how to let the containers also find out about the
new AM.
Something we have discussed in the past:
1) The new AM, upon registering, provides a payload to the RM.
2) The RM syncs the payload with the NMs on heartbeat. RM and NM already sync on running
application state, so this payload could piggyback on that.
3) A container on an NM could query the NM about its own AM's payload. This local API could
be secured by a local token and made available only to containers running on the local node.
4) The containers would use this payload to reconnect with the AM (in case systems
don't use external solutions like ZooKeeper for such tracking).

This sounds reasonably lightweight, scalable and self-contained. All the interested parties
would be informed within a 2*(NmHeartbeat) time interval.
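
To make the four steps above concrete, here is a toy in-memory sketch of the flow. It is
only an illustration under stated assumptions: the class and method names (ResourceManager,
NodeManager, register_am, heartbeat_response, get_am_payload) are hypothetical and are not
YARN APIs, the payload is treated as an opaque string, and the "local token" is just a
random value handed to the container at launch.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    # app_id -> opaque payload supplied by the most recently registered AM
    am_payloads: dict = field(default_factory=dict)

    def register_am(self, app_id, payload):
        # Step 1: the new AM hands its payload to the RM when it registers.
        self.am_payloads[app_id] = payload

    def heartbeat_response(self, running_app_ids):
        # Step 2: piggyback payloads for the apps this NM reports as running.
        return {
            app_id: self.am_payloads[app_id]
            for app_id in running_app_ids
            if app_id in self.am_payloads
        }


@dataclass
class NodeManager:
    rm: ResourceManager
    # app_id -> last payload received from the RM
    payload_cache: dict = field(default_factory=dict)
    # container_id -> (app_id, local token); hypothetical bookkeeping
    containers: dict = field(default_factory=dict)

    def launch_container(self, container_id, app_id):
        # The token stands in for the "local token" mentioned in step 3.
        token = secrets.token_hex(16)
        self.containers[container_id] = (app_id, token)
        return token

    def heartbeat(self):
        # Report running apps and cache whatever payloads the RM sends back.
        running = {app_id for app_id, _ in self.containers.values()}
        self.payload_cache.update(self.rm.heartbeat_response(running))

    def get_am_payload(self, container_id, token):
        # Step 3: only a local container holding the right token may ask,
        # and it only ever sees the payload of its own application's AM.
        entry = self.containers.get(container_id)
        if entry is None or not secrets.compare_digest(entry[1], token):
            raise PermissionError("unknown container or bad local token")
        return self.payload_cache.get(entry[0])


if __name__ == "__main__":
    rm = ResourceManager()
    nm = NodeManager(rm)
    token = nm.launch_container("container_1", "app_1")

    rm.register_am("app_1", "am-host-2:4242")  # restarted AM registers
    nm.heartbeat()                             # payload reaches the NM
    # Step 4: the container would use this value to reconnect to the new AM.
    print(nm.get_am_payload("container_1", token))
```

In this sketch the container sees nothing until the NM's next heartbeat after the AM
registers, which is the kind of delay the heartbeat-based bound above refers to.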

> [Umbrella] Work-preserving ApplicationMaster restart
> ----------------------------------------------------
>
>                 Key: YARN-1489
>                 URL: https://issues.apache.org/jira/browse/YARN-1489
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: Work preserving AM restart.pdf
>
>
> Today if AMs go down,
>  - RM kills all the containers of that ApplicationAttempt
>  - New ApplicationAttempt doesn't know where the previous containers are running
>  - Old running containers don't know where the new AM is running.
> We need to fix this to enable work-preserving AM restart. The latter two can potentially
> be done at the app level, but it is good to have a common solution for all apps wherever
> possible.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
