hadoop-yarn-issues mailing list archives

From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
Date Mon, 30 Dec 2013 22:58:51 GMT

    [ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859156#comment-13859156
] 

Zhijie Shen commented on YARN-1489:
-----------------------------------

Thanks, Vinod, for the proposal. I had one thought on reading the following point:

bq. In case of apps like MapReduce where containers need to communicate directly with AMs,
the old running-containers don’t know where the new ApplicationMaster is running and how
to reach it (service addresses).

While the AM is restarting, containers in some applications may still try to send messages
to it, and those messages may be lost. Would it be good to buffer the outstanding messages
and send them to the new AM when the containers rebind?
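
To make the buffering idea concrete, here is a minimal sketch of what a container-side
sender could look like. Everything in it is hypothetical: the class BufferingAmSender, its
methods, and the use of a plain Consumer<String> as the "link" to the AM are illustrative
assumptions, not part of any YARN API or of the attached proposal. It only shows the
buffer-then-flush-on-rebind behavior, with a bounded buffer so a long AM outage cannot
exhaust container memory.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical container-side sender; not an actual YARN class. */
class BufferingAmSender {
    // Messages produced while no AM is reachable, kept in arrival order.
    private final Queue<String> pending = new ArrayDeque<>();
    private final int capacity;
    // Stand-in for the connection to the AM; null while the AM is down.
    private Consumer<String> amLink;

    BufferingAmSender(int capacity) {
        this.capacity = capacity;
    }

    /**
     * Delivers the message immediately if an AM is bound, otherwise buffers it.
     * Returns false only when the message was dropped because the buffer is full.
     */
    synchronized boolean send(String message) {
        if (amLink != null) {
            amLink.accept(message);
            return true;
        }
        if (pending.size() >= capacity) {
            return false;
        }
        pending.add(message);
        return true;
    }

    /** Called when the old AM is known to be gone. */
    synchronized void unbind() {
        amLink = null;
    }

    /**
     * Called once the new AM's address is known. Flushes buffered messages
     * in FIFO order and returns how many were flushed.
     */
    synchronized int rebind(Consumer<String> newAmLink) {
        amLink = newAmLink;
        int flushed = 0;
        String message;
        while ((message = pending.poll()) != null) {
            newAmLink.accept(message);
            flushed++;
        }
        return flushed;
    }
}
```

A bounded FIFO queue is only one possible policy. An application could instead block the
sender, or drop the oldest message, depending on whether its messages are idempotent.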

> [Umbrella] Work-preserving ApplicationMaster restart
> ----------------------------------------------------
>
>                 Key: YARN-1489
>                 URL: https://issues.apache.org/jira/browse/YARN-1489
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: Work preserving AM restart.pdf
>
>
> Today if AMs go down,
>  - RM kills all the containers of that ApplicationAttempt
>  - New ApplicationAttempt doesn't know where the previous containers are running
>  - Old running containers don't know where the new AM is running.
> We need to fix this to enable work-preserving AM restart. The latter two potentially can
> be done at the app level, but it is good to have a common solution for all apps wherever
> possible.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
