hadoop-yarn-issues mailing list archives

From "Omkar Vinit Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
Date Mon, 04 Nov 2013 18:56:19 GMT

    [ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813109#comment-13813109 ]

Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------

Completely removed the RECOVERED state; the rest of the patch is the same. The only major
differences are:
* Before launching a new appAttempt, the RM checks whether any of the application's attempts
were running before the restart. If so, the RM waits instead of starting a new application
attempt. If no application attempt is found in a running state (anything other than a final
state), the RM launches a new application attempt.
* When the NodeManager receives the resync signal, it kills all running containers and then
reports the killed containers back to the RM when it re-registers with the RM. On receiving
the container information, the RM checks whether any of the reported containers is an AM
container. If so, it sends a container_failed event to the related app attempt, which
eventually leads to a new application attempt being started.
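
The two checks described above can be sketched roughly as follows. This is only an illustration of the decision logic, not code from the patch: the class, enum, and method names (RecoverySketch, AttemptState, onRecovery, attemptsToFail) and the container and attempt IDs are all hypothetical, and the real RM state machine has more states and events than shown here.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these types are real YARN classes.
class RecoverySketch {

    // Hypothetical, simplified attempt states (not YARN's actual state enum).
    enum AttemptState { LAUNCHED, RUNNING, FINISHED, FAILED, KILLED }

    // Assumed set of final states; anything else is treated as "running".
    static final Set<AttemptState> FINAL_STATES =
            EnumSet.of(AttemptState.FINISHED, AttemptState.FAILED, AttemptState.KILLED);

    enum Decision { WAIT_FOR_PREVIOUS_ATTEMPT, LAUNCH_NEW_ATTEMPT }

    // First bullet: wait if any recovered attempt is not in a final state,
    // otherwise it is safe to launch a new attempt.
    static Decision onRecovery(List<AttemptState> recoveredAttempts) {
        for (AttemptState state : recoveredAttempts) {
            if (!FINAL_STATES.contains(state)) {
                return Decision.WAIT_FOR_PREVIOUS_ATTEMPT;
            }
        }
        return Decision.LAUNCH_NEW_ATTEMPT;
    }

    // Second bullet: given the containers an NM reports as killed after resync,
    // return the attempts whose AM container is among them. In the real RM each
    // of these would receive a container_failed event.
    static List<String> attemptsToFail(List<String> killedContainerIds,
                                       Map<String, String> amContainerToAttempt) {
        List<String> attempts = new ArrayList<>();
        for (String containerId : killedContainerIds) {
            String attemptId = amContainerToAttempt.get(containerId);
            if (attemptId != null) {
                attempts.add(attemptId);
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println(onRecovery(List.of(AttemptState.FAILED, AttemptState.RUNNING)));
        System.out.println(onRecovery(List.of(AttemptState.FAILED)));
        System.out.println(attemptsToFail(
                List.of("container_1", "container_7"),
                Map.of("container_1", "attempt_1")));
    }
}
```

In this sketch an application with no recovered attempts falls through to LAUNCH_NEW_ATTEMPT, which matches the "no attempts found in running state" case above.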

> During RM restart, RM should start a new attempt only when previous attempt exits for real
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Omkar Vinit Joshi
>         Attachments: YARN-1210.1.patch, YARN-1210.2.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then kill them
> forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after
> waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with
> each other. This can help with issues in downstream components like Pig, Hive and Oozie
> during RM restart.
> In the meantime, new apps will proceed as usual while existing apps wait for recovery.
> This can continue to be useful after work-preserving restart, so that AMs which can properly
> sync back up with RM can continue to run and those that don't are guaranteed to be killed
> before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
