hadoop-yarn-issues mailing list archives

From "Omkar Vinit Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
Date Thu, 17 Oct 2013 23:46:44 GMT

    [ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798620#comment-13798620
] 

Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------

Summarizing the current patch.
* After the RMAppAttempts are recovered, all of the attempts are moved into the LAUNCHED state.
After YARN-891 we will know the final state of the application attempts that finished earlier,
so based on that we can decide which state the current app attempt should transition to. On
the RECOVER event:
** It will move to the LAUNCHED state if it was the last running app attempt.
** It will move to FAILED / KILLED / another terminal application attempt state otherwise.
* When an NM RESYNCs, its containers will be killed and the NM will then re-register with the
RM, passing along the containers that were already running. On the RM side, if any of those
containers turns out to be the earlier AM container, we will fail that app attempt and
immediately start a new one. However, if we don't get the AM's finished containerId during a
future NM registration, then after some time the AMLivelinessMonitor will expire the running
app attempt, fail it, and start a new one.
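
The decisions summarized above can be sketched roughly as follows. This is only an illustration, not code from YARN-1210.1.patch: the class RmRecoverySketch, its two enums, and the three methods are hypothetical names, and the sketch assumes that an attempt with no recorded final state is the last running attempt.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified model of the recovery decisions described in the comment.
// None of these types are the real ResourceManager classes.
class RmRecoverySketch {

    // Simplified stand-in for the app attempt states named in the comment.
    enum AttemptState { LAUNCHED, FINISHED, FAILED, KILLED }

    // What the RM does with a recovered attempt that is still considered running.
    enum Action { FAIL_AND_START_NEW_ATTEMPT, KEEP_WAITING }

    // RECOVER event: an attempt with a recorded terminal state returns to that state.
    // Assumption: an attempt with no recorded final state is the last running attempt,
    // so it moves to LAUNCHED.
    static AttemptState onRecover(Optional<AttemptState> storedFinalState) {
        return storedFinalState.orElse(AttemptState.LAUNCHED);
    }

    // NM re-registration after RESYNC: if the NM reports the earlier AM container,
    // the attempt is failed and a new one is started right away.
    static Action onNodeReRegistered(String amContainerId, Set<String> reportedContainerIds) {
        return reportedContainerIds.contains(amContainerId)
                ? Action.FAIL_AND_START_NEW_ATTEMPT
                : Action.KEEP_WAITING;
    }

    // Fallback: if the AM container is never reported, the liveness expiry interval
    // eventually fails the attempt and triggers a new one.
    static Action onLivenessCheck(long millisSinceRecovery, long expiryIntervalMillis) {
        return millisSinceRecovery >= expiryIntervalMillis
                ? Action.FAIL_AND_START_NEW_ATTEMPT
                : Action.KEEP_WAITING;
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L; // expiry interval quoted in the issue description
        System.out.println(onRecover(Optional.empty()));
        System.out.println(onRecover(Optional.of(AttemptState.KILLED)));
        System.out.println(onNodeReRegistered("am_container", Set.of("am_container", "task_1")));
        System.out.println(onLivenessCheck(tenMinutes, tenMinutes));
    }
}
```

The container reported by the NM and the liveness expiry are kept as separate checks because the comment describes them as two independent paths to the same outcome: the first is immediate, the second is the worst-case fallback.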


> During RM restart, RM should start a new attempt only when previous attempt exits for real
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Omkar Vinit Joshi
>         Attachments: YARN-1210.1.patch
>
>
> When the RM recovers, it can wait for existing AMs to contact the RM back and then kill them
> forcefully before even starting a new AM. Worst case, the RM will start a new AppAttempt after
> waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with
> each other. This can help avoid issues with downstream components like Pig, Hive and Oozie
> during RM restart.
> In the meantime, new apps will proceed as usual while existing apps wait for recovery.
> This can continue to be useful after work-preserving restart, so that AMs which can properly
> sync back up with the RM can continue to run and those that don't are guaranteed to be killed
> before a new attempt is started.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
