hadoop-yarn-issues mailing list archives

From "Omkar Vinit Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
Date Thu, 14 Nov 2013 23:23:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823100#comment-13823100 ]

Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------

Thanks [~vinodkv]
bq. cleanupContainersOnNMResync: We are no longer making the call to getNodeStatusAndUpdateContainersInContext,
can you please put a comment as to why - I believe this is so that NodeStatusUpdater can eventually
take these statuses up when it reregisters.
Yes, they are used when the NM re-registers with the RM. I added a comment.

bq. use getContainerState instead of cloneAndGetContainerStatus?
They are different.

bq. Use RegisterNodeManagerRequest.newInstance() in registerWithRM?
bq. Similarly NodeStatus.newInstance, NodeHealthStatus.newInstance?
They were missing; I added them and fixed NodeStatusUpdater.

bq. As of now because we kill all containers it's fine, but it's better to explicitly check
for master-container's state during registration and then only send the event.
bq. Also put a comment as to why we are directly faking RMAppAttemptContainerFinishedEvent
instead of informing RMContainerImpl.
But we don't know about the container today, right?

bq. Instead of sending and ignoring ATTEMPT_FAILED at FAILED state, we can skip sending this
event by RMAppAttempt if the app was already in a final state?
OK. Should I also remove the similar transitions from FINISHED / KILLED?
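
To make the suggestion concrete, here is a minimal sketch of the "skip the event when the app is already final" check. It is not code from the patch: the `FinalStateGuard` class, the `AppState` enum and `shouldSendAttemptFailed` are hypothetical names used only for illustration, and the set of final states is an assumption based on the states named in this thread.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; not the actual RMAppAttempt / RMApp code.
final class FinalStateGuard {

    // Simplified stand-in for the application states discussed in this thread.
    enum AppState { NEW, RUNNING, FINISHED, FAILED, KILLED }

    // Assumption: these three states are the final ones.
    static final Set<AppState> FINAL_STATES =
        EnumSet.of(AppState.FINISHED, AppState.FAILED, AppState.KILLED);

    // Returns true only if the app can still react to ATTEMPT_FAILED,
    // so the sender can skip the event instead of the receiver ignoring it.
    static boolean shouldSendAttemptFailed(AppState current) {
        return !FINAL_STATES.contains(current);
    }
}
```

With a check like this on the sending side, the "ignore ATTEMPT_FAILED" transitions at FAILED, FINISHED and KILLED would no longer be needed.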

Addressed all other comments.

> During RM restart, RM should start a new attempt only when previous attempt exits for real
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Omkar Vinit Joshi
>         Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. In the worst case, RM will start a new AppAttempt after waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help with issues in downstream components like Pig, Hive and Oozie during RM restart.
> In the meantime, new apps will proceed as usual while existing apps wait for recovery.
> This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt.
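
The decision rule in the description above can be sketched roughly as follows. This is an illustrative model, not the patch itself: `RecoveryGate`, `Decision` and `decide` are hypothetical names, and the 10-minute value is taken from the description rather than from any configuration key.

```java
import java.time.Duration;

// Hypothetical sketch of the recovery rule; not actual ResourceManager code.
final class RecoveryGate {

    enum Decision { WAIT, START_NEW_ATTEMPT }

    // Assumption: the expiry interval is the 10 minutes quoted in the description.
    static final Duration DEFAULT_EXPIRY = Duration.ofMinutes(10);

    // previousAttemptExited: the old AM is known to be gone (it exited,
    // or it contacted the RM and was killed).
    // sinceRestart: time elapsed since the RM recovered.
    static Decision decide(boolean previousAttemptExited,
                           Duration sinceRestart,
                           Duration expiryInterval) {
        if (previousAttemptExited) {
            // Previous attempt exited for real: safe to start a new one.
            return Decision.START_NEW_ATTEMPT;
        }
        if (sinceRestart.compareTo(expiryInterval) >= 0) {
            // Worst case: the old AM never reported back within the expiry interval.
            return Decision.START_NEW_ATTEMPT;
        }
        // Otherwise keep waiting so two AMs do not race with each other.
        return Decision.WAIT;
    }
}
```

For example, `RecoveryGate.decide(false, Duration.ofMinutes(3), RecoveryGate.DEFAULT_EXPIRY)` yields `WAIT`, while the same call with 10 or more elapsed minutes yields `START_NEW_ATTEMPT`.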



--
This message was sent by Atlassian JIRA
(v6.1#6144)
