hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing
Date Wed, 13 Jan 2016 01:21:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095389#comment-15095389 ]

Jun Gong commented on YARN-4497:

[~jianhe] Thanks for the review and comments.

for the patch, I think making below change in RMAppImpl#recover may be enough ?
There might be some problems:
1. *appState.attempts.keySet()* is not sorted by attempt ID; however, we need to recover the
attempts in order, because we use *currentAttempt* to get the AMBlacklist and we call
*getNumFailedAppAttempts()* in *createNewAttempt()*.
2. We need to update *nextAttemptId* after recovering the attempts.
3. We need to deal with case 2 from the previous comment: the attempt's final state is missing
(storing its final state failed). Otherwise RM will relaunch this attempt: it will be in the
*LAUNCHED* state after recovery and will time out (the attempt has already failed), and then
RM will relaunch it.
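
To make the three points concrete, here is a minimal, self-contained sketch. It is not the actual RMAppImpl code: the class *AttemptRecoverySketch*, its *Result* holder and the plain map of attempt ID to final state are hypothetical stand-ins, and a null final state is assumed to mean "final state was never stored".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the recovery loop; not the real RMAppImpl.
class AttemptRecoverySketch {

    // Holds what the sketch computes while walking the stored attempts.
    static final class Result {
        final List<Integer> recoveredOrder = new ArrayList<>();
        final List<Integer> missingFinalState = new ArrayList<>();
        int nextAttemptId = 1;
    }

    // stored: attempt ID -> final state, where null is assumed to mean
    // that the final state was never written to the store.
    static Result recover(Map<Integer, String> stored) {
        Result result = new Result();
        // Point 1: copy into a TreeMap so iteration is in ascending attempt-ID order,
        // regardless of the iteration order of the original key set.
        for (Map.Entry<Integer, String> e : new TreeMap<>(stored).entrySet()) {
            int attemptId = e.getKey();
            result.recoveredOrder.add(attemptId);
            // Point 3: remember attempts without a final state so the caller can
            // decide how to treat them instead of silently relaunching them.
            if (e.getValue() == null) {
                result.missingFinalState.add(attemptId);
            }
            // Point 2: keep the next attempt ID ahead of every recovered ID.
            result.nextAttemptId = Math.max(result.nextAttemptId, attemptId + 1);
        }
        return result;
    }
}
```

With attempts 3, 1 and 2 inserted in that order and attempt 2 lacking a final state, the sketch recovers them as 1, 2, 3, flags attempt 2, and leaves the next attempt ID at 4. How a flagged attempt should actually be handled is left open here.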

> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>            Priority: Critical
>         Attachments: YARN-4497.01.patch
> Found the following problem while discussing YARN-3480.
> If RM fails to store some attempts in RMStateStore, those attempts will be missing from
> RMStateStore. For example, when storing attempt1, attempt2 and attempt3, RM successfully
> stores attempt1 and attempt3 but fails to store attempt2. When RM restarts, in
> *RMAppImpl#recover* we recover attempts one by one; in this case we recover attempt1, then
> attempt2. When recovering attempt2, we call
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks up its
> ApplicationAttemptStateData but cannot find it, so an error is raised at
> *assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880).
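
The failure mode described above can be reproduced in a few lines. This is a sketch only: *MissingAttemptSketch* and its map of attempt ID to stored state are hypothetical, and it assumes that recovery walks consecutive attempt IDs starting at 1, once per stored attempt. The null check shown is one possible guard, not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of the scenario; not the real RMAppImpl/RMAppAttemptImpl.
class MissingAttemptSketch {

    // Walks attempt IDs 1..storedStates.size() and returns the IDs
    // for which no stored state could be found.
    static List<Integer> findMissingDuringRecovery(Map<Integer, String> storedStates) {
        List<Integer> missing = new ArrayList<>();
        for (int attemptId = 1; attemptId <= storedStates.size(); attemptId++) {
            String attemptState = storedStates.get(attemptId);
            if (attemptState == null) {
                // This is where an unguarded `assert attemptState != null` would fail.
                missing.add(attemptId);
                continue;
            }
            // A real implementation would restore the attempt from attemptState here.
        }
        return missing;
    }
}
```

With only attempt1 and attempt3 stored, the loop visits IDs 1 and 2, reports attempt2 as missing, and never reaches attempt3, which matches the description.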
