hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
Date Sat, 19 Dec 2015 00:21:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065043#comment-15065043 ]

Jun Gong commented on YARN-3480:

Thanks for the review and suggestions!

> Regarding this logic, it is possible that a particular attempt is not persisted in the store
> because of some connection failures, so app.nextAttemptId - app.firstAttemptIdInStateStore
> does not necessarily indicate the number of attempts.

If RMStateStore fails to persist an attempt, it transitions to the state 'RMStateStoreState.FENCED',
and no further operations are performed while it is in that state. So this should not be
a problem?
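
To make the argument above concrete, here is a minimal sketch of the behaviour described, under the assumption that a failed write fences the store and that a fenced store ignores all later operations. The class FencedStoreSketch, its methods, and the simulateFailure flag are hypothetical names for illustration only; this is not the RMStateStore API or its implementation.

```java
import java.util.TreeSet;

// Hypothetical, simplified model of the behaviour described in this comment;
// it is NOT the actual RMStateStore implementation.
class FencedStoreSketch {
    enum State { ACTIVE, FENCED }

    private State state = State.ACTIVE;
    private final TreeSet<Integer> persistedAttemptIds = new TreeSet<>();

    // Returns true only if the attempt was persisted.
    // simulateFailure stands in for a connection failure during the write.
    boolean storeAttempt(int attemptId, boolean simulateFailure) {
        if (state == State.FENCED) {
            return false; // a fenced store performs no further operations
        }
        if (simulateFailure) {
            state = State.FENCED; // the first failed write fences the store
            return false;
        }
        persistedAttemptIds.add(attemptId);
        return true;
    }

    State getState() {
        return state;
    }

    int persistedCount() {
        return persistedAttemptIds.size();
    }

    // True when the persisted ids form one gap-free range, i.e. when
    // (last id - first id + 1) equals the number of persisted attempts.
    boolean isContiguous() {
        if (persistedAttemptIds.isEmpty()) {
            return true;
        }
        int span = persistedAttemptIds.last() - persistedAttemptIds.first() + 1;
        return span == persistedAttemptIds.size();
    }

    public static void main(String[] args) {
        FencedStoreSketch store = new FencedStoreSketch();
        store.storeAttempt(1, false);
        store.storeAttempt(2, false);
        store.storeAttempt(3, true);  // simulated connection failure
        store.storeAttempt(4, false); // ignored because the store is fenced
        System.out.println("state=" + store.getState()
                + " persisted=" + store.persistedCount()
                + " contiguous=" + store.isContiguous());
    }
}
```

In this model a failed write can never be followed by a successful one, so the persisted ids cannot contain a hole in the middle.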

> LevelDBRMStateStore#removeApplicationAttemptInternal does not need to use a batch operation,
> as it only has one operation.
>
> Could you also add a test case in RMStateStoreTestBase#testRMAppStateStore to check that the loading
> part also works correctly, i.e. that loading an app with partial attempts works correctly?

Thanks, I will fix them.

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch, YARN-3480.04.patch,
YARN-3480.05.patch, YARN-3480.06.patch, YARN-3480.07.patch, YARN-3480.08.patch, YARN-3480.09.patch,
> When RM HA is enabled and running containers are kept across attempts, apps are more
> likely to finish successfully with more retries (attempts), so it is better to set
> 'yarn.resourcemanager.am.max-attempts' larger. However, this makes RMStateStore (FileSystem/HDFS/ZK)
> store more attempts and makes the RM recovery process much slower. It might be better to
> set a maximum number of attempts to be stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small
> value, the number of retried attempts might be very large. So we need to delete some of the
> attempts stored in RMStateStore.
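
The idea in the description, capping how many attempts are kept in the store, could look roughly like the sketch below. It assumes an in-memory map as a stand-in for the store and a hypothetical maxStoredAttempts setting; the class and method names are invented for illustration and do not come from the patches attached to this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a state store; not the RMStateStore API.
class AttemptRetentionSketch {
    // attempt id -> stored attempt data, ordered by attempt id
    private final TreeMap<Integer, String> storedAttempts = new TreeMap<>();
    private final int maxStoredAttempts; // assumed cap, hypothetical setting
    private int nextAttemptId = 1;

    AttemptRetentionSketch(int maxStoredAttempts) {
        if (maxStoredAttempts < 1) {
            throw new IllegalArgumentException("maxStoredAttempts must be >= 1");
        }
        this.maxStoredAttempts = maxStoredAttempts;
    }

    // Stores a new attempt, then evicts the oldest attempts until the cap holds.
    // Returns the ids of the evicted attempts.
    List<Integer> storeNewAttempt(String data) {
        storedAttempts.put(nextAttemptId++, data);
        List<Integer> evicted = new ArrayList<>();
        while (storedAttempts.size() > maxStoredAttempts) {
            evicted.add(storedAttempts.pollFirstEntry().getKey());
        }
        return evicted;
    }

    // Oldest attempt id still in the store, or nextAttemptId if the store is empty.
    int firstAttemptIdInStore() {
        return storedAttempts.isEmpty() ? nextAttemptId : storedAttempts.firstKey();
    }

    int storedCount() {
        return storedAttempts.size();
    }

    public static void main(String[] args) {
        AttemptRetentionSketch store = new AttemptRetentionSketch(3);
        for (int i = 0; i < 5; i++) {
            System.out.println("evicted: " + store.storeNewAttempt("attempt-data"));
        }
        System.out.println("stored=" + store.storedCount()
                + " first=" + store.firstAttemptIdInStore());
    }
}
```

With a cap like this, recovery only has to load at most maxStoredAttempts attempts per app, however many retries the app has gone through.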

This message was sent by Atlassian JIRA
