hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
Date Tue, 15 Dec 2015 07:35:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057501#comment-15057501 ]

Jun Gong commented on YARN-3480:

[~jianhe] Thanks for the review and suggestions.

> How about removing the attempts that are beyond the max-allowed-attempts instead of the ones
> beyond the validity interval? This way, we can keep a more reasonable amount of history.

OK. Earlier patches did it this way. max-allowed-attempts then becomes a global
hard limit.
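
To make the "global hard limit" idea concrete, here is a minimal sketch of count-based pruning. It is not taken from any YARN-3480 patch: the class name, the method name, and the use of plain integer attempt IDs are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; this is not RMStateStore or RMAppImpl code.
final class AttemptHistoryPruner {
    private AttemptHistoryPruner() {}

    /**
     * Returns the attempt IDs that would have to be deleted so that at most
     * maxAllowedAttempts remain. attemptIds must be ordered oldest first.
     */
    static List<Integer> attemptsToRemove(List<Integer> attemptIds, int maxAllowedAttempts) {
        if (maxAllowedAttempts < 1) {
            throw new IllegalArgumentException("maxAllowedAttempts must be >= 1");
        }
        int excess = attemptIds.size() - maxAllowedAttempts;
        if (excess <= 0) {
            return new ArrayList<>(); // nothing to prune
        }
        // The oldest attempts sit at the front of the list, so drop those.
        return new ArrayList<>(attemptIds.subList(0, excess));
    }

    public static void main(String[] args) {
        List<Integer> stored = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(attemptsToRemove(stored, 3)); // prints [1, 2]
    }
}
```

With this rule the amount of stored history depends only on the configured limit, not on how quickly attempts fail.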

> Instead of introducing the dummyAttempt in the RMApp, we can change the caller to always find
> the current attempt for the container by using the AbstractYarnScheduler#getCurrentAttemptForContainer
> API. This way, the container events can be routed to the current attempt instead of the old one.

The current attempt might be in any state, and in some states it cannot handle certain container events. For example, when
the attempt is in RMAppAttemptState.NEW, it cannot handle RMAppAttemptEventType.CONTAINER_FINISHED.
To avoid making the attempt's state transitions more complex, we introduce 'dummyAttempt':
it is in a final state (because it is a finished attempt), e.g. RMAppAttemptState.FAILED, and
it can handle any RMAppAttemptEventType.* event. Is that OK?
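
The difference between the two routing options can be sketched with a toy state model. Everything below is a simplified assumption: the enums only borrow a few names from RMAppAttemptState and RMAppAttemptEventType, the transition table is invented for illustration, and the lookup by attempt ID is a guess at how events for removed attempts would reach the dummy attempt.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of the idea above; this is not RMAppAttemptImpl.
final class DummyAttemptSketch {
    enum State { NEW, RUNNING, FAILED }
    enum EventType { START, CONTAINER_FINISHED }

    // Placeholder standing in for attempts that were removed; it sits in a final state.
    static final State DUMMY_ATTEMPT_STATE = State.FAILED;

    // Assumed transition table: which events each state accepts in this toy model.
    static Set<EventType> accepted(State state) {
        switch (state) {
            case NEW:
                return EnumSet.of(EventType.START);
            case RUNNING:
                return EnumSet.of(EventType.CONTAINER_FINISHED);
            default:
                // Final state: every event is accepted (and effectively ignored).
                return EnumSet.allOf(EventType.class);
        }
    }

    static boolean canHandle(State state, EventType event) {
        return accepted(state).contains(event);
    }

    // Attempts still in memory are looked up by ID; anything else maps to the dummy.
    static State resolve(Map<Integer, State> attemptsInMemory, int attemptId) {
        State state = attemptsInMemory.get(attemptId);
        return state != null ? state : DUMMY_ATTEMPT_STATE;
    }

    public static void main(String[] args) {
        Map<Integer, State> attemptsInMemory = new HashMap<>();
        attemptsInMemory.put(7, State.NEW); // current attempt, just created

        // Routing a late container event to the current attempt would be rejected.
        System.out.println(canHandle(resolve(attemptsInMemory, 7), EventType.CONTAINER_FINISHED)); // prints false
        // Attempt 3 was removed, so its event lands on the dummy attempt.
        System.out.println(canHandle(resolve(attemptsInMemory, 3), EventType.CONTAINER_FINISHED)); // prints true
    }
}
```

The point of the placeholder is that no new transitions are needed on live attempts: the final state already tolerates every event.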

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch, YARN-3480.04.patch,
>                      YARN-3480.05.patch, YARN-3480.06.patch
> When RM HA is enabled and running containers are kept across attempts, apps are more
> likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts'
> to a larger value. However, this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes
> the RM recovery process much slower. It might be better to limit the maximum number of attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small
> value, the number of retried attempts might become very large. So we need to delete some of the attempts stored in
> RMStateStore.
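
Why a small validity interval lets the attempt count grow can be seen in a small model. This is a sketch under the assumption that only failures finishing inside the most recent interval count against the max-attempts limit, and that a non-positive interval disables the window; the class and method names are hypothetical.

```java
// Hypothetical model of a failure validity window; not code from YARN.
final class ValidityWindowSketch {
    private ValidityWindowSketch() {}

    /**
     * Counts the failures that would still be held against the application.
     * A non-positive interval is treated as "window disabled", so every failure counts.
     */
    static int failuresCounted(long[] failureTimesMs, long nowMs, long validityIntervalMs) {
        if (validityIntervalMs <= 0) {
            return failureTimesMs.length;
        }
        int counted = 0;
        for (long finishedAt : failureTimesMs) {
            // Only failures inside the most recent interval are counted.
            if (finishedAt > nowMs - validityIntervalMs) {
                counted++;
            }
        }
        return counted;
    }

    public static void main(String[] args) {
        // Ten failed attempts, one per minute.
        long[] failures = new long[10];
        for (int i = 0; i < failures.length; i++) {
            failures[i] = i * 60_000L;
        }
        long now = failures[failures.length - 1];

        // With a 30 second window only the latest failure counts, yet all ten attempts are stored.
        System.out.println(failuresCounted(failures, now, 30_000L)); // prints 1
        System.out.println(failuresCounted(failures, now, -1L));     // prints 10
    }
}
```

In this model the counted failures never reach the limit, so the application keeps retrying and the number of stored attempts keeps growing.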
